Pseudo-absence generation

abil.pseudo_generation.generate_pseudo_absences(merged_df, missing_rows, env_vars, species_cols, absence_ratio=1, aoa_threshold=0.99, min_presence=100, allow_replacement=True)

Generate pseudo-absences for each species at a specified ratio to presences.

Pseudo-absences are sampled from rows without observations that fall outside the area of applicability (AOA) estimated from each species’ observed rows. abil.analyze.area_of_applicability() marks rows inside the AOA as 0 and rows outside the AOA as 1, so this function samples candidates where the AOA mask equals 1.
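The mask convention can be illustrated with a minimal sketch. It uses plain Python lists in place of DataFrame rows, and the names candidate_ids, aoa_mask, and outside_aoa are hypothetical; it is not the library's implementation, only an illustration of filtering on the documented 0/1 encoding before sampling.

```python
import random

# Hypothetical stand-ins: candidate row identifiers and an AOA mask of the
# same length, using the documented encoding (0 = inside AOA, 1 = outside AOA).
candidate_ids = ["r0", "r1", "r2", "r3", "r4", "r5"]
aoa_mask = [0, 1, 1, 0, 1, 0]

# Keep only candidates that fall outside the AOA (mask value 1).
outside_aoa = [rid for rid, flag in zip(candidate_ids, aoa_mask) if flag == 1]

# Draw two pseudo-absences from the outside-AOA pool without replacement.
rng = random.Random(0)  # seeded so the sketch is repeatable
sampled = rng.sample(outside_aoa, k=2)
```

Rows with a mask value of 0 never enter the pool, so every sampled pseudo-absence lies outside the AOA.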

Parameters:
  • merged_df (pandas.DataFrame) – Merged observation and environmental data.

  • missing_rows (pandas.DataFrame) – Environmental rows without observations that can be sampled as pseudo-absence candidates.

  • env_vars (list of str) – Environmental variable names. Coordinate variables named time, depth, lat, and lon are excluded from the AOA feature set.

  • species_cols (list of str) – Species column names.

  • absence_ratio (float, default=1) – Number of pseudo-absences to sample per presence record for each species. For example, absence_ratio=1 requests one pseudo-absence per presence.

  • aoa_threshold (float or str, default=0.99) – Threshold passed to abil.analyze.area_of_applicability().

  • min_presence (int, default=100) – Minimum number of presence records required to generate pseudo-absences for a species.

  • allow_replacement (bool, default=True) – If True, sample outside-AOA candidate rows with replacement when there are fewer available candidate rows than requested by absence_ratio. This allows absence_ratio=1 to produce one pseudo-absence per presence even when outside-AOA candidates are scarce. If False, the number of pseudo-absences is capped by the number of unique outside-AOA candidate rows.

Returns:

merged_df plus sampled pseudo-absence rows. If no pseudo-absences can be generated, a copy of merged_df is returned.

Return type:

pandas.DataFrame
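How absence_ratio, min_presence, and allow_replacement might interact for a single species is sketched below. This is an illustration under stated assumptions, not the library's code: the helper sample_pseudo_absences is hypothetical, plain lists stand in for DataFrame rows, the requested count is assumed to be truncated with int(), and min_presence is assumed to be an inclusive lower bound.

```python
import random


def sample_pseudo_absences(candidates, n_presence, absence_ratio=1,
                           min_presence=100, allow_replacement=True, seed=0):
    """Hypothetical helper sketching the documented sampling rules."""
    # Too few presences: no pseudo-absences are generated for this species.
    if n_presence < min_presence:
        return []

    # Assumption: the requested count is truncated to an integer.
    n_requested = int(n_presence * absence_ratio)
    if not candidates or n_requested <= 0:
        return []

    rng = random.Random(seed)
    if n_requested <= len(candidates):
        # Enough unique outside-AOA candidates: sample without replacement.
        return rng.sample(candidates, k=n_requested)
    if allow_replacement:
        # Scarce candidates: reuse rows so the requested ratio is still met.
        return rng.choices(candidates, k=n_requested)
    # Replacement disabled: capped at the number of unique candidates.
    return rng.sample(candidates, k=len(candidates))


# Usage: 120 presences but only 10 outside-AOA candidate rows.
pool = list(range(10))
with_replacement = sample_pseudo_absences(pool, n_presence=120)
capped = sample_pseudo_absences(pool, n_presence=120, allow_replacement=False)
```

With replacement enabled, the sketch returns 120 draws that necessarily repeat rows. With it disabled, the result is capped at the 10 unique candidates.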