Utility Functions

abil.utils.abbreviate_species(species_name)

Abbreviate a species name by shortening the first word to its initial.

Parameters: species_name (str) – Full species name.
Returns: Abbreviated species name.
Return type: str
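
A minimal sketch of the abbreviation described above, written as a standalone function rather than a call into abil. The name abbreviate_sketch and the output format (initial followed by a period) are assumptions; the library's own formatting may differ.

```python
def abbreviate_sketch(species_name: str) -> str:
    """Shorten the first word of a species name to its initial (hypothetical helper)."""
    words = species_name.split()
    if len(words) < 2:
        # Nothing to abbreviate for empty or single-word names.
        return species_name
    # Assumption: the initial is followed by a period, e.g. "E. huxleyi".
    return " ".join([words[0][0] + "."] + words[1:])


print(abbreviate_sketch("Emiliania huxleyi"))  # E. huxleyi
```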

abil.utils.do_nothing(x)

Apply no transformation to the input values.

Parameters: x (array-like) – Input values.
Returns: y – The input values, unchanged.
Return type: array-like

abil.utils.example_data(y_name, n_samples=100, n_features=5, noise=0.1, train_to_predict_ratio=0.7, zero_to_non_zero_ratio=0.5, random_state=59)

Generate training and prediction datasets indexed by ['lat', 'lon', 'depth', 'time']. The target includes zeros, and zero values can be upsampled.

Parameters:

y_name (str) – Name of the target variable.
n_samples (int) – Total number of samples to generate (training + prediction).
n_features (int) – Number of features for the dataset.
noise (float) – Noise level for the regression data.
train_to_predict_ratio (float) – Ratio of training to prediction data.
zero_to_non_zero_ratio (float) – Ratio of zero to non-zero target values after upsampling.
random_state (int) – Random seed for reproducibility.

Returns:

X_train (pd.DataFrame) – Training feature dataset with MultiIndex.
X_predict (pd.DataFrame) – Prediction feature dataset with MultiIndex.
y (pd.Series) – Target variable for training dataset.

abil.utils.find_optimal_threshold(model, X, y_test)

Finds the optimal probability threshold for binary classification using the ROC curve and Youden’s Index.

Parameters:

model (sklearn classifier) – A fitted binary classification model.
X (array-like of shape (n_samples, n_features)) – Input features for the test or validation set.
y_test (array-like of shape (n_samples,)) – True binary labels for the test or validation set.

Returns:

optimal_threshold (float) – The optimal probability threshold for classifying a sample as present.
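
Youden's Index picks the point on the ROC curve that maximises J = TPR - FPR. The sketch below shows that selection with scikit-learn's roc_curve. It is an illustration under stated assumptions, not the library's implementation: the helper name youden_threshold_sketch, the use of predict_proba, and the toy data are all hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve


def youden_threshold_sketch(model, X, y_true):
    """Return the probability threshold that maximises TPR - FPR (hypothetical helper)."""
    # Assumption: the model exposes predict_proba; column 1 is the positive class.
    proba = model.predict_proba(X)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_true, proba)
    # Youden's J statistic: sensitivity + specificity - 1, i.e. TPR - FPR.
    j = tpr - fpr
    return float(thresholds[np.argmax(j)])


# Toy data and classifier, used only to exercise the helper.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X, y)
threshold = youden_threshold_sketch(clf, X, y)

# Samples at or above the threshold are classified as present.
present = clf.predict_proba(X)[:, 1] >= threshold
```

In practice the threshold would be chosen on a held-out validation set rather than on the training data used here.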

abil.utils.inverse_weighting(values)

Compute inverse weighting for a list of values.

Parameters: values (list of float) – Input values.
Returns: Normalized inverse weights.
Return type: list of float
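
One common reading of "normalized inverse weights" is w_i = (1 / v_i) / sum_j (1 / v_j), so that smaller values receive larger weights and the weights sum to 1. The sketch below implements that reading; whether the library normalises in exactly this way is an assumption, and inverse_weights_sketch is a hypothetical name.

```python
def inverse_weights_sketch(values):
    """Weights proportional to 1/value, normalised to sum to 1 (hypothetical helper)."""
    # Assumption: every value is non-zero, otherwise 1/value is undefined.
    inverse = [1.0 / v for v in values]
    total = sum(inverse)
    return [w / total for w in inverse]


weights = inverse_weights_sketch([1.0, 2.0, 4.0])
# The smallest input value receives the largest weight.
```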

abil.utils.is_xgboost_model(model)

Recursively check whether the model is an XGBoost model, even if it is wrapped in a Pipeline or TransformedTargetRegressor. Uses getattr to check for XGBoost-specific attributes.

abil.utils.merge_obs_env(obs_path='../data/gridded_abundances.csv', env_path='../data/env_data.nc', env_vars=None, out_path='../data/obs_env.csv')

Merge observational and environmental datasets based on spatial and temporal indices.

Parameters:

obs_path (str, default="../data/gridded_abundances.csv") – Path to observational data CSV.
env_path (str, default="../data/env_data.nc") – Path to environmental data NetCDF file.
env_vars (list of str, optional) – List of environmental variables to include in the merge.
out_path (str, default="../data/obs_env.csv") – Path to save the merged dataset.

Return type:

None
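
The core of such a merge can be illustrated with two in-memory pandas frames. This sketch does not read CSV or NetCDF files and is not the library's implementation; the key columns ['lat', 'lon', 'depth', 'time'], the left join, and all data values are assumptions chosen for illustration.

```python
import pandas as pd

# Assumed spatial and temporal key columns.
keys = ["lat", "lon", "depth", "time"]

# Hypothetical observations: one row per gridded abundance record.
obs = pd.DataFrame(
    {
        "lat": [10.5, 20.5],
        "lon": [-30.5, -40.5],
        "depth": [5, 5],
        "time": [1, 6],
        "abundance": [12.0, 0.0],
    }
)

# Hypothetical environmental grid, already flattened to a table.
env = pd.DataFrame(
    {
        "lat": [10.5, 20.5, 30.5],
        "lon": [-30.5, -40.5, -50.5],
        "depth": [5, 5, 5],
        "time": [1, 6, 6],
        "temperature": [18.2, 11.4, 9.8],
        "nitrate": [0.3, 4.1, 6.0],
    }
)

# Keep only the requested environmental variables, then join on the keys.
env_vars = ["temperature"]
merged = obs.merge(env[keys + env_vars], on=keys, how="left")
```

A left join keeps every observation even when no environmental value matches; an inner join would drop unmatched rows instead.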

abil.utils.upsample(d, target, ratio=10)

Upsample zero and non-zero observations in the dataset to balance classes.

Parameters:

d (pd.DataFrame) – Input dataframe.
target (str) – Target column for upsampling.
ratio (int, default=10) – Ratio of zeros to non-zero samples after upsampling.

Returns:

ix – Upsampled dataframe.

Return type:

pd.DataFrame
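
One way to reach a fixed zero to non-zero ratio is to resample the zero rows with replacement. The sketch below shows that idea only; it resamples zeros and leaves non-zero rows untouched, which may differ from what the library does. The name upsample_zeros_sketch, its random_state argument, and the toy data are hypothetical.

```python
import pandas as pd


def upsample_zeros_sketch(
    d: pd.DataFrame, target: str, ratio: int = 10, random_state: int = 0
) -> pd.DataFrame:
    """Resample zero rows so there are `ratio` zeros per non-zero row (hypothetical helper)."""
    zeros = d[d[target] == 0]
    non_zeros = d[d[target] != 0]
    # Assumption: at least one zero row exists to draw from.
    n_zeros = ratio * len(non_zeros)
    zeros_resampled = zeros.sample(n=n_zeros, replace=True, random_state=random_state)
    return pd.concat([non_zeros, zeros_resampled])


df = pd.DataFrame(
    {"abundance": [0, 0, 3, 5], "temperature": [10.0, 12.0, 15.0, 18.0]}
)
balanced = upsample_zeros_sketch(df, "abundance", ratio=2)
```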

abil.utils.weighted_quantile(x, weights, q=0.5)

Computes the weighted quantile(s) of a dataset.

Parameters:

x (array-like of shape (n_samples,)) – The data for which to compute the quantile(s).
weights (array-like of shape (n_samples,)) – The weights corresponding to each data point in x.
q (float or array-like of floats, default=0.5) – The quantile(s) to compute. Must be between 0 and 1. If an array is provided, the function returns the weighted quantile for each value in q.

Returns:

result (float or list of floats) – The weighted quantile(s) corresponding to the input q. If q is a single float, the result is a single value. If q is array-like, the result is a list of quantiles.
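
Several definitions of a weighted quantile exist. The sketch below uses one common choice: sort the data, place each point at the midpoint of its cumulative weight interval, and interpolate linearly. It is not necessarily the definition the library uses, and weighted_quantile_sketch is a hypothetical name.

```python
import numpy as np


def weighted_quantile_sketch(x, weights, q=0.5):
    """Weighted quantile by linear interpolation over cumulative weights (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(x)
    x_sorted, w_sorted = x[order], weights[order]
    # Position of each point: midpoint of its weight interval, scaled to [0, 1].
    positions = (np.cumsum(w_sorted) - 0.5 * w_sorted) / w_sorted.sum()
    result = np.interp(q, positions, x_sorted)
    # Match the documented return shape: float for scalar q, list otherwise.
    return float(result) if np.isscalar(q) else result.tolist()


median_equal = weighted_quantile_sketch([1, 2, 3], [1, 1, 1])
median_skewed = weighted_quantile_sketch([1, 2, 3], [1, 1, 10])
quartiles = weighted_quantile_sketch([1, 2, 3], [1, 1, 1], q=[0.25, 0.75])
```

With equal weights the median is the middle value; putting most of the weight on the largest point pulls the median toward it.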

abil.utils.xgboost_get_n_estimators(model)

Recursively extract the n_estimators parameter from an XGBoost model, even if it is wrapped in a Pipeline or TransformedTargetRegressor.