Utility Functions
- abil.utils.abbreviate_species(species_name)
Abbreviate a species name by shortening the first word to its initial.
- Parameters:
species_name (str) – Full species name.
- Returns:
Abbreviated species name.
- Return type:
str
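The abbreviation described above can be illustrated with a minimal sketch. The helper name `abbreviate_species_sketch` is hypothetical, and the exact output format (initial followed by a period and a space) is an assumption, not confirmed behaviour of `abil.utils.abbreviate_species`.

```python
def abbreviate_species_sketch(species_name: str) -> str:
    """Shorten the first word of a species name to its initial.

    Hypothetical stand-in for abil.utils.abbreviate_species; the
    "G. species" output format is an assumption.
    """
    words = species_name.split()
    if len(words) < 2:
        # Empty or single-word names have nothing to abbreviate.
        return species_name
    # Keep every word after the first unchanged.
    return f"{words[0][0]}. {' '.join(words[1:])}"


print(abbreviate_species_sketch("Emiliania huxleyi"))  # E. huxleyi
```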
- abil.utils.do_nothing(x)
Apply no transformation to the input values.
- Parameters:
x (array-like) – Input values.
- Returns:
y – Non-transformed values.
- Return type:
array-like
- abil.utils.example_data(y_name, n_samples=100, n_features=5, noise=0.1, train_to_predict_ratio=0.7, zero_to_non_zero_ratio=0.5, random_state=59)
Generate training and prediction datasets with a ['lat', 'lon', 'depth', 'time'] MultiIndex. The target includes zeros, and zero values can be upsampled.
- Parameters:
y_name (str) – Name of the target variable.
n_samples (int) – Total number of samples to generate (training + prediction).
n_features (int) – Number of features for the dataset.
noise (float) – Noise level for the regression data.
train_to_predict_ratio (float) – Ratio of training to prediction data.
zero_to_non_zero_ratio (float) – Ratio of zero to non-zero target values after upsampling.
random_state (int) – Random seed for reproducibility.
- Returns:
X_train (pd.DataFrame) – Training feature dataset with MultiIndex.
X_predict (pd.DataFrame) – Prediction feature dataset with MultiIndex.
y (pd.Series) – Target variable for the training dataset.
- Return type:
tuple of (pd.DataFrame, pd.DataFrame, pd.Series)
- abil.utils.find_optimal_threshold(model, X, y_test)
Finds the optimal probability threshold for binary classification using the ROC curve and Youden’s Index.
- Parameters:
model (sklearn classifier) – A fitted binary classification model.
X (array-like of shape (n_samples, n_features)) – Input features for the test or validation set.
y_test (array-like of shape (n_samples,)) – True binary labels for the test or validation set.
- Returns:
optimal_threshold – The optimal probability threshold for classifying a sample as present.
- Return type:
float
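Youden's index is J = TPR - FPR, and the threshold that maximizes J along the ROC curve is taken as optimal. The sketch below shows that selection in plain Python, starting from predicted probabilities rather than from a fitted model. The helper name `youden_threshold_sketch` and the sample data are hypothetical, and how `find_optimal_threshold` obtains probabilities from `model` and `X` is not shown in the source.

```python
def youden_threshold_sketch(y_true, y_prob):
    """Return the candidate threshold that maximizes J = TPR - FPR.

    Hypothetical illustration only; not the abil implementation.
    """
    n_pos = sum(1 for label in y_true if label == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("y_true must contain both classes")

    best_threshold, best_j = None, float("-inf")
    # Each distinct predicted probability is a candidate threshold.
    for threshold in sorted(set(y_prob), reverse=True):
        tp = sum(1 for label, p in zip(y_true, y_prob)
                 if p >= threshold and label == 1)
        fp = sum(1 for label, p in zip(y_true, y_prob)
                 if p >= threshold and label == 0)
        j = tp / n_pos - fp / n_neg  # Youden's J at this threshold
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold


# Made-up labels and probabilities for illustration.
y_true = [0, 0, 1, 1, 0, 1, 1]
y_prob = [0.1, 0.3, 0.35, 0.6, 0.7, 0.8, 0.9]
print(youden_threshold_sketch(y_true, y_prob))  # 0.35
```

In this example a threshold of 0.35 captures all four positives while admitting one of three negatives, giving J = 1 - 1/3.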
- abil.utils.inverse_weighting(values)
Compute inverse weighting for a list of values.
- Parameters:
values (list of float) – Input values.
- Returns:
Normalized inverse weights.
- Return type:
list of float
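One common reading of "normalized inverse weights" is to take the reciprocal of each value and divide by the sum of reciprocals, so that smaller values receive larger weights and the weights sum to 1. The sketch below follows that reading. It is an assumption about the behaviour, and `inverse_weighting_sketch` is a hypothetical name.

```python
def inverse_weighting_sketch(values):
    """Return reciprocal weights normalized to sum to 1.

    Assumed interpretation of "normalized inverse weights";
    a zero in `values` would raise ZeroDivisionError here.
    """
    inverse = [1.0 / v for v in values]
    total = sum(inverse)
    return [w / total for w in inverse]


weights = inverse_weighting_sketch([1.0, 2.0, 4.0])
print(weights)  # the smallest input value gets the largest weight
```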
- abil.utils.is_xgboost_model(model)
Recursively check if the model is an XGBoost model, even if it’s wrapped in a Pipeline or TransformedTargetRegressor. Uses getattr to check for XGBoost-specific attributes.
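The recursive unwrapping described above might look roughly like the following sketch. It relies on attributes that exist in the real libraries (`steps` on scikit-learn's Pipeline, `regressor` and `regressor_` on TransformedTargetRegressor, `get_booster` on XGBoost's scikit-learn estimators), but which attributes abil actually inspects is an assumption. The stand-in classes and the name `looks_like_xgboost` are hypothetical, used so the example runs without either library installed.

```python
def looks_like_xgboost(model):
    """Duck-typed, recursive check for an XGBoost estimator (sketch)."""
    # XGBoost's scikit-learn estimators expose a get_booster() method.
    if callable(getattr(model, "get_booster", None)):
        return True
    # TransformedTargetRegressor holds the wrapped estimator in
    # `regressor` (and the fitted clone in `regressor_`).
    for attr in ("regressor_", "regressor"):
        inner = getattr(model, attr, None)
        if inner is not None and looks_like_xgboost(inner):
            return True
    # Pipeline keeps (name, estimator) tuples in `steps`;
    # the final step is the estimator.
    steps = getattr(model, "steps", None)
    if steps:
        return looks_like_xgboost(steps[-1][1])
    return False


# Hypothetical stand-ins that mimic only the attributes used above.
class FakeXGB:
    def get_booster(self):
        return None


class FakeTargetWrapper:
    def __init__(self, regressor):
        self.regressor = regressor


class FakePipeline:
    def __init__(self, steps):
        self.steps = steps


wrapped = FakePipeline([("scale", object()),
                        ("model", FakeTargetWrapper(FakeXGB()))])
print(looks_like_xgboost(wrapped))   # True
print(looks_like_xgboost(object()))  # False
```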
- abil.utils.merge_obs_env(obs_path='../data/gridded_abundances.csv', env_path='../data/env_data.nc', env_vars=None, out_path='../data/obs_env.csv')
Merge observational and environmental datasets based on spatial and temporal indices.
- Parameters:
obs_path (str, default="../data/gridded_abundances.csv") – Path to observational data CSV.
env_path (str, default="../data/env_data.nc") – Path to environmental data NetCDF file.
env_vars (list of str, optional) – List of environmental variables to include in the merge.
out_path (str, default="../data/obs_env.csv") – Path to save the merged dataset.
- Return type:
None
- abil.utils.upsample(d, target, ratio=10)
Upsample zero and non-zero observations in the dataset to balance classes.
- Parameters:
d (pd.DataFrame) – Input dataframe.
target (str) – Target column for upsampling.
ratio (int, default=10) – Ratio of zeros to non-zero samples after upsampling.
- Returns:
ix – Upsampled dataframe.
- Return type:
pd.DataFrame
- abil.utils.weighted_quantile(x, weights, q=0.5)
Computes the weighted quantile(s) of a dataset.
- Parameters:
x (array-like of shape (n_samples,)) – The data for which to compute the quantile(s).
weights (array-like of shape (n_samples,)) – The weights corresponding to each data point in x.
q (float or array-like of floats, default=0.5) – The quantile(s) to compute. Must be between 0 and 1. If an array is provided, the function will return the weighted quantiles for each value in q.
- Returns:
result – The weighted quantile(s) corresponding to the input q. If q is a single float, the result is a single value. If q is an array-like, the result is a list of quantiles.
- Return type:
float or list of floats
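Several definitions of a weighted quantile exist. The sketch below uses a simple one: sort the data, accumulate the normalized weights, and return the first value whose cumulative weight reaches q. Whether `weighted_quantile` uses this rule or an interpolating variant is not stated in the source, so the rule is an assumption and `weighted_quantile_sketch` is a hypothetical name.

```python
def weighted_quantile_sketch(x, weights, q=0.5):
    """Weighted quantile via the cumulative weight rule (no interpolation).

    Illustrative only; the abil implementation may differ.
    """
    pairs = sorted(zip(x, weights))  # sort data, keeping weights aligned
    total = sum(w for _, w in pairs)

    def one_quantile(qi):
        if not 0 <= qi <= 1:
            raise ValueError("q must be between 0 and 1")
        cumulative = 0.0
        for value, weight in pairs:
            cumulative += weight
            # First value whose cumulative weight share reaches qi.
            if cumulative / total >= qi:
                return value
        return pairs[-1][0]

    if isinstance(q, (int, float)):
        return one_quantile(q)
    return [one_quantile(qi) for qi in q]


x = [1, 2, 3, 4]
w = [1, 1, 1, 5]  # the value 4 carries most of the weight
print(weighted_quantile_sketch(x, w))               # 4
print(weighted_quantile_sketch(x, w, [0.25, 0.5]))  # [2, 4]
```

With equal weights the same rule reduces to an ordinary (non-interpolated) quantile, which makes it easy to sanity-check.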
- abil.utils.xgboost_get_n_estimators(model)
Recursively extract the n_estimators parameter from an XGBoost model, even if it’s wrapped in a Pipeline or TransformedTargetRegressor.