Area of Applicability#

abil.analyze.area_of_applicability(X_test, X_train, y_train=None, model=None, cv=None, metric='euclidean', feature_weights='permutation', feature_weight_kwargs=None, threshold='tukey', return_all=False)#

Estimate the area of applicability for the data using a strategy similar to that of Meyer & Pebesma (2022).

This calculates the importance-weighted feature distances from test points to training points, and then defines the “applicable” test sites as those closer than a threshold distance.
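A minimal sketch of the underlying idea, assuming Euclidean distances and pre-computed importance weights (the arrays and the `weights` variable below are illustrative, not part of the API):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Illustrative inputs: rows are samples, columns are features.
rng = np.random.default_rng(0)
X_train = rng.random((100, 4))
X_test = rng.random((25, 4))
weights = np.array([0.4, 0.3, 0.2, 0.1])  # assumed, pre-computed feature importances

# Scale the feature space by the importance weights.
Xtr_w = X_train * weights
Xte_w = X_test * weights

# Dissimilarity index (DI): distance from each test point to its nearest
# training point, divided by the average pairwise distance among training points.
train_dists = cdist(Xtr_w, Xtr_w)
mean_train_dist = train_dists[np.triu_indices_from(train_dists, k=1)].mean()
di_test = cdist(Xte_w, Xtr_w).min(axis=1) / mean_train_dist

# Test points whose DI falls below a training-derived cutoff are "applicable".
```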

Parameters:
  • X_test (numpy.ndarray) – array of features to be used in the estimation of the area of applicability

  • X_train (numpy.ndarray) – the training features used to calibrate cutoffs for the area of applicability

  • y_train (numpy.ndarray) – the outcome values to estimate feature importance weights. Must be provided if the permutation feature importance is calculated.

  • model (sklearn.BaseEstimator) – the model for which the feature importance will be calculated. Must be provided if the permutation feature importance is calculated.

  • cv (sklearn.BaseCrossValidator) – the cross-validator to use on the input training data, in order to calculate the within-sample extrapolation threshold

  • metric (str (Default: 'euclidean')) – the name of the metric used to calculate feature-based distances.

  • feature_weights (str or numpy.ndarray (Default: 'permutation')) – the name of the feature importance weighting strategy to be used. By default, scikit-learn’s permutation feature importance is used. Pre-calculated feature importance scores can also be used. To ignore feature importance, set feature_weights=False.

  • feature_weight_kwargs (dict) – options to pass to the feature weight estimation function. By default, these are passed directly to sklearn.inspection.permutation_importance()

  • threshold (str or float (Default: 'tukey')) – how to calculate the cutoff value used to decide whether the model is applicable at a given test point. The cutoff is calculated within the training data and applied to the test data (see the sketch after this parameter list).
    - 'tukey': use the Tukey rule, setting the cutoff at 1.5 times the inter-quartile range (IQR) above the upper hinge (75th percentile) of the train data dissimilarity index.
    - 'mad': use a median absolute deviation rule, setting the cutoff at three times the median absolute deviation above the median of the train data dissimilarity index.
    - float: if a value between zero and one is provided, the cutoff is set at that percentile of the train data dissimilarity index.

  • return_all (bool (Default: False)) – whether to also return the dissimilarity index and the local density of training points near each test point. The dissimilarity index is the distance from test to training points in feature space, divided by the average distance between training points. The local density is the count of training data points whose feature distance is closer than the threshold value.
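As referenced in the threshold description above, the three cutoff rules can be sketched directly on an array of training dissimilarity-index values (`di_train` below is an assumed placeholder, not produced by this API):

```python
import numpy as np

# Assumed dissimilarity-index values computed within the training data.
di_train = np.random.default_rng(0).random(100)

# threshold='tukey': upper hinge plus 1.5 times the inter-quartile range.
q1, q3 = np.percentile(di_train, [25, 75])
tukey_cutoff = q3 + 1.5 * (q3 - q1)

# threshold='mad': median plus three median absolute deviations.
med = np.median(di_train)
mad_cutoff = med + 3 * np.median(np.abs(di_train - med))

# threshold=0.95: the 95th percentile of the training dissimilarity index.
percentile_cutoff = np.quantile(di_train, 0.95)
```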

Returns:

  • If return_all=False, the output is a numpy.ndarray of shape (n_test_samples, ) indicating whether the model might be considered “applicable” at each test sample.

  • If return_all=True, the output is a tuple of numpy arrays. The first element is the applicability array described above, the second is the dissimilarity index for the test points, and the third is the local density of training points near each test point.

  • In the applicability array, a value of 0 indicates the point is within the Area of Applicability, while a value of 1 indicates the point is outside the Area of Applicability.
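A hedged end-to-end usage sketch: the keyword names follow the signature above, the import path is assumed from the qualified name abil.analyze.area_of_applicability, and the data and model are illustrative placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
from abil.analyze import area_of_applicability  # assumed import path

# Illustrative training and test data.
rng = np.random.default_rng(0)
X_train = rng.random((200, 4))
y_train = X_train @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(0, 0.1, 200)
X_test = rng.random((50, 4))

# A fitted model is needed for permutation feature importance.
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

aoa, di, density = area_of_applicability(
    X_test,
    X_train,
    y_train=y_train,
    model=model,
    cv=KFold(n_splits=5),
    feature_weights='permutation',
    threshold='tukey',
    return_all=True,
)

# aoa: 0 = inside the Area of Applicability, 1 = outside
# di: dissimilarity index per test point
# density: count of training points closer than the threshold
outside = aoa == 1
```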