2-phase Ensemble#
YAML example#
Before running the model, model specifications need to be defined in a YAML file. For a detailed explanation of each parameter see Model YAML Configuration.
An example YAML file for a 2-phase model is provided below. Note that, compared to a 1-phase regressor model, the hyper-parameters for the classifier also need to be specified.
---
root: ./
run_name: 2-phase #update for specific run name
path_out: tests/ModelOutput/ #root + folder
prediction: docs/examples/data/prediction.csv #root + folder
targets: docs/examples/data/targets.csv #root + folder
training: docs/examples/data/training.csv
predictors: ["feature_1", "feature_2", "feature_3"]
verbose: 1
seed : 1 # random seed
n_threads : 3 # how many cpu threads to use
cv : 3
ensemble_config:
  classifier: True
  regressor: True
  m1: "rf"
  m2: "knn"
  m3: "xgb"
upsample: False
stratify: True
param_grid:
  rf_param_grid:
    reg_param_grid:
      n_estimators: [100]
      max_features: [4]
      max_depth: [50]
      min_samples_leaf: [0.5]
      max_samples: [0.5]
    clf_param_grid:
      n_estimators: [100]
      max_depth: [50]
      max_samples: [0.8]
  xgb_param_grid:
    reg_param_grid:
      learning_rate: [0.05]
      n_estimators: [100]
      max_depth: [7]
      subsample: [0.8]
      colsample_bytree: [0.5]
      gamma: [1]
      reg_alpha: [0.1]
    clf_param_grid:
      learning_rate: [0.01]
      n_estimators: [100]
      max_depth: [4]
      subsample: [0.6]
      colsample_bytree: [0.6]
      gamma: [1]
      reg_alpha: [1]
  knn_param_grid:
    reg_param_grid:
      max_samples: [0.2]
      max_features: [0.2]
      estimator__leaf_size: [25]
      estimator__n_neighbors: [3]
    clf_param_grid:
      max_samples: [0.2]
      max_features: [0.2]
      estimator__leaf_size: [25]
      estimator__n_neighbors: [3]
knn_bagging_estimators: 3
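Because a 2-phase model needs both a regressor and a classifier grid for every ensemble member, it can help to run a quick structural check on the parsed configuration before tuning. The sketch below is not part of Abil. It assumes the YAML parses into a nested dictionary in which `ensemble_config` holds the members `m1`, `m2`, ... and `param_grid` holds one `<model>_param_grid` entry per member. The helper name `missing_grids` is hypothetical, and a trimmed-down dictionary stands in for the loaded file.

```python
# Hypothetical helper, not part of the Abil API.
# Assumes the parsed YAML is a nested dict: ensemble_config holds m1, m2, ...
# and param_grid holds "<model>_param_grid" entries with reg/clf sub-grids.

def missing_grids(config):
    """Return (model, grid_name) pairs that the configuration lacks."""
    ensemble = config["ensemble_config"]
    # Member models are stored under keys m1, m2, m3, ...
    models = [v for k, v in sorted(ensemble.items())
              if k.startswith("m") and k[1:].isdigit()]
    needed = []
    if ensemble.get("regressor"):
        needed.append("reg_param_grid")
    if ensemble.get("classifier"):
        needed.append("clf_param_grid")

    missing = []
    for model in models:
        grids = config["param_grid"].get(f"{model}_param_grid") or {}
        for name in needed:
            if not grids.get(name):
                missing.append((model, name))
    return missing


# Trimmed stand-in for the dictionary returned by yaml.load
example_config = {
    "ensemble_config": {"classifier": True, "regressor": True,
                        "m1": "rf", "m2": "knn"},
    "param_grid": {
        "rf_param_grid": {
            "reg_param_grid": {"n_estimators": [100]},
            "clf_param_grid": {"n_estimators": [100]},
        },
        "knn_param_grid": {
            "reg_param_grid": {"max_samples": [0.2]},
            # clf_param_grid deliberately left out
        },
    },
}

print(missing_grids(example_config))  # [('knn', 'clf_param_grid')]
```

An empty list means every member listed in `ensemble_config` has the grids that the enabled phases require.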
Running the model#
After specifying the model configuration in the relevant YAML file, we can use the Abil API to 1) tune the model, evaluating its performance across different hyper-parameter values and selecting the best configuration; 2) predict in-sample and out-of-sample observations using the optimal hyper-parameter configuration identified in the first step; and 3) conduct post-processing, such as exporting relevant performance metrics, spatially or temporally integrated target estimates, and diversity metrics.
Loading dependencies#
Before running the Python script we need to import all relevant Python packages. For instructions on how to install these packages, see requirements.txt and the Abil Installation.
import numpy as np
from yaml import load
from yaml import CLoader as Loader
from abil.tune import tune
from abil.predict import predict
from abil.post import post
from abil.utils import example_data
Loading the configuration YAML#
After loading the required packages we need to define our file paths. Note that path separators are operating-system specific: Unix and macOS use '/' while Windows uses '\'.
with open('2-phase.yml', 'r') as f:
    model_config = load(f, Loader=Loader)
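To avoid hard-coding a separator, paths can be built with Python's standard `pathlib` module, which picks the right one for the current operating system. The sketch below is illustrative only: `config_dir` is a hypothetical folder name, and the pure path classes are used just to show how the same path renders on each system.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath

# Hypothetical folder; replace with wherever your YAML file lives
config_dir = "configs"

# Path joins components with the separator of the current operating system
config_path = Path(config_dir) / "2-phase.yml"

# The pure variants show how the same path is rendered on each system
posix_style = str(PurePosixPath(config_dir) / "2-phase.yml")
windows_style = str(PureWindowsPath(config_dir) / "2-phase.yml")

print(posix_style)    # configs/2-phase.yml
print(windows_style)  # configs\2-phase.yml
```

`open()` accepts `Path` objects directly, so `config_path` could be passed in place of the string file name.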
Creating example data#
Next we create some example data. When applying the pipeline to your own data, note that the data needs to be in a Pandas DataFrame format.
target_name = "Emiliania huxleyi"
X_train, X_predict, y = example_data(target_name, n_samples=1000, n_features=3,
                                     noise=0.1, train_to_predict_ratio=0.7,
                                     random_state=59)
Training the model#
Next we train our model. Note that, depending on the number of hyper-parameter values specified in the YAML file, this can be computationally very expensive, and it is recommended to do this on an HPC system.
m = tune(X_train, y, model_config)
m.train(model="rf")
m.train(model="xgb")
m.train(model="knn")
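To get a feel for the cost before submitting a job, the number of hyper-parameter combinations can be counted from a parameter grid. The sketch below is a rough estimate, not Abil functionality: it assumes an exhaustive grid search in which every combination is fitted once per cross-validation fold, and `count_fits` is a hypothetical helper.

```python
from math import prod


def count_fits(grid, cv):
    """Number of fits for an exhaustive search over `grid` with `cv` folds."""
    # One candidate per element of the Cartesian product of all value lists
    n_candidates = prod(len(values) for values in grid.values())
    return n_candidates * cv


# Single-valued grid, as in the example YAML: one candidate per fold
small_grid = {"n_estimators": [100], "max_depth": [50], "max_samples": [0.8]}

# Hypothetical wider grid, for comparison only
wide_grid = {"n_estimators": [100, 250, 500],
             "max_depth": [10, 50],
             "max_samples": [0.5, 0.8]}

print(count_fits(small_grid, cv=3))  # 3
print(count_fits(wide_grid, cv=3))   # 36
```

The count grows multiplicatively with each extra value, and in a 2-phase ensemble it applies separately to the regressor and classifier grid of each model.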
Making predictions#
After training our model we can make predictions on a new dataset (X_predict):
m = predict(X_train, y, X_predict, model_config)
m.make_prediction()
Post-processing#
Finally, we conduct the post-processing.
targets = np.array([target_name])
def do_post(statistic):
    m = post(X_train, y, X_predict, model_config, statistic, datatype="poc")
    m.estimate_applicability()
    m.estimate_carbon("pg poc")
    m.total()
    m.merge_env()
    m.merge_obs("predictions_obs", targets)
    m.export_ds("my_first_2-phase_model")
    vol_conversion = 1e3 #L-1 to m-3
    integ = m.integration(m, vol_conversion=vol_conversion)
    integ.integrated_totals(targets, monthly=True)
    integ.integrated_totals(targets)
do_post(statistic="mean")
do_post(statistic="ci95_UL")
do_post(statistic="ci95_LL")