1-phase Ensemble

YAML example

Before running the model, the model specifications must be defined in a YAML file. For a detailed explanation of each parameter, see Model YAML Configuration.

An example YAML file for a 1-phase model is provided below.

---
  root: ./

  run_name: regressor #update for specific run name
  path_out:  tests/ModelOutput/ #root + folder

  prediction:  docs/examples/data/prediction.csv #root + folder
  targets:  docs/examples/data/targets.csv #root + folder
  training:  docs/examples/data/training.csv

  predictors: ["feature_1", "feature_2", "feature_3"]

  verbose: 1
  seed : 1 # random seed
  n_threads : 3 # how many cpu threads to use
  cv : 10

  ensemble_config: 
    classifier: False
    regressor: True
    m1: "rf"
    m2: "xgb"
    m3: "knn"

  upsample: False
  stratify: False

  param_grid:
    rf_param_grid:
      reg_param_grid:
        n_estimators: [100]
        max_features: [0.2, 0.4, 0.8]
        max_depth: [50]
        min_samples_leaf: [0.5]
        max_samples: [0.5]     

    xgb_param_grid:
      reg_param_grid:  
        learning_rate: [0.05]
        n_estimators: [100]
        max_depth: [7]
        subsample: [0.8]  
        colsample_bytree: [0.5]
        gamma: [1] 
        reg_alpha: [0.1]   

    knn_param_grid:
      reg_param_grid:  
        max_samples: [0.85]
        max_features: [0.85]
        estimator__leaf_size: [5]
        estimator__n_neighbors: [5]
        estimator__p:  [1]        
        estimator__weights: ["uniform"]

  knn_bagging_estimators: 30
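
A missing parameter grid is cheaper to catch before tuning starts than after. The sketch below is not part of Abil: `check_config` is a hypothetical helper, and the dictionary mirrors only the parts of the YAML above that it inspects. It assumes that every ensemble member named under `ensemble_config` needs a matching `<name>_param_grid` entry with a `reg_param_grid` section when `regressor` is true, as in the example.

```python
# Hypothetical pre-flight check; not part of the Abil API.
# `config` mirrors a subset of the example YAML after loading.
config = {
    "predictors": ["feature_1", "feature_2", "feature_3"],
    "ensemble_config": {
        "classifier": False,
        "regressor": True,
        "m1": "rf",
        "m2": "xgb",
        "m3": "knn",
    },
    "param_grid": {
        "rf_param_grid": {"reg_param_grid": {"n_estimators": [100]}},
        "xgb_param_grid": {"reg_param_grid": {"learning_rate": [0.05]}},
        "knn_param_grid": {"reg_param_grid": {"estimator__n_neighbors": [5]}},
    },
}


def check_config(config):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    ensemble = config.get("ensemble_config", {})
    # Member models are stored under the keys m1, m2, ...
    members = [value for key, value in sorted(ensemble.items())
               if key.startswith("m") and key[1:].isdigit()]
    if not members:
        problems.append("no ensemble members (m1, m2, ...) defined")
    if not config.get("predictors"):
        problems.append("no predictors listed")
    for name in members:
        grid = config.get("param_grid", {}).get(f"{name}_param_grid", {})
        if ensemble.get("regressor") and "reg_param_grid" not in grid:
            problems.append(f"{name}: missing reg_param_grid")
    return problems


problems = check_config(config)
print(problems or "configuration looks consistent")
```

The same function could be called on the dictionary returned by the YAML loader before any tuning is started.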

Running the model

After specifying the model configuration in the relevant YAML file, we can use the Abil API to: (1) tune the model, evaluating its performance across different hyper-parameter values and selecting the best configuration; (2) predict in-sample and out-of-sample observations using the optimal hyper-parameter configuration identified in the first step; and (3) conduct post-processing, such as exporting relevant performance metrics, spatially or temporally integrated target estimates, and diversity metrics.

Loading dependencies

Before running the Python script we need to import all relevant Python packages. For instructions on how to install these packages, see requirements.txt and the Abil Installation.

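import numpy as np
from yaml import load
try:
    from yaml import CLoader as Loader  # fast C-based loader; requires LibYAML
except ImportError:
    from yaml import Loader  # pure-Python fallback
from abil.tune import tune
from abil.predict import predict
from abil.post import post
from abil.utils import example_data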

Loading the configuration YAML

After loading the required packages, we need to define our file paths and load the configuration. Note that path separators are operating-system specific: Unix and macOS use '/', while Windows uses '\'.

with open('regressor.yml', 'r') as f:
    model_config = load(f, Loader=Loader)
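
One way to sidestep the separator difference is to build paths with the standard-library `pathlib` module instead of concatenating strings. The sketch below joins the `root` and `path_out` values from the example YAML and renders the result for both path flavours. How Abil combines these values internally is not shown here, so this is only an illustration for your own scripts.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Values taken from the example YAML.
root = "./"
path_out = "tests/ModelOutput/"

# The same join rendered with each operating system's separator.
posix_path = PurePosixPath(root) / path_out
windows_path = PureWindowsPath(root) / path_out

print(posix_path)    # tests/ModelOutput
print(windows_path)  # tests\ModelOutput
```

In a real script, `pathlib.Path` selects the flavour that matches the operating system the script runs on.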

Creating example data

Next, we create some example data. When applying the pipeline to your own data, note that the data must be provided as a pandas DataFrame.

target_name = "Emiliania huxleyi"
X_train, X_predict, y = example_data(target_name, n_samples=1000, n_features=3,
                                     noise=0.1, train_to_predict_ratio=0.7,
                                     random_state=59)

Training the model

Next, we train our model. Note that, depending on the number of hyper-parameter values specified in the YAML file, this can be computationally very expensive, and it is recommended to do this on an HPC system.

m = tune(X_train, y, model_config)
m.train(model="rf")
m.train(model="xgb")
m.train(model="knn")

Making predictions

After training our model we can make predictions on a new dataset (X_predict):

m = predict(X_train, y, X_predict, model_config)
m.make_prediction()

Post-processing

Finally, we conduct the post-processing.

targets = np.array([target_name])
def do_post(statistic):
    m = post(X_train, y, X_predict, model_config, statistic, datatype="poc")
    
    m.estimate_applicability()
    m.estimate_carbon("pg poc")
    m.total()

    m.merge_env()
    m.merge_obs("predictions_obs", targets)

    m.export_ds("my_first_regressor_model")

    vol_conversion = 1e3  # convert from per litre (L-1) to per cubic metre (m-3)
    integ = m.integration(m, vol_conversion=vol_conversion)
    integ.integrated_totals(targets, monthly=True)
    integ.integrated_totals(targets)

do_post(statistic="mean")
do_post(statistic="ci95_UL")
do_post(statistic="ci95_LL")
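
The `vol_conversion` factor used above follows from one cubic metre holding 1000 litres, so a quantity expressed per litre is multiplied by 1e3 to express it per cubic metre. The sketch below shows only this arithmetic. The helper name and the concentration value are hypothetical, and it does not reproduce what `integration` does internally.

```python
LITRES_PER_CUBIC_METRE = 1e3  # 1 m^3 = 1000 L


def per_litre_to_per_cubic_metre(value_per_litre):
    """Convert a per-litre quantity to a per-cubic-metre quantity."""
    return value_per_litre * LITRES_PER_CUBIC_METRE


# Hypothetical concentration, for illustration only.
concentration_per_litre = 2.5
print(per_litre_to_per_cubic_metre(concentration_per_litre))  # 2500.0
```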