Supports the Extreme Gradient Boosting (XGBoost) package for use with SuperLearner; XGBoost is a variant of gradient boosted machines (GBM). This wrapper conducts internal cross-validation and stops adding trees once performance plateaus.
SL.xgboost_cv(
Y,
X,
newX,
family,
obsWeights,
id,
ntrees = 5000L,
early_stopping_rounds = 200L,
nfold = 5L,
max_depth = 4L,
shrinkage = 0.1,
minobspernode = 10L,
subsample = 0.7,
colsample_bytree = 0.8,
gamma = 5,
stratified = family$family == "binomial",
eval_metric = ifelse(family$family == "binomial", "auc", "rmse"),
print_every_n = 400L,
nthread = getOption("sl.cores", 1L),
verbose = 0,
save_period = NULL,
...
)
Outcome variable
Covariate data frame
Optional data frame of new observations for which to predict the outcome
"gaussian" for regression, "binomial" for binary classification, "multinomial" for multiple classification (not yet supported).
Optional observation-level weights (supported but not tested)
Optional id to group observations from the same unit (not used currently).
Maximum number of trees to fit; early stopping may halt training sooner. Low numbers may underfit while high numbers may overfit, depending also on the shrinkage.
If performance has not improved in this many rounds, stop.
Number of internal cross-validation folds.
How deep each tree can be. A depth of 1 allows no interactions, aka tree stumps.
How much to shrink each tree's predictions (the learning rate), in order to reduce overfitting.
Minimum number of observations allowed per tree node; nodes below this threshold will not be split further.
Proportion of observations sampled for each tree, to reduce correlation between trees.
Proportion of columns sampled for each tree, to reduce correlation between trees.
Minimum loss reduction required to make a further split; higher values result in less complex trees.
Whether stratified sampling should be used within cross-validation folds for binary outcomes; defaults to TRUE when the family is binomial.
Metric to use for early-stopping, defaults to AUC for classification and RMSE for regression.
Print estimation status every n rounds.
How many threads (cores) xgboost should use. Generally we want to keep this at 1 so that XGBoost does not compete with SuperLearner parallelization; it can be set globally via options(sl.cores = ...), as in the sketch after these arguments.
Verbosity of XGBoost fitting.
How often (in tree iterations) to save the current model to disk during fitting. If NULL, the model is not saved; if 0, the model is saved at the end.
Any remaining arguments (not used).
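
The wrapper can also be called directly, outside of SuperLearner(). A minimal sketch on toy data, assuming the usual SuperLearner wrapper convention that the return value is a list with pred and fit components (the data, sample size, and tuning values here are illustrative):

library(SuperLearner)  # SL.xgboost_cv must also be on the search path

# Keep xgboost single-threaded so it does not compete with
# SuperLearner parallelization; this feeds the nthread default above.
options(sl.cores = 1)

set.seed(1)
n <- 200
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
Y <- rbinom(n, 1, plogis(X$x1 - X$x2))

fit <- SL.xgboost_cv(Y = Y, X = X, newX = X,
                     family = binomial(),
                     obsWeights = rep(1, n),
                     id = NULL,
                     ntrees = 500L,
                     early_stopping_rounds = 20L)

head(fit$pred)  # predicted probabilities for newX (assumed $pred element)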
The performance of XGBoost, like that of GBM, is sensitive to its configuration settings. It is therefore best to create multiple configurations using create.SL.xgboost and allow SuperLearner to choose the best weights based on cross-validated performance, as sketched below.
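
A sketch of that grid-creation step, assuming create.SL.xgboost() from the SuperLearner package, its tune list of ntrees, max_depth, shrinkage, and minobspernode values, and that the generated learner names are returned invisibly:

library(SuperLearner)

# One learner function is generated per combination of tuning values.
tune <- list(ntrees = c(500L, 2000L),
             max_depth = c(2L, 4L),
             shrinkage = c(0.01, 0.1),
             minobspernode = c(10L))
learners <- create.SL.xgboost(tune = tune, detailed_names = TRUE)

# Pass the generated names to SuperLearner via SL.library and let the
# ensemble weight them by cross-validated performance.
learners$names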
If you run into errors, please first try installing the latest version of the xgboost package from CRAN.
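
For reference, a minimal end-to-end regression sketch (toy data; the library of learners is illustrative):

library(SuperLearner)

set.seed(2)
n <- 300
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
Y <- X$x1 + sin(X$x2) + rnorm(n)

# SL.mean serves as a simple benchmark alongside the CV wrapper.
sl <- SuperLearner(Y = Y, X = X, family = gaussian(),
                   SL.library = c("SL.mean", "SL.xgboost_cv"))
sl  # prints cross-validated risk and the weight given to each learner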