rtemis (version 0.79)

s.XGB: XGBoost Classification and Regression [C, R]

Description

Tune hyperparameters using grid search and resampling, train a final model, and validate it

Usage

s.XGB(x, y = NULL, x.test = NULL, y.test = NULL, x.name = NULL,
  y.name = NULL, booster = c("gbtree", "gblinear", "dart"),
  silent = 1, missing = NA, nrounds = 500L, force.nrounds = NULL,
  weights = NULL, ipw = TRUE, ipw.type = 2, upsample = FALSE,
  upsample.seed = NULL, obj = NULL, feval = NULL, maximize = NULL,
  xgb.verbose = NULL, print_every_n = 100L,
  early.stopping.rounds = 50L, eta = 0.1, gamma = 0, max.depth = 3,
  min.child.weight = 5, max.delta.step = 0, subsample = 0.75,
  colsample.bytree = NULL, colsample.bylevel = 1, lambda = NULL,
  lambda.bias = 0, alpha = 0, tree.method = "auto",
  sketch.eps = 0.03, num.parallel.tree = 1, base.score = NULL,
  objective = NULL, sample.type = "uniform",
  normalize.type = "forest", rate.drop = 0.1, skip.drop = 0.5,
  resampler = "strat.sub", n.resamples = 10, train.p = 0.75,
  strat.n.bins = 4, stratify.var = NULL, target.length = NULL,
  seed = NULL, error.curve = FALSE, plot.res = TRUE,
  save.res = FALSE, save.res.mod = FALSE, importance = FALSE,
  print.plot = TRUE, plot.fitted = NULL, plot.predicted = NULL,
  plot.theme = getOption("rt.fit.theme", "lightgrid"), question = NULL,
  rtclass = NULL, save.dump = FALSE, verbose = TRUE, n.cores = 1,
  nthread = NULL, parallel.type = c("psock", "fork"), outdir = NULL,
  save.mod = ifelse(!is.null(outdir), TRUE, FALSE))
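
A minimal, illustrative sketch of a typical call, using synthetic regression data and a manual train/test split; the data, split, and hyperparameter values are arbitrary and only meant to show the interface:

  ## Synthetic regression data (illustrative only)
  set.seed(2019)
  x <- as.data.frame(matrix(rnorm(500 * 10), nrow = 500))
  y <- x[, 1] + x[, 2]^2 + rnorm(500)
  idx <- sample(500, 400)
  ## Train an XGBoost tree booster and validate on the held-out set
  mod <- s.XGB(x[idx, ], y[idx],
               x.test = x[-idx, ], y.test = y[-idx],
               nrounds = 500, eta = .1, max.depth = 3)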

Arguments

x

Numeric vector or matrix / data frame of features i.e. independent variables

y

Numeric vector of outcome, i.e. dependent variable

x.test

Numeric vector or matrix / data frame of testing set features. Columns must correspond to columns in x

y.test

Numeric vector of testing set outcome

x.name

Character: Name for feature set

y.name

Character: Name for outcome

booster

String: Booster to use. Options: "gbtree", "gblinear", "dart"

silent

0: print XGBoost messages; 1: print no XGBoost messages

missing

String or Numeric: Which values to consider as missing. Default = NA

nrounds

Integer: Maximum number of rounds to run. Can be set to a high number as early stopping will limit nrounds by monitoring inner CV error

force.nrounds

Integer: Number of rounds to run if not estimating optimal number by CV

weights

Numeric vector: Weights for cases. For classification, weights takes precedence over ipw: if weights are provided, ipw is not used. Leave NULL if setting ipw = TRUE. Default = NULL

ipw

Logical: If TRUE, apply inverse probability weighting (for Classification only). Note: If weights are provided, ipw is not used. Default = TRUE

ipw.type

Integer: 0, 1, or 2. 1: class.weights as in 0, divided by max(class.weights); 2: class.weights as in 0, divided by min(class.weights). Default = 2

upsample

Logical: If TRUE, upsample cases to balance outcome classes (for Classification only). Caution: upsample will randomly sample with replacement if the length of the majority class is more than double the length of the class you are upsampling, thereby introducing randomness

upsample.seed

Integer: If provided, will be used to set the seed during upsampling. Default = NULL (random seed)

obj

Function: Custom objective function. See ?xgboost::xgboost

feval

Function: Custom evaluation function. See ?xgboost::xgboost

xgb.verbose

Integer: Verbose level for XGB learners used for tuning.

print_every_n

Integer: Print evaluation metrics every this many iterations

early.stopping.rounds

Integer: Training on resamples of x.train (tuning) will stop if performance does not improve for this many rounds

eta

[gS] Float (0, 1): Learning rate. Default = .1

gamma

[gS] Float: Minimum loss reduction required to make a further partition

max.depth

[gS] Integer: Maximum tree depth. (Default = 3)

subsample

[gS] Float (0, 1]: Fraction of training cases to subsample for each boosting round. Default = .75

colsample.bytree

[gS] Float (0, 1]: Fraction of features (columns) to sample for each tree

colsample.bylevel

[gS] Float (0, 1]: Fraction of features (columns) to sample for each tree level. Default = 1

lambda

[gS] L2 regularization on weights

alpha

[gS] L1 regularization on weights

tree.method

[gS] XGBoost tree construction algorithm (Default = "auto")

sketch.eps

[gS] Float (0, 1): Used with tree.method "approx"; controls the accuracy of the approximate quantile sketch (roughly O(1 / sketch.eps) bins). Default = .03

num.parallel.tree

Integer: Number of trees to grow in parallel; results in a Random Forest-like algorithm. (Default = 1, i.e. regular boosting)

base.score

Float: The mean outcome response (no need to set)

objective

(Default = NULL)

sample.type

String: Sampling algorithm for the "dart" booster. (Default = "uniform")

normalize.type

String: Normalization algorithm for the "dart" booster. (Default = "forest")

print.plot

Logical: if TRUE, produce plot using mplot3. Takes precedence over plot.fitted and plot.predicted

plot.fitted

Logical: if TRUE, plot True (y) vs Fitted

plot.predicted

Logical: if TRUE, plot True (y.test) vs Predicted. Requires x.test and y.test

plot.theme

String: "zero", "dark", "box", "darkbox"

question

String: the question you are attempting to answer with this model, in plain language.

rtclass

String: Class type to use. "S3", "S4", "RC", "R6"

verbose

Logical: If TRUE, print summary to screen.

nthread

Integer: Number of threads for xgboost using OpenMP. Parallelize either the resamples (using n.cores) or the xgboost execution itself (using this setting), not both. At the time of writing, parallelization via this parameter causes the linear booster to fail most of the time; therefore, the default is rtCores for 'gbtree' and 1 for 'gblinear'

outdir

Path to output directory. If defined, will save Predicted vs. True plot, if available, as well as full model output, if save.mod is TRUE

save.mod

Logical: If TRUE, save all output as an RDS file in outdir. save.mod is TRUE by default if an outdir is defined. If set to TRUE and no outdir is defined, outdir defaults to paste0("./s.", mod.name)

lambda.bias

[gS] for *linear* booster: L2 regularization on bias

Value

rtMod object

Details

[gS]: indicates a parameter that will be autotuned by grid search if multiple values are passed. (s.XGB does its own grid search, similar to gridSearchLearn; it may switch to gridSearchLearn, similar to s.GBM.) Learn more about XGBoost's parameters here: http://xgboost.readthedocs.io/en/latest/parameter.html

Case weights, and therefore IPW, do not seem to work, despite following the documentation: note how ipw = TRUE fails while upsample = TRUE works on an imbalanced dataset.

11.24.16: Updated to work with the latest development version of XGBoost from GitHub, which changed some of xgboost's return values and is therefore not compatible with older versions.

s.XGBLIN is a wrapper for s.XGB with booster = "gblinear".
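
As a hedged sketch of the grid search described above, passing vectors to [gS] parameters tunes them over the internal resamples before the final model is trained; the candidate values and resampling settings below are arbitrary:

  ## Multiple values for [gS] parameters trigger the internal grid search,
  ## evaluated here over 10 stratified subsamples of the training set
  mod <- s.XGB(x, y,
               eta = c(.01, .1),
               max.depth = c(3, 6),
               subsample = c(.5, .75),
               resampler = "strat.sub", n.resamples = 10)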

See Also

elevate for external cross-validation

Other Supervised Learning: s.ADABOOST, s.ADDTREE, s.BART, s.BAYESGLM, s.BRUTO, s.C50, s.CART, s.CTREE, s.DA, s.ET, s.EVTREE, s.GAM.default, s.GAM.formula, s.GAMSEL, s.GAM, s.GBM3, s.GBM, s.GLMNET, s.GLM, s.GLS, s.H2ODL, s.H2OGBM, s.H2ORF, s.IRF, s.KNN, s.LDA, s.LM, s.MARS, s.MLRF, s.MXN, s.NBAYES, s.NLA, s.NLS, s.NW, s.POLYMARS, s.PPR, s.PPTREE, s.QDA, s.QRNN, s.RANGER, s.RFSRC, s.RF, s.SGD, s.SPLS, s.SVM, s.TFN, s.XGBLIN

Other Tree-based methods: s.ADABOOST, s.ADDTREE, s.BART, s.C50, s.CART, s.CTREE, s.ET, s.EVTREE, s.GBM3, s.GBM, s.H2OGBM, s.H2ORF, s.IRF, s.MLRF, s.PPTREE, s.RANGER, s.RFSRC, s.RF
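
A hedged sketch of external cross-validation via elevate, as referenced above; the "XGB" learner name and default resampling settings are assumed from rtemis conventions and may differ by version:

  ## Cross-validate the s.XGB learner externally
  cv <- elevate(x, y, mod = "XGB")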