Laurae (version 0.0.0.9001)

xgb.ncv: xgboost repeated cross-validation (Repeated k-fold)

Description

This function runs a repeated cross-validation using xgboost, returns out-of-fold predictions, and can also produce predictions from each fold's model on external data. It currently only supports single-column predictions (binary classification and regression). Verbosity is automatic and cannot be disabled; if you need this function without verbosity, recompile the package after removing the verbose messages. In addition, a sink is forced. If the function's execution is interrupted prematurely (by you or by xgboost), make sure to run sink(); otherwise, no further messages will be printed to your R console.
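If you are unsure how many sinks were left open after an interruption, the following sketch (using only base R) closes them all and restores console output:

```r
# Close every open sink so messages print to the console again.
# sink.number() returns how many output diversions are currently active.
while (sink.number() > 0) {
  sink()
}
```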

Usage

xgb.ncv(data, label, extra_data = NA, out_of_fold = TRUE, nfolds = 5,
  ntimes = 3, nthread = 2, seed = 11111, verbose = 1,
  print_every_n = 1, sinkfile = "debug.txt", booster = "gbtree",
  eta = 0.3, max_depth = 6, min_child_weight = 1, gamma = 0,
  subsample = 1, colsample_bytree = 1, num_parallel_tree = 1,
  maximum_rounds = 1e+05, objective = "binary:logistic",
  eval_metric = "logloss", maximize = FALSE, early_stopping_rounds = 50)

Arguments

data
The data as a matrix or sparse matrix.
label
The label associated with the data.
extra_data
The external data you want to predict on using each fold's model.
out_of_fold
Should we predict out of fold? (this includes both data and extra_data). Defaults to TRUE.
nfolds
How many folds should we use for the validation? The more folds, the better, but computation time increases linearly (nfolds * ntimes). Defaults to 5.
ntimes
How many times should we repeat the cross-validation? The more repetitions, the more stable the results, but computation time increases linearly (ntimes * nfolds). Defaults to 3.
nthread
How many threads to run for xgboost? Defaults to 2.
seed
Which seed should we use globally for all commands dependent on a random seed? Defaults to 11111.
verbose
Should we print verbose data in xgboost? xgboost messages will be sunk to the sink file in any case. Defaults to 1.
print_every_n
Every how many iterations should we print verbose data? xgboost messages will be sunk to the sink file in any case. Defaults to 1.
sinkfile
What file name to give to the sink? This is where printed messages of xgboost will be stored. Defaults to "debug.txt".
booster
What xgboost booster to use? Defaults to "gbtree" and must not be changed (the function does NOT work with other boosters).
eta
The shrinkage (learning rate) in xgboost. The lower the better, but computation time increases exponentially as it gets lower. Defaults to 0.3.
max_depth
The maximum depth of each tree in xgboost. Defaults to 6.
min_child_weight
The minimum hessian weight needed in a child node. Defaults to 1.
gamma
The minimum loss reduction needed in a child node. Defaults to 0.
subsample
The sampling ratio of observations during each iteration. Use 0.632 to simulate Random Forests. Defaults to 1.
colsample_bytree
The sampling ratio of features during each iteration. Defaults to 1.
num_parallel_tree
How many trees to grow per iteration? A number higher than 1 simulates boosted Random Forests. Defaults to 1.
maximum_rounds
How many rounds until giving up boosting if not stopped early? Defaults to 100000.
objective
The objective function. Defaults to "binary:logistic".
eval_metric
The evaluation metric. Defaults to "logloss".
maximize
Should we maximize the evaluation metric? Defaults to FALSE.
early_stopping_rounds
How many rounds without improvement in the evaluation metric (according to the maximization rule) before stopping boosting early on a fold? Defaults to 50.

Value

A list with two to four elements: "scores" for the scored folds (data.frame), "folds" for the folds IDs (list), "preds" for out of fold predictions (data.frame), and "extra" for extra data predictions per fold (data.frame).

Examples

# Pick your xgb.cv function, replace data with the initial matrix, insert the label,
# set ntimes to the value you want, and change the sinkfile.
# Unlist params if needed, and add the seed as a parameter.
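Following the comments above, a minimal sketch of a call (assuming the Laurae package and xgboost are installed; the data here is simulated purely for illustration):

```r
library(Laurae)

# Simulated binary classification data (hypothetical, for illustration only)
set.seed(11111)
data <- matrix(rnorm(1000 * 10), nrow = 1000, ncol = 10)
label <- as.numeric(data[, 1] + rnorm(1000) > 0)

# 5-fold cross-validation repeated 3 times, messages sunk to "my_debug.txt"
result <- xgb.ncv(data = data,
                  label = label,
                  nfolds = 5,
                  ntimes = 3,
                  seed = 11111,
                  sinkfile = "my_debug.txt")

# Per the Value section: fold scores, fold IDs, and out-of-fold predictions
result$scores
result$folds
result$preds
```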
