Laurae (version 0.0.0.9001)

xgb.ncv: xgboost repeated cross-validation (Repeated k-fold)

Description

This function runs a repeated cross-validation using xgboost, returns out-of-fold predictions, and can also produce predictions from each fold's model on external data. It currently only supports single-column predictions (binary classification and regression). Verbosity is automatic and cannot be disabled; if you need this function without verbosity, recompile the package after removing the verbose messages. In addition, a sink is forced. If the function's execution is interrupted prematurely (by you or by xgboost), make sure to run sink(); otherwise, no further messages will be printed to your R console.
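If you are unsure how many sinks were left open after an interruption, the following sketch (using only base R) closes them all and restores console output:

```r
# Close every open sink so messages print to the console again.
# sink.number() returns how many output diversions are currently active.
while (sink.number() > 0) {
  sink()
}
```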

Usage

xgb.ncv(data, label, extra_data = NA, out_of_fold = TRUE, nfolds = 5,
  ntimes = 3, nthread = 2, seed = 11111, verbose = 1,
  print_every_n = 1, sinkfile = "debug.txt", booster = "gbtree",
  eta = 0.3, max_depth = 6, min_child_weight = 1, gamma = 0,
  subsample = 1, colsample_bytree = 1, num_parallel_tree = 1,
  maximum_rounds = 1e+05, objective = "binary:logistic",
  eval_metric = "logloss", maximize = FALSE, early_stopping_rounds = 50)

Arguments

data
The data as a matrix or sparse matrix.
label
The label associated with the data.
extra_data
The external data you want to predict on using each fold's model.
out_of_fold
Should we predict out of fold? (this includes both data and extra_data). Defaults to TRUE.
nfolds
How many folds should we use for the validation? The more folds, the better, but computation time increases linearly (nfolds * ntimes). Defaults to 5.
ntimes
How many times should we repeat the cross-validation? The more repetitions, the more stable the results, but computation time increases linearly (ntimes * nfolds). Defaults to 3.
nthread
How many threads to run for xgboost? Defaults to 2.
seed
Which seed should we use globally for all commands dependent on a random seed? Defaults to 11111.
verbose
Should we print verbose data in xgboost? xgboost messages will be sunk to the sink file in any case. Defaults to 1.
print_every_n
Every how many iterations should we print verbose data? xgboost messages will be sunk to the sink file in any case. Defaults to 1.
sinkfile
What file name to give to the sink? This is where printed messages of xgboost will be stored. Defaults to "debug.txt".
booster
What xgboost booster to use? Defaults to "gbtree" and must not be changed (the function does NOT work with other boosters).
eta
The shrinkage (learning rate) in xgboost. The lower the better, but computation time increases exponentially as it gets lower. Defaults to 0.3.
max_depth
The maximum depth of each tree in xgboost. Defaults to 6.
min_child_weight
The minimum hessian weight needed in a child node. Defaults to 1.
gamma
The minimum loss reduction needed in a child node. Defaults to 0.
subsample
The sampling ratio of observations during each iteration. Use 0.632 to simulate Random Forests. Defaults to 1.
colsample_bytree
The sampling ratio of features during each iteration. Defaults to 1.
num_parallel_tree
How many trees to grow per iteration? A number higher than 1 simulates boosted Random Forests. Defaults to 1.
maximum_rounds
How many rounds until giving up boosting if not stopped early? Defaults to 100000.
objective
The objective function. Defaults to "binary:logistic".
eval_metric
The evaluation metric. Defaults to "logloss".
maximize
Should we maximize the evaluation metric? Defaults to FALSE.
early_stopping_rounds
How many rounds without improvement in the evaluation metric (according to the maximization rule) before stopping boosting early on a fold? Defaults to 50.

Value

A list with two to four elements: "scores" for the scored folds (data.frame), "folds" for the folds IDs (list), "preds" for out of fold predictions (data.frame), and "extra" for extra data predictions per fold (data.frame).

Examples

# Pick your xgb.cv function, replace data with the initial matrix, insert the label,
# set ntimes to the value you want, and change the sinkfile.
# Unlist params if needed, and add the seed as a parameter.
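Following the comments above, a minimal sketch of a call (assuming the Laurae package and xgboost are installed; the data here is simulated purely for illustration):

```r
library(Laurae)

# Simulated binary classification data (hypothetical, for illustration only)
set.seed(11111)
data <- matrix(rnorm(1000 * 10), nrow = 1000, ncol = 10)
label <- as.numeric(data[, 1] + rnorm(1000) > 0)

# 5-fold cross-validation repeated 3 times, messages sunk to "my_debug.txt"
result <- xgb.ncv(data = data,
                  label = label,
                  nfolds = 5,
                  ntimes = 3,
                  seed = 11111,
                  sinkfile = "my_debug.txt")

# Per the Value section: fold scores, fold IDs, and out-of-fold predictions
result$scores
result$folds
result$preds
```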
