The cross validation function of xgboost
Usage
xgb.cv(
params = list(),
data,
nrounds,
nfold,
label = NULL,
missing = NA,
prediction = FALSE,
showsd = TRUE,
metrics = list(),
obj = NULL,
feval = NULL,
stratified = TRUE,
folds = NULL,
train_folds = NULL,
verbose = TRUE,
print_every_n = 1L,
early_stopping_rounds = NULL,
maximize = NULL,
callbacks = list(),
...
)
Arguments
params
the list of parameters. The complete list of parameters is available in the online documentation. Below is a shorter summary:
objective
objective function; common ones are:
reg:squarederror
regression with squared loss
binary:logistic
logistic regression for classification
See xgb.train() for the complete list of objectives.
eta
step size of each boosting step
max_depth
maximum depth of the tree
nthread
number of threads used in training; if not set, all threads are used
See xgb.train for further details. See also demo/ for walkthrough examples in R.
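For example, a typical params list might look like the following (a minimal sketch; the values shown are illustrative, not package defaults):
params <- list(
  objective = "binary:logistic",  # binary classification with logistic loss
  eta = 0.3,                      # step size of each boosting step
  max_depth = 6,                  # maximum depth of each tree
  nthread = 2                     # number of threads used in training
)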
data
takes an xgb.DMatrix, matrix, or dgCMatrix as the input.
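For instance, both of the following calls are valid, reusing the params list sketched above (X and y are illustrative objects, not part of the package):
library(xgboost)
# illustrative data, not from the package
X <- matrix(rnorm(1000), nrow = 100)
y <- rbinom(100, 1, 0.5)
# dense R matrix: the response is passed via the label argument
cv_dense <- xgb.cv(params = params, data = X, label = y, nrounds = 5, nfold = 5)
# xgb.DMatrix: the label travels with the data object, so label is omitted
cv_dmat <- xgb.cv(params = params, data = xgb.DMatrix(X, label = y),
                  nrounds = 5, nfold = 5)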
nrounds
the max number of boosting iterations.
nfold
the original dataset is randomly partitioned into nfold equal-size subsamples.
label
vector of response values. Should be provided only when data is an R matrix.
missing
is only used when input is a dense matrix. By default it is set to NA, which means that NA values should be considered as 'missing' by the algorithm. Sometimes 0 or another extreme value might be used to represent missing values.
prediction
a logical value indicating whether to return the test fold predictions from each CV model. This parameter engages the cb.cv.predict callback.
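For example, setting it to TRUE makes the out-of-fold predictions available on the returned object (a sketch; dtrain as in the Examples section below):
cv <- xgb.cv(params = params, data = dtrain, nrounds = 5, nfold = 5,
             prediction = TRUE)
head(cv$pred)  # one out-of-fold prediction per training row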
showsd
boolean, whether to show the standard deviation of cross validation.
metrics
list of evaluation metrics to be used in cross validation; when not specified, the evaluation metric is chosen according to the objective function. Possible options are:
error
binary classification error rate
rmse
root mean square error
logloss
negative log-likelihood
mae
mean absolute error
mape
mean absolute percentage error
auc
area under the ROC curve
aucpr
area under the precision-recall curve
merror
exact matching error, used to evaluate multi-class classification
obj
customized objective function. Returns the gradient and second-order gradient with the given prediction and dtrain.
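For instance, logistic loss can be expressed with this pattern (a sketch, not an official implementation; dtrain as in the Examples section):
logregobj <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  preds <- 1 / (1 + exp(-preds))  # raw margin scores -> probabilities
  grad <- preds - labels          # first-order gradient of logistic loss
  hess <- preds * (1 - preds)     # second-order gradient
  list(grad = grad, hess = hess)
}
# cv <- xgb.cv(params = list(eta = 1, max_depth = 2), data = dtrain,
#              nrounds = 5, nfold = 5, obj = logregobj)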
feval
customized evaluation function. Returns list(metric='metric-name', value='metric-value') with the given prediction and dtrain.
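For example, a custom error metric on raw scores could be written like this (a sketch following the list(metric=, value=) contract above; "custom-error" is an illustrative name):
evalerror <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  err <- mean(as.numeric(preds > 0) != labels)  # error rate on raw scores
  list(metric = "custom-error", value = err)
}
# cv <- xgb.cv(params = params, data = dtrain, nrounds = 5, nfold = 5,
#              feval = evalerror, maximize = FALSE)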
stratified
a boolean indicating whether sampling of folds should be stratified by the values of outcome labels.
folds
a list of pre-defined CV folds (each element must be a vector of the test fold's indices). When folds are supplied, the nfold and stratified parameters are ignored.
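For instance, with 100 training rows, two pre-defined folds could be supplied like this (a sketch; my_folds is an illustrative name):
# each element is the vector of test-set row indices for one fold
my_folds <- list(1:50, 51:100)
# cv <- xgb.cv(params = params, data = dtrain, nrounds = 5, folds = my_folds)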
train_folds
a list specifying which indices to use for training. If NULL (the default), all indices not specified in folds will be used for training.
verbose
boolean, whether to print statistics during the process.
print_every_n
print evaluation messages for every n-th iteration when verbose > 0. Default is 1, which means all messages are printed. This parameter is passed to the cb.print.evaluation callback.
early_stopping_rounds
if NULL, the early stopping function is not triggered. If set to an integer k, training with a validation set will stop if the performance doesn't improve for k rounds. Setting this parameter engages the cb.early.stop callback.
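For example, the following sketch stops cross validation once the test metric has failed to improve for 5 consecutive rounds (params and dtrain as in the sketches above):
cv <- xgb.cv(params = params, data = dtrain, nrounds = 100, nfold = 5,
             early_stopping_rounds = 5)
cv$best_iteration  # iteration with the best evaluation metric value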
maximize
if feval and early_stopping_rounds are set, then this parameter must be set as well. When it is TRUE, it means the larger the evaluation score the better. This parameter is passed to the cb.early.stop callback.
callbacks
a list of callback functions to perform various tasks during boosting. See callbacks. Some of the callbacks are automatically created depending on the parameters' values. Users can provide either existing callbacks or their own in order to customize the training process.
...
other parameters to pass to params.
Value
An object of class xgb.cv.synchronous with the following elements:
call
the function call.
params
parameters that were passed to the xgboost library. Note that it does not capture parameters changed by the cb.reset.parameters callback.
callbacks
callback functions that were either automatically assigned or explicitly passed.
evaluation_log
evaluation history stored as a data.table with the first column corresponding to the iteration number and the rest corresponding to the CV-based evaluation means and standard deviations for the training and test CV-sets. It is created by the cb.evaluation.log callback.
niter
number of boosting iterations.
nfeatures
number of features in the training data.
folds
the list of CV folds' indices - either those passed through the folds parameter or randomly generated.
best_iteration
iteration number with the best evaluation metric value (only available with early stopping).
best_ntreelimit
the ntreelimit value corresponding to the best iteration, which could further be used in the predict method (only available with early stopping).
pred
CV prediction values, available when prediction is set. It is either a vector or a matrix (see cb.cv.predict).
models
a list of the CV folds' models. It is only available with the explicit setting of the cb.cv.predict(save_models = TRUE) callback.
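As an illustration, after a run such as the one in the Examples section, the elements above can be inspected directly (a sketch):
cv$evaluation_log  # per-iteration CV means and standard deviations
cv$niter           # number of boosting iterations performed
cv$folds           # the test-fold indices that were used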
Details
The original sample is randomly partitioned into nfold equal-size subsamples.
Of the nfold subsamples, a single subsample is retained as the validation data for testing the model, and the remaining nfold - 1 subsamples are used as training data.
The cross-validation process is then repeated nfold times, with each of the nfold subsamples used exactly once as the validation data. All observations are used for both training and validation.
Adapted from https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29
Examples
data(agaricus.train, package='xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
cv <- xgb.cv(data = dtrain, nrounds = 3, nthread = 2, nfold = 5, metrics = list("rmse","auc"),
max_depth = 3, eta = 1, objective = "binary:logistic")
print(cv)
print(cv, verbose=TRUE)