trainControl: control the computational nuances of the train function
trainControl(method = "boot",
             number = ifelse(grepl("cv", method), 10, 25),
             repeats = ifelse(grepl("cv", method), 1, number),
             p = 0.75, search = "grid",
             initialWindow = NULL, horizon = 1, fixedWindow = TRUE, skip = 0,
             verboseIter = FALSE, returnData = TRUE, returnResamp = "final",
             savePredictions = FALSE, classProbs = FALSE,
             summaryFunction = defaultSummary, selectionFunction = "best",
             preProcOptions = list(thresh = 0.95, ICAcomp = 3, k = 5,
                                   freqCut = 95/5, uniqueCut = 10, cutoff = 0.9),
             sampling = NULL, index = NULL, indexOut = NULL, indexFinal = NULL,
             timingSamps = 0, predictionBounds = rep(FALSE, 2), seeds = NA,
             adaptive = list(min = 5, alpha = 0.05, method = "gls", complete = TRUE),
             trim = FALSE, allowParallel = TRUE)
method: the resampling method: "boot", "boot632", "cv", "repeatedcv", "LOOCV", "LGOCV" (for repeated training/test splits), "none" (only fits one model to the entire training set), "oob" (only for random forest, bagged trees, bagged earth, bagged flexible discriminant analysis, or conditional tree forest models), "adaptive_cv", "adaptive_boot"
or "adaptive_LGOCV".
search: either "grid" or "random", describing how the tuning parameter grid is determined. See details below.
initialWindow, horizon, fixedWindow, skip: see createTimeSlices.
returnResamp: "final", "all" or "none".
savePredictions: "all", "final", or "none". A logical value can also be used; it is converted to "all" (for TRUE) or "none" (for FALSE). "final" saves the predictions for the optimal tuning parameters.
summaryFunction: see defaultSummary.
selectionFunction: see best for details and other options.
preProcOptions: a list of options to pass to preProcess.
The type of pre-processing (e.g. centering, scaling, etc.) is passed in via the preProc option in train.
sampling: "none", "down", "up", "smote", or "rose". The latter two values require the DMwR and ROSE packages, respectively. This argument can also be a list to facilitate custom sampling; the details can be found on the caret package website for sampling (link below).
indexOut: a list (the same length as index) that dictates which data are held out for each resample (as integers). If NULL, then the unique set of samples not contained in index is used.
indexFinal: if NULL, then the entire data set is used.
predictionBounds: a value of c(TRUE, FALSE)
would only constrain the lower end of predictions. If numeric, specific bounds can be used. For example, with c(10, NA), values below 10 would be predicted as 10 (with no constraint on the upper side).
seeds: a value of NA
will stop the seed from being set within the worker processes, while a value of NULL will set the seeds using a random set of integers. Alternatively, a list can be used. The list should have B+1 elements, where B is the number of resamples, unless method is "boot632", in which case B is the number of resamples plus 1. The first B elements of the list should be vectors of integers of length M, where M is the number of models being evaluated. The last element of the list only needs to be a single integer (for the final model). See the Examples section below and the Details section.
adaptive: a list used when method
is "adaptive_cv", "adaptive_boot" or "adaptive_LGOCV". See Details below.
trim: if TRUE,
the final model in object$finalModel may have some components of the object removed to reduce the size of the saved object. The predict method will still work, but some other features of the model may not work. Trimming will occur only for models where this feature has been implemented.
When setting the seeds manually, the number of models being evaluated is required. This may not be obvious, as train
does some optimizations for certain models. For example, when tuning over a PLS model, the only model that is fit is the one with the largest number of components. So if the model is being tuned over ncomp in 1:10, the only model fit is ncomp = 10. However, if the vector of integers used in the seeds argument is longer than actually needed, no error is thrown.
Using method = "none"
and specifying more than one model in train's tuneGrid or tuneLength arguments will result in an error.
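The restriction on method = "none" can be illustrated with a minimal sketch. It assumes the caret package is installed and uses the built-in iris data; the choice of a KNN model with k = 5 is arbitrary and only there to give a one-row tuneGrid.

```r
library(caret)

# method = "none": no resampling is done, so exactly one candidate
# model may be specified.
ctrl_none <- trainControl(method = "none")

# A one-row tuneGrid pins the single value of k that will be fit.
# (k = 5 is an arbitrary illustrative choice.)
fit_none <- train(Species ~ ., data = iris,
                  method = "knn",
                  tuneGrid = data.frame(k = 5),
                  trControl = ctrl_none)

fit_none$bestTune
```

Supplying a tuneGrid with several rows, or a tuneLength greater than 1, with this control object is the case described above as an error.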
When using adaptive resampling (method is either "adaptive_cv", "adaptive_boot" or "adaptive_LGOCV"), the full set of resamples is not run for each model. As resampling continues, a futility analysis is conducted and models with a low probability of being optimal are removed. These features are experimental. See Kuhn (2014) for more details. The options for this procedure are:
min: the minimum number of resamples used before models are removed
alpha: the confidence level of the one-sided intervals used to measure futility
method: either generalized least squares (method = "gls") or a Bradley-Terry model (method = "BT")
complete: if a single parameter value is found before the end of resampling, should the full set of resamples be computed for that parameter?
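As a rough sketch of how these options are passed, the control object below sets all four elements of the adaptive list explicitly. The values shown are simply the defaults from the usage above; the number and repeats settings are illustrative assumptions.

```r
library(caret)

# Adaptive 10-fold CV repeated 5 times. The adaptive list spells out
# the documented defaults; change them to tune the futility analysis.
ctrl_adapt <- trainControl(
  method   = "adaptive_cv",
  number   = 10,
  repeats  = 5,
  adaptive = list(min = 5,          # resamples run before any model is dropped
                  alpha = 0.05,     # confidence level of the one-sided intervals
                  method = "gls",   # or "BT" for a Bradley-Terry model
                  complete = TRUE)  # finish all resamples for a lone survivor
)

str(ctrl_adapt$adaptive)
```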
The option search = "grid" uses the default grid search routine. When search = "random", a random search procedure is used (Bergstra and Bengio, 2012). See http://topepo.github.io/caret/random.html for details and an example.
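A small sketch of random search, assuming caret is installed: with search = "random", the tuneLength argument of train is the number of random candidate values to draw rather than the size of a regular grid. The model, data set, and seed are illustrative choices.

```r
library(caret)

set.seed(42)
ctrl_random <- trainControl(method = "cv", number = 5, search = "random")

# Up to 8 randomly drawn values of k are evaluated (fewer if the same
# value happens to be drawn more than once).
mod_random <- train(Species ~ ., data = iris,
                    method = "knn",
                    tuneLength = 8,
                    trControl = ctrl_random)

mod_random$results[, c("k", "Accuracy")]
```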
The "boot632"
method uses the 0.632 estimator presented in Efron
(1983), not to be confused with the 0.632+ estimator proposed later by the
same author.
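For orientation, the 0.632 estimator weights the apparent (resubstitution) error by 0.368 and the out-of-bag bootstrap error by 0.632. The base R sketch below shows only that arithmetic; the function name boot632_estimate and the input values are hypothetical, and this is not caret's internal code.

```r
# Hypothetical helper: combine an apparent error rate and an
# out-of-bag bootstrap error rate with the fixed 0.632 weights.
boot632_estimate <- function(apparent, oob) {
  0.368 * apparent + 0.632 * oob
}

# Made-up error rates, for illustration only.
boot632_estimate(apparent = 0.10, oob = 0.20)
```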
Bergstra and Bengio (2012), "Random Search for Hyper-Parameter Optimization", Journal of Machine Learning Research, 13(Feb):281-305
Kuhn (2014), "Futility Analysis in the Cross-Validation of Machine Learning Models", http://arxiv.org/abs/1405.6974
Package website for subsampling: http://topepo.github.io/caret/sampling.html
## Not run:
#
# ## Do 5 repeats of 10-fold CV for the iris data. We will fit
# ## a KNN model that evaluates 12 values of k and set the seed
# ## at each iteration.
#
# set.seed(123)
# seeds <- vector(mode = "list", length = 51)
# ## 50 resamples (5 repeats x 10 folds). Each gets 22 seeds, which is
# ## more than the 12 needed; a longer vector does not cause an error.
# for(i in 1:50) seeds[[i]] <- sample.int(1000, 22)
#
# ## For the last model:
# seeds[[51]] <- sample.int(1000, 1)
#
# ctrl <- trainControl(method = "repeatedcv",
#                      repeats = 5,
#                      seeds = seeds)
#
# set.seed(1)
# mod <- train(Species ~ ., data = iris,
#              method = "knn",
#              tuneLength = 12,
#              trControl = ctrl)
#
#
# ctrl2 <- trainControl(method = "adaptive_cv",
#                       repeats = 5,
#                       verboseIter = TRUE,
#                       seeds = seeds)
#
# set.seed(1)
# mod2 <- train(Species ~ ., data = iris,
#               method = "knn",
#               tuneLength = 12,
#               trControl = ctrl2)
#
# ## End(Not run)
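The seeds layout used in the example (B vectors of M integers plus one final integer) can be wrapped in a small base R helper. This is a sketch only: make_seeds and its arguments are hypothetical names, not part of caret, and the upper bound of 1000 mirrors the example above.

```r
# Hypothetical helper: build a seeds list with n_resamples vectors of
# n_models integers, followed by one integer for the final model.
make_seeds <- function(n_resamples, n_models, seed = 123) {
  set.seed(seed)
  seeds <- vector(mode = "list", length = n_resamples + 1)
  for (i in seq_len(n_resamples)) {
    seeds[[i]] <- sample.int(1000, n_models)
  }
  seeds[[n_resamples + 1]] <- sample.int(1000, 1)
  seeds
}

# 5 repeats of 10-fold CV give 50 resamples; 12 candidate models.
seeds <- make_seeds(n_resamples = 50, n_models = 12)
length(seeds)
```

The resulting list can be passed as the seeds argument of trainControl in the same way as the hand-built list in the example.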