rfeControl(functions = NULL, rerank = FALSE, method = "boot",
           saveDetails = FALSE,
           number = ifelse(method %in% c("cv", "repeatedcv"), 10, 25),
           repeats = ifelse(method %in% c("cv", "repeatedcv"), 1, number),
           verbose = FALSE, returnResamp = "final", p = .75, index = NULL,
           indexOut = NULL, timingSamps = 0, seeds = NA,
           allowParallel = TRUE)
method: the external resampling method: boot, cv, LOOCV or LGOCV (for repeated training/test splits).

indexOut: a list (the same length as index) that dictates which samples are held out for each resample. If NULL, then the unique set of samples not contained in index is used.

seeds: an optional set of integers used to set the seed at each resampling iteration. A value of NA will stop the seed from being set within the worker processes, while a value of NULL will set the seeds using a random set of integers. Alternatively, a list can be used. The list should have B+1 elements, where B is the number of resamples. The first B elements of the list should be vectors of integers of length P, where P is the number of subsets being evaluated (including the full set). The last element of the list only needs to be a single integer (for the final model). See the Examples section below.

Backwards selection requires functions to be specified for some operations.
The fit function builds the model based on the current data set. The arguments for the function must be:

x: the current training set of predictor data with the appropriate subset of variables
y: the current outcome data (either a numeric or factor vector)
first: a single logical value for whether the current predictor set has all possible variables
last: similar to first, but TRUE when the last model is fit with the final subset size and predictors
...: optional arguments to pass to the fit function in the call to rfe

The function should return a model object that can be used to generate predictions.
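As an illustration, a fit function for a simple linear regression model could look like the sketch below. This is a hypothetical exampleFit written for this page, in the spirit of (but not identical to) lmFuncs$fit:

## Illustrative sketch only: fit a linear model to the current subset of
## predictors. The first, last and ... arguments are accepted but not used.
exampleFit <- function(x, y, first, last, ...) {
  dat <- if (is.data.frame(x)) x else as.data.frame(x)
  dat$.outcome <- y
  lm(.outcome ~ ., data = dat)
}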
The pred function returns a vector of predictions (numeric or factors) from the current model. The arguments are:

object: the model generated by the fit function
x: the current set of predictors for the held-back samples
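Continuing the hypothetical example above, a matching pred function could simply wrap predict():

## Illustrative sketch only: predictions for the held-back samples from
## the model returned by exampleFit above.
examplePred <- function(object, x) {
  predict(object, newdata = as.data.frame(x))
}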
The rank function is used to return the predictors in the order of the most important to the least important. Inputs are:

object: the model generated by the fit function
x: the current set of predictors for the training samples
y: the current training outcomes

The function should return a data frame with a column called var that has the current variable names. The first row should be the most important predictor, etc. Other columns can be included in the output and will be returned in the final rfe object.
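For the hypothetical linear-model example, one possible ranking (again a sketch, not caret's lmFuncs$rank) orders the predictors by the absolute value of their t statistics:

## Illustrative sketch only: rank predictors of a linear model by the
## absolute value of their t statistics (assumes numeric predictors with
## syntactic column names, so coefficient names match the columns of x).
exampleRank <- function(object, x, y) {
  tstats <- abs(coef(summary(object))[, "t value"])
  tstats <- tstats[names(tstats) != "(Intercept)"]
  out <- data.frame(Overall = unname(tstats), var = names(tstats))
  out[order(out$Overall, decreasing = TRUE), ]
}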
The selectSize function determines the optimal number of predictors based on the resampling output. Inputs for the function are:

x: a matrix with columns for the performance metrics and the number of variables, called "Variables"
metric: a character string of the performance measure to optimize (e.g. "RMSE", "Rsquared", "Accuracy" or "Kappa")
maximize: a single logical for whether the metric should be maximized

This function should return an integer corresponding to the optimal subset size. caret comes with two example functions for this purpose: pickSizeBest and pickSizeTolerance.
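A stripped-down sketch of such a selection rule is shown below (a hypothetical exampleSize that simply takes the best resampled value and, on ties, the smaller subset; see pickSizeBest and pickSizeTolerance for the functions shipped with caret):

## Illustrative sketch only: choose the subset size with the best
## resampled performance, preferring the smallest subset on ties.
exampleSize <- function(x, metric, maximize) {
  best <- if (maximize) max(x[, metric]) else min(x[, metric])
  min(x[x[, metric] == best, "Variables"])
}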
After the optimal subset size is determined, the selectVar function will be used to calculate the best rankings for each variable across all the resampling iterations. Inputs for the function are:

y: a list of variable importances for each resampling iteration and each subset size (generated by the user-defined rank function). In the example, for each of the cross-validation groups the output of the rank function is saved for each of the subset sizes (including the original subset). If the rankings are not recomputed at each iteration, the values will be the same within each cross-validation iteration.
size: the integer returned by the selectSize function

This function should return a character vector of predictor names (of length size) in the order of most important to least important.
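As a final sketch, a hypothetical exampleSelectVar is shown below. It assumes that each element of y is a data frame produced by the rank function, with rows ordered from most to least important; the exact structure of y is as described above:

## Illustrative sketch only: average each variable's rank position across
## all of the saved rankings, then keep the 'size' best names.
exampleSelectVar <- function(y, size) {
  positions <- do.call(rbind,
                       lapply(y, function(d)
                         data.frame(var = d$var, pos = seq_len(nrow(d)))))
  avgRank <- tapply(positions$pos, positions$var, mean)
  names(sort(avgRank))[seq_len(size)]
}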
Examples of these functions are included in the package: lmFuncs, rfFuncs, treebagFuncs and nbFuncs.
Model details about these functions, including examples, are at http://topepo.github.io/caret/featureselection.html.
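Putting the hypothetical pieces above together (a sketch only; in practice the prepackaged lists such as lmFuncs or rfFuncs are complete and tested, and also supply a summary element such as caret's defaultSummary, which is assumed here):

## Illustrative sketch only: bundle the example functions into a list
## and pass it to rfeControl().
customFuncs <- list(summary = defaultSummary,
                    fit = exampleFit,
                    pred = examplePred,
                    rank = exampleRank,
                    selectSize = exampleSize,
                    selectVar = exampleSelectVar)
ctrl <- rfeControl(functions = customFuncs, method = "cv", number = 10)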
See also: rfe, lmFuncs, rfFuncs, treebagFuncs, nbFuncs, pickSizeBest, pickSizeTolerance
## Not run:
# ## bbbDescr and logBBB come with caret's BloodBrain data
# data(BloodBrain)
#
# subsetSizes <- c(2, 4, 6, 8)
# set.seed(123)
# seeds <- vector(mode = "list", length = 51)
# for(i in 1:50) seeds[[i]] <- sample.int(1000, length(subsetSizes) + 1)
# seeds[[51]] <- sample.int(1000, 1)
#
# set.seed(1)
# rfMod <- rfe(bbbDescr, logBBB,
#              sizes = subsetSizes,
#              rfeControl = rfeControl(functions = rfFuncs,
#                                      seeds = seeds,
#                                      number = 50))
## End(Not run)