fifer (version 1.1)

rfThresh: Variable Selection Using Random Forests

Description

Using a set of predictors, this function selects the best ones in a stepwise fashion via random forests. The procedure and algorithm are borrowed heavily from the VSURF package, with modifications that allow unbiased computation of variable importance via the cforest function in the party package.

Usage

rfThresh(formula, data, nruns = 50, silent = FALSE,
  importance = "permutation", nmin = 1, ...)

Arguments

formula
a formula, such as y ~ x1 + x2, where y is the response variable and the terms following ~ are the predictors.
data
the dataset containing the predictors and response.
nruns
How many times should random forests be run to compute variable importance? Defaults to 50.
silent
If TRUE, progress messages are suppressed. Defaults to FALSE.
importance
Either "permutation" or "gini". Defaults to "permutation".
nmin
Multiplier applied to the minimum predicted value (see Details) to set the threshold. Defaults to 1.
...
other arguments passed to cforest or randomForest.
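A hedged usage sketch (the formula and dataset are illustrative; iris ships with base R, while rfThresh comes from fifer):

```r
library(fifer)

# Rank the four iris measurements as predictors of Species,
# averaging permutation importance over 50 forests.
fit <- rfThresh(Species ~ Sepal.Length + Sepal.Width +
                  Petal.Length + Petal.Width,
                data = iris, nruns = 50,
                importance = "permutation")
```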

Value

The object returned has the following attributes:
variable.importance
A sorted vector of the variable importance measure for each predictor.
importance.sd
the standard deviation of variable importance, measured across the nruns iterations.
stepwise.error
The OOB error after each variable is added to the model.
response
The response variable that was modeled.
variables
A vector of strings that indicate which variables were included in the initial model.
nruns
How many times the random forest was initially run.
formula
the formula used for the last model.
data
the dataset used to fit the model.
oob
the oob error of the entire model.
time
how long the algorithm took to run.
rfmodel
The final model used, a randomForest object.
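Assuming a fitted object from a call like the one in Usage, the components above can be inspected directly (a sketch only; list-style $ access is assumed, and the fit itself requires the fifer package):

```r
library(fifer)

# Illustrative fit on the built-in iris data.
fit <- rfThresh(Species ~ ., data = iris, nruns = 50)

fit$variable.importance  # sorted mean importance per predictor
fit$stepwise.error       # OOB error as each variable is added
fit$oob                  # OOB error of the full model
fit$rfmodel              # the underlying randomForest object
```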

Details

What follows is the documentation for the original algorithm in VSURF: a three-step variable selection procedure based on random forests for supervised classification and regression problems. The first step ("thresholding step") eliminates irrelevant variables from the dataset. The second step ("interpretation step") selects all variables related to the response, for interpretation purposes. The third step ("prediction step") refines the selection by eliminating redundancy in the set of variables selected by the second step, for prediction purposes.
  • First step ("thresholding step"): first, nfor.thres random forests are computed using the function randomForest with argument importance=TRUE. Then variables are sorted according to their mean variable importance (VI), in decreasing order. This order is kept throughout the procedure. Next, a threshold is computed: min.thres, the minimum predicted value of a pruned CART tree fitted to the curve of the standard deviations of VI. Finally, the actual "thresholding step" is performed: only variables with a mean VI larger than nmin * min.thres are kept.
  • Second step ("interpretation step"): the variables selected by the first step are considered. nfor.interp embedded random forest models are grown, starting with the random forest built with only the most important variable and ending with all variables selected in the first step. Then err.min, the minimum mean out-of-bag (OOB) error of these models, and its associated standard deviation sd.min are computed. Finally, the smallest model (and hence its corresponding variables) having a mean OOB error less than err.min + nsd * sd.min is selected.
  • Third step ("prediction step"): the starting point is the same as in the second step. However, now the variables are added to the model in a stepwise manner. mean.jump, the mean jump value, is calculated using the variables left out by the second step, and is set to the mean absolute difference between the mean OOB error of one model and that of its first following model. A variable is then included in the model only if the decrease in mean OOB error is larger than nmj * mean.jump.
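The thresholding rule in the first step can be sketched as follows. This is a conceptual illustration, not the package's implementation: vi is a simulated nruns x p matrix of importance scores, and rpart stands in for the pruned CART fit.

```r
library(rpart)

set.seed(1)
nruns <- 50
vi <- matrix(abs(rnorm(nruns * 10)), nrow = nruns)  # simulated VI scores
colnames(vi) <- paste0("x", 1:10)

# Sort variables by mean importance; this order is kept throughout.
mean.vi <- sort(colMeans(vi), decreasing = TRUE)
sd.vi   <- apply(vi, 2, sd)[names(mean.vi)]

# Fit a CART tree to the curve of importance standard deviations;
# the threshold is the minimum predicted value of that tree.
rank <- seq_along(sd.vi)
tree <- rpart(sd.vi ~ rank)
min.thres <- min(predict(tree))

# Keep only variables whose mean VI exceeds nmin * min.thres.
nmin <- 1
kept <- names(mean.vi)[mean.vi > nmin * min.thres]
```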

References

Genuer, R., Poggi, J.M., and Tuleau-Malot, C. (2010). Variable selection using random forests. Pattern Recognition Letters, 31(14), 2225-2236.

Strobl, C., Boulesteix, A.-L., Zeileis, A., and Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8(1), 1-21. doi: 10.1186/1471-2105-8-25. URL http://dx.doi.org/10.1186/1471-2105-8-25.

See Also

rfInterp, rfPred