rfThresh: Variable Selection Using Random Forests

Description

Given a set of predictors, this function uses random forests to select the most important ones in a stepwise fashion. The procedure and the algorithm are adapted from the VSURF package, with modifications that allow unbiased computation of variable importance via the cforest function in the party package.

Usage

rfThresh(formula, data, nruns = 50, silent = FALSE,
  importance = "permutation", nmin = 1, ...)

Arguments

formula

a formula, such as y ~ x1 + x2, where y is the response variable and the terms following ~ are the predictors.

data

the dataset containing the predictors and response.

nruns

How many times should random forests be run to compute variable importance? Defaults to 50.

silent

Logical. Should progress messages be suppressed? Defaults to FALSE.

importance

The type of variable importance to compute: either "permutation" or "gini". Defaults to "permutation".

nmin

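Multiplier applied to the "minimum value" (min.thres in Details) to set the threshold: only variables with a mean importance larger than nmin * min.thres are kept. Defaults to 1.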

...

other arguments passed to cforest or randomForest
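
As an illustration of how these arguments fit together, a call might look like the sketch below. The simulated data frame sim and its column names are invented for this example, and the call is guarded so that it only runs when rfThresh is already available in the session (the package that provides it must be loaded first).

```r
# Simulated data (hypothetical): y depends on x1 and x2 only;
# x3, noise1 and noise2 carry no signal.
set.seed(42)
n <- 200
sim <- data.frame(
  x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
  noise1 = rnorm(n), noise2 = rnorm(n)
)
sim$y <- 2 * sim$x1 - sim$x2 + rnorm(n, sd = 0.5)

# Only attempt the call if rfThresh has been loaded into the session.
if (exists("rfThresh", mode = "function")) {
  fit <- rfThresh(y ~ x1 + x2 + x3 + noise1 + noise2, data = sim,
                  nruns = 20, importance = "permutation", nmin = 1)
  print(names(fit))  # lists the components described under Value
}
```

A smaller nruns shortens the run at the cost of noisier importance estimates.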

Value

The object returned has the following attributes:

variable.importance

A sorted vector of the importance measure for each variable.

importance.sd

the standard deviation of variable importance, measured across the nruns iterations.

stepwise.error

The OOB error after each variable is added to the model

response

The response variable that was modeled.

variables

A vector of strings that indicate which variables were included in the initial model.

nruns

How many times the random forest was initially run.

formula

the formula used for the last model.

data

the dataset used to fit the model.

oob

the OOB error of the entire model.

time

how long the algorithm ran for

rfmodel

The final model used, a randomForest object.

Details

What follows is the documentation for the original algorithm in VSURF: a three-step variable selection procedure based on random forests for supervised classification and regression problems. The first step ("thresholding step") eliminates irrelevant variables from the dataset. The second step ("interpretation step") selects all variables related to the response, for interpretation purposes. The third step ("prediction step") refines the selection by eliminating redundancy in the set of variables selected by the second step, for prediction purposes.

First step ("thresholding step"): first, nfor.thres random forests are grown using the function randomForest with the argument importance=TRUE. The variables are then sorted by their mean variable importance (VI), in decreasing order. This order is kept throughout the procedure. Next, a threshold is computed: min.thres, the minimum predicted value of a pruned CART tree fitted to the curve of the standard deviations of VI. Finally, the actual thresholding is performed: only variables with a mean VI larger than nmin * min.thres are kept.
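
The thresholding rule can be sketched in a few lines of R. This is an illustration under stated assumptions, not the package's own code: the importance matrix vi is simulated rather than produced by random forests, the tree is pruned at the complexity value with the lowest cross-validated error (one plausible pruning rule, assumed here), and all object names are invented for the example.

```r
library(rpart)  # recommended package shipped with standard R installations

# Simulated VI values (assumption): one row per run, one column per variable.
# x1..x3 are informative; the remaining variables are noise.
set.seed(1)
nruns <- 50
p <- 20
true_mean <- c(5, 3, 2, rep(0, p - 3))
true_sd <- c(0.5, 0.4, 0.3, rep(0.05, p - 3))
vi <- sapply(seq_len(p), function(j) rnorm(nruns, true_mean[j], true_sd[j]))
colnames(vi) <- paste0("x", seq_len(p))

# Sort variables by mean VI (decreasing) and keep the sds in the same order.
mean_vi <- colMeans(vi)
ord <- order(mean_vi, decreasing = TRUE)
mean_vi <- mean_vi[ord]
sd_vi <- apply(vi, 2, sd)[ord]

# Fit a CART tree to the curve of sds against rank, then prune it.
rank_df <- data.frame(rank = seq_along(sd_vi), sd_vi = sd_vi)
tree <- rpart(sd_vi ~ rank, data = rank_df,
              control = rpart.control(cp = 0, minsplit = 2, xval = 10))
cp_table <- tree$cptable
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]  # assumed pruning rule
pruned <- prune(tree, cp = best_cp)

# min.thres is the smallest fitted value of the pruned tree.
min_thres <- min(predict(pruned))

# Keep variables whose mean VI exceeds nmin * min.thres.
nmin <- 1
kept <- names(mean_vi)[mean_vi > nmin * min_thres]
print(kept)
```

Raising nmin makes the threshold stricter, so fewer variables survive this step.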
Second step ("interpretation step"): the variables selected by the first step are considered. nfor.interp embedded random forest models are grown, starting with the random forest built with only the most important variable and ending with all variables selected in the first step. Then err.min, the minimum mean out-of-bag (OOB) error of these models, and its associated standard deviation sd.min are computed. Finally, the smallest model (and hence its corresponding variables) having a mean OOB error less than err.min + nsd * sd.min is selected.
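
The selection rule of this step reduces to a short computation once the OOB errors of the nested models are available. The sketch below assumes a simulated matrix oob (one row per run, one column per nested model) in place of real forests; the object names are invented for the example.

```r
# Simulated OOB errors (assumption): column k holds the errors of the model
# built on the k most important variables, over nfor_interp runs.
set.seed(2)
nfor_interp <- 25
true_err <- c(0.40, 0.25, 0.20, 0.17, 0.17, 0.17)
oob <- sapply(true_err, function(e) rnorm(nfor_interp, mean = e, sd = 0.005))

mean_err <- colMeans(oob)
sd_err <- apply(oob, 2, sd)

# err.min and its associated standard deviation sd.min.
best <- which.min(mean_err)
err_min <- mean_err[best]
sd_min <- sd_err[best]

# Smallest nested model whose mean error is within nsd * sd.min of err.min.
# "<=" is used so that the minimum-error model always qualifies.
nsd <- 1
n_selected <- min(which(mean_err <= err_min + nsd * sd_min))
print(n_selected)
```

A larger nsd widens the tolerance band around err.min and so favours smaller models.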
Third step ("prediction step"): the starting point is the same as in the second step. However, the variables are now added to the model in a stepwise manner. mean.jump, the mean jump value, is calculated using the variables that were left out by the second step, and is set as the mean absolute difference between the mean OOB errors of one model and of the model that immediately follows it. A variable is then included in the model only if it decreases the mean OOB error by more than nmj * mean.jump.
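
The stepwise rule can be illustrated with a toy error function standing in for the mean OOB error of a forest. Everything in this sketch is hypothetical: toy_error, the variable names, and the error values of the left-out models are invented so that the example is deterministic.

```r
# Hypothetical stand-in for "mean OOB error of a forest grown on `vars`".
toy_error <- function(vars) {
  gain <- c(x1 = 0.20, x2 = 0.10, x3 = 0.10, x4 = 0.05)
  # x3 duplicates the information in x2, so it adds nothing once x2 is in.
  if (all(c("x2", "x3") %in% vars)) vars <- setdiff(vars, "x3")
  0.50 - sum(gain[vars])
}

# Variables retained by the second step, in decreasing order of importance.
interp_vars <- c("x1", "x2", "x3", "x4")

# Mean OOB errors of the nested models built from the variables the second
# step left out (invented values). mean.jump is the mean absolute difference
# between consecutive errors.
left_out_err <- c(0.150, 0.148, 0.151, 0.149)
mean_jump <- mean(abs(diff(left_out_err)))

# Stepwise inclusion: keep a variable only if it lowers the error by more
# than nmj * mean.jump.
nmj <- 1
selected <- interp_vars[1]
current_err <- toy_error(selected)
for (v in interp_vars[-1]) {
  candidate_err <- toy_error(c(selected, v))
  if (current_err - candidate_err > nmj * mean_jump) {
    selected <- c(selected, v)
    current_err <- candidate_err
  }
}
print(selected)
```

In this toy setting x3 is rejected because it is redundant with x2, which is the kind of redundancy the prediction step is meant to remove.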

References