varImp: Calculation of variable importance for regression and classification models

Description

A generic method for calculating variable importance for objects produced by train, along with model-specific methods for individual model classes.

Usage

## S3 method for class 'train':
varImp(object, useModel = TRUE, nonpara = TRUE, scale = TRUE, ...)
## S3 method for class 'earth':
varImp(object, value = "grsq", ...)
## S3 method for class 'rpart':
varImp(object, ...)
## S3 method for class 'randomForest':
varImp(object, ...)
## S3 method for class 'gbm':
varImp(object, numTrees, ...)
## S3 method for class 'classbagg':
varImp(object, ...)
## S3 method for class 'regbagg':
varImp(object, ...)
## S3 method for class 'pamrtrained':
varImp(object, threshold, data, ...)
## S3 method for class 'lm':
varImp(object, ...)
## S3 method for class 'mvr':
varImp(object, ...)
## S3 method for class 'bagEarth':
varImp(object, ...)
## S3 method for class 'RandomForest':
varImp(object, normalize = TRUE, ...)

Arguments

object

an object corresponding to a fitted model

useModel

use a model-based technique for measuring variable importance? This is only used for some models (lm, pls, rf, rpart, gbm, pam and mars).

nonpara

should nonparametric methods be used to assess the relationship between the features and the response? Only used with useModel = FALSE, and only passed to filterVarImp.

scale

should the importance values be scaled to lie between 0 and 100?

...

parameters to pass to the specific varImp methods

numTrees

the number of iterations (trees) to use in a boosted tree model

threshold

the shrinkage threshold (pamr models only)

data

the training set predictors (pamr models only)

value

the statistic that will be used to calculate importance: either grsq, rsq, rss or gcv

normalize

a logical; should the OOB mean importance values be divided by their standard deviations?

Value

A data frame with class c("varImp.train", "data.frame") for varImp.train or a matrix for other models.
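
The effect of scale = TRUE can be pictured with a small sketch. It assumes a simple min-max rescaling, which is only one way to map values onto the 0 to 100 range; the exact computation used by varImp.train is not stated on this page. The raw scores and predictor names below are hypothetical.

```r
# Hypothetical raw importance scores for three predictors (illustration only).
raw <- c(wt = 4.2, hp = 1.9, qsec = 0.6)

# Assumed min-max rescaling: the smallest score maps to 0, the largest to 100.
scaled <- (raw - min(raw)) / (max(raw) - min(raw)) * 100

# Show the rescaled scores, rounded for readability.
round(scaled, 1)
```

Under this assumption the ordering of the predictors is unchanged; only the units differ.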

Details

For models that do not have corresponding varImp methods, see filterVarImp.

Otherwise:

Linear Models: the absolute value of the t-statistic for each model parameter is used.

Random Forest: varImp.randomForest and varImp.RandomForest are wrappers around the importance functions from the randomForest and party packages, respectively.

Partial Least Squares: the variable importance measure here is based on weighted sums of the absolute regression coefficients. The weights are a function of the reduction of the sums of squares across the number of PLS components and are computed separately for each outcome. Therefore, the contribution of the coefficients is weighted proportionally to the reduction in the sums of squares.

Recursive Partitioning: the reduction in the loss function (e.g. mean squared error) attributed to each variable at each split is tabulated and the sum is returned. Also, since there may be candidate variables that are important but are not used in a split, the top competing variables are also tabulated at each split. This can be turned off using the maxcompete argument in rpart.control. This method does not currently provide class-specific measures of importance when the response is a factor.

Bagged Trees: the same methodology as a single tree is applied to all bootstrapped trees and the total importance is returned.
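
As an illustration of the linear model measure described above, the following base R sketch computes the absolute t-statistic for each coefficient directly from a fitted lm object. It is not the varImp.lm source; the choice of the mtcars data, the predictors, and the decision to drop the intercept row are assumptions made for this example.

```r
# Fit a linear model on a data set that ships with base R.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Coefficient table: one row per model parameter, including a "t value" column.
coefs <- summary(fit)$coefficients

# Importance as |t|; the intercept row is dropped (an assumption of this sketch).
importance <- abs(coefs[rownames(coefs) != "(Intercept)", "t value"])

# List predictors from largest to smallest absolute t-statistic.
sort(importance, decreasing = TRUE)
```

Because the t-statistic depends on the other terms in the model, these values describe each parameter conditional on the rest of the fit, not each predictor in isolation.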

Boosted Trees: varImp.gbm is a wrapper around the function from the gbm package (see the gbm package vignette).

Multivariate Adaptive Regression Splines: MARS models already include a backwards elimination feature selection routine that looks at reductions in the generalized cross-validation (GCV) estimate of error. The varImp function tracks the changes in model statistics, such as the GCV, for each predictor and accumulates the reduction in the statistic when each predictor's feature is added to the model. This total reduction is used as the variable importance measure. If a predictor was never used in any MARS basis function, it has an importance value of zero. There are four statistics that can be used to estimate variable importance in MARS models. Using varImp(object, value = "gcv") tracks the reduction in the generalized cross-validation statistic as terms are added. Also, the option varImp(object, value = "grsq") compares the GCV statistic for each model to the intercept only model. However, there are some cases when terms are retained in the model that result in an increase in GCV. Negative variable importance values for MARS are set to a small, non-zero number. Alternatively, using varImp(object, value = "rss") monitors the change in the residual sums of squares (RSS) as terms are added, which will never be negative. Also, the mars function stops iterating the forward selection routine when the ratio of the current RSS over the RSS from the intercept only model.

Nearest shrunken centroids: the difference between the class centroids and the overall centroid is used to measure the variable influence (see pamr.predict). The larger the difference between the class centroid and the overall center of the data, the larger the separation between the classes. The training set predictors must be supplied (via the data argument) when an object of class pamrtrained is given to varImp.
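
The centroid idea behind the nearest shrunken centroids measure can be sketched in base R. This is a simplified illustration only: it uses raw (unshrunken, unstandardized) differences on the built-in iris data, whereas pamr works with standardized and shrunken centroids, so the numbers will not match what varImp returns for a pamrtrained object.

```r
# Predictors and class labels from a data set that ships with base R.
x <- as.matrix(iris[, 1:4])
y <- iris$Species

# Overall centroid: the mean of each predictor across all samples.
overall <- colMeans(x)

# Class centroids: per-class column sums divided by the class sizes,
# giving one row per class and one column per predictor.
class_centroids <- rowsum(x, y) / as.vector(table(y))

# Difference between each class centroid and the overall centroid.
centroid_diff <- sweep(class_centroids, 2, overall, "-")

# Larger absolute differences indicate predictors that separate a class more.
abs(centroid_diff)
```

Each row of the result corresponds to one class, which is why this kind of measure is class-specific.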
