stepVIF: Variable selection using the (generalized) variance-inflation factor (VIF)

Description

This function takes a linear model and selects the subset of predictor variables that meet a user-specified collinearity threshold, measured by the (generalized) variance-inflation factor (VIF).

Usage

stepVIF(model, threshold = 10, verbose = FALSE)

Arguments

model

Linear model (object of class 'lm') containing collinear predictor variables.

threshold

Positive number defining the maximum allowed VIF. Defaults to threshold = 10.

verbose

Logical indicating if iteration results should be printed. Defaults to verbose = FALSE.

Value

A linear model (object of class 'lm') in which all remaining predictor variables meet the (G)VIF threshold.

Dependencies

The car package, provider of functions to accompany Fox and Weisberg's An R Companion to Applied Regression, is required for stepVIF() to work. The development version of the car package is available at https://r-forge.r-project.org/projects/car/, while its old versions are available in the CRAN archive at https://cran.r-project.org/src/contrib/Archive/car/.

Details

stepVIF() starts by computing the VIF of all predictor variables in the linear model. If the linear model contains categorical predictor variables, generalized variance-inflation factors (GVIF) (Fox and Monette, 1992) are calculated instead using car::vif(). The GVIF is interpretable as the inflation in size of the confidence ellipse or ellipsoid for the coefficients of the predictor variable in comparison with what would be obtained for orthogonal, uncorrelated data. Since a categorical predictor can have more than one degree of freedom, df, its confidence ellipsoid has df dimensions, and the GVIF needs to be adjusted so that it is comparable across predictor variables. The adjustment is made using the following equation:

\(GVIF^{1/(2\times df)}\)

The next step consists of evaluating whether any of the predictor variables has a (G)VIF larger than the specified threshold, the function default being threshold = 10. For the adjusted value, GVIF^(1/(2*df)), the threshold is sqrt(threshold).
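The computation and the threshold check described above can be sketched in base R. This is a minimal illustration, not the source of stepVIF() or car::vif(): the helper name gvif_table() is hypothetical, the iris data set is used only because it ships with R, and the sketch assumes the determinant-based definition of the GVIF applied to the correlation matrix of the coefficient estimates.

```r
# Hypothetical helper: (G)VIF, degrees of freedom, and adjusted GVIF per model term.
gvif_table <- function(model) {
  asgn <- attr(model.matrix(model), "assign")  # term index of each coefficient
  keep <- asgn != 0                            # drop the intercept (term 0)
  v <- vcov(model)[keep, keep, drop = FALSE]
  asgn <- asgn[keep]
  term_labels <- attr(terms(model), "term.labels")
  R <- cov2cor(v)                              # correlation matrix of the estimates
  det_R <- det(R)
  out <- t(sapply(seq_along(term_labels), function(i) {
    idx <- which(asgn == i)                    # coefficients belonging to term i
    gvif <- det(R[idx, idx, drop = FALSE]) *
      det(R[-idx, -idx, drop = FALSE]) / det_R
    c(GVIF = gvif, Df = length(idx), adjusted = gvif^(1 / (2 * length(idx))))
  }))
  rownames(out) <- term_labels
  out
}

# One categorical predictor (Species, df = 2) alongside two numeric ones
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Species, data = iris)
tab <- gvif_table(fit)
print(tab)

# Compare the adjusted values against sqrt(threshold)
threshold <- 10
exceeds <- tab[, "adjusted"] > sqrt(threshold)
print(exceeds)
```

For a term with a single degree of freedom, the GVIF reduces to the ordinary VIF, so comparing its square root against sqrt(threshold) is equivalent to comparing the VIF against threshold.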

If only one predictor variable does not meet the (G)VIF threshold, it is automatically removed from the model and no further processing occurs. When two or more predictor variables do not meet the (G)VIF threshold, stepVIF() fits a linear model between each of them and the dependent variable. The predictor variable with the lowest adjusted coefficient of determination is dropped from the model and new coefficients are calculated, resulting in a new linear model.

This process is repeated until all predictor variables included in the new model meet the (G)VIF threshold.

Nothing is done if all predictor variables have a (G)VIF value lower than the threshold, and stepVIF() returns the original linear model.
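The elimination procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of stepVIF(): the function name step_vif_sketch() is hypothetical, only numeric predictors are handled (so the plain VIF is taken from the diagonal of the inverse correlation matrix instead of calling car::vif()), and the mtcars data set is used only because it ships with R.

```r
# Hypothetical sketch of the stepwise (G)VIF elimination for numeric predictors.
step_vif_sketch <- function(data, response, predictors, threshold = 10) {
  while (length(predictors) > 1) {
    # VIF of each predictor: diagonal of the inverse correlation matrix
    vif <- diag(solve(cor(data[, predictors, drop = FALSE])))
    names(vif) <- predictors
    bad <- names(vif)[vif > threshold]
    if (length(bad) == 0) break          # every predictor meets the threshold
    if (length(bad) == 1) {
      drop_var <- bad                    # single offender: remove it directly
    } else {
      # Fit response ~ predictor for each offender and compare adjusted R^2
      adj_r2 <- sapply(bad, function(p) {
        summary(lm(reformulate(p, response), data = data))$adj.r.squared
      })
      drop_var <- names(which.min(adj_r2))  # weakest relation with the response
    }
    predictors <- setdiff(predictors, drop_var)
  }
  # Refit the model with the predictors that remain
  lm(reformulate(predictors, response), data = data)
}

fit <- step_vif_sketch(
  data = mtcars, response = "mpg",
  predictors = c("cyl", "disp", "hp", "drat", "wt", "qsec"),
  threshold = 5
)
print(attr(terms(fit), "term.labels"))
```

Dropping the offender with the lowest adjusted coefficient of determination keeps, among the collinear predictors, those that individually explain the most variance in the dependent variable.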

References

Fox, J. and Monette, G. (1992) Generalized collinearity diagnostics. Journal of the American Statistical Association, 87, 178–183.

Fox, J. (2008) Applied Regression Analysis and Generalized Linear Models, Second Edition. Sage.

Fox, J. and Weisberg, S. (2011) An R Companion to Applied Regression, Second Edition. Thousand Oaks: Sage.

Hair, J. F., Black, B., Babin, B. and Anderson, R. E. (2010) Multivariate data analysis. New Jersey: Pearson Prentice Hall.

Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth edition. Springer.

Samuel-Rosa, A., Heuvelink, G. B. M., de Mattos Vasques, G. and dos Anjos, L. H. C. (2015) Do more detailed environmental covariates deliver more accurate soil maps? Geoderma, 243–244, 214–227. doi: 10.1016/j.geoderma.2014.12.017.

Examples


# NOT RUN {
if (require(car)) {
  fit <- lm(prestige ~ income + education + type, data = Duncan)
  fit <- stepVIF(fit, threshold = 10, verbose = TRUE)
}
# }
