This function takes a linear model and selects the subset of predictor variables that meet a user-specified collinearity threshold measured by the (generalized) variance-inflation factor (VIF).
stepVIF(model, threshold = 10, verbose = FALSE)
Linear model (object of class 'lm') containing collinear predictor variables.
Positive number defining the maximum allowed VIF. Defaults to threshold = 10.
Logical indicating if iteration results should be printed. Defaults to verbose = FALSE.
A linear model (object of class 'lm') with low collinearity.
The car package, provider of functions to accompany Fox and Weisberg's An R Companion to
Applied Regression, is required for stepVIF() to work. The development version of
the car package is available at https://r-forge.r-project.org/projects/car/, while its old
versions are available in the CRAN archive at
https://cran.r-project.org/src/contrib/Archive/car/.
stepVIF() starts by computing the VIF of all predictor variables in the linear model. If the linear
model contains categorical predictor variables, generalized variance-inflation factors (GVIF)
(Fox and Monette, 1992) are calculated instead using car::vif(). GVIF is interpretable as the
inflation in size of the confidence ellipse or ellipsoid for the coefficients of the predictor
variable in comparison with what would be obtained for orthogonal, uncorrelated data. Since
categorical predictors have more than one degree of freedom, df, the confidence ellipsoid will
have df dimensions, and GVIF needs to be adjusted so that it is comparable across
predictor variables. The adjustment is made using the following equation:
\(GVIF^{1/(2\times df)}\)
The next step consists of evaluating whether any of the predictor variables has a (G)VIF larger than
the specified threshold, the function default being threshold = 10. For GVIF^(1/(2*df)), the
threshold will be sqrt(threshold).
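The computation and threshold check described above can be sketched in base R. This is only an illustration of the determinant-based GVIF of Fox and Monette (1992), not the code used by stepVIF() or car::vif(). The helper name gvif_table() and the choice of the iris data are assumptions made for the example.

```r
# Hypothetical helper: GVIF, df and GVIF^(1/(2*df)) for each model term,
# computed from the correlation matrix of the coefficient estimates.
gvif_table <- function(model) {
  assign_idx <- attr(model.matrix(model), "assign")
  keep <- assign_idx != 0                      # drop the intercept column
  R <- cov2cor(vcov(model)[keep, keep, drop = FALSE])
  term_of_col <- assign_idx[keep]              # term index of each coefficient
  term_labels <- labels(terms(model))
  det_R <- det(R)
  out <- t(sapply(seq_along(term_labels), function(k) {
    cols <- which(term_of_col == k)
    gvif <- det(R[cols, cols, drop = FALSE]) *
      det(R[-cols, -cols, drop = FALSE]) / det_R
    c(GVIF = gvif, df = length(cols), adjusted = gvif^(1 / (2 * length(cols))))
  }))
  rownames(out) <- term_labels
  out
}

# Species is categorical (df = 2), so the adjusted GVIF is the comparable scale
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Species, data = iris)
gv <- gvif_table(fit)
print(gv)

# Terms whose adjusted GVIF exceeds sqrt(threshold) do not meet the threshold
threshold <- 10
flagged <- rownames(gv)[gv[, "adjusted"] > sqrt(threshold)]
print(flagged)
```

For a term with a single degree of freedom, the GVIF column reduces to the ordinary VIF, 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the others.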
If there is only one predictor variable that does not meet the (G)VIF threshold, it is automatically
removed from the model and no further processing occurs. When there are two or more predictor
variables that do not meet the (G)VIF threshold, stepVIF() fits a linear model
between each of them and the dependent variable. The predictor variable with the lowest adjusted
coefficient of determination is dropped from the model and new coefficients are calculated,
resulting in a new linear model.
This process is repeated until all predictor variables included in the new model meet the (G)VIF threshold.
Nothing is done if all predictor variables have a (G)VIF value lower than the threshold, and
stepVIF() returns the original linear model.
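The selection loop described above can be sketched in base R. To keep the example self-contained it handles numeric predictors only and uses the ordinary VIF, so the sqrt(threshold) rule for the adjusted GVIF does not appear. The function names vif_numeric() and step_vif_sketch() and the simulated data are assumptions made for illustration; this is not the implementation of stepVIF().

```r
# Hypothetical helper: ordinary VIF of each predictor, VIF_j = 1 / (1 - R_j^2)
vif_numeric <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

# Hypothetical sketch of the stepwise procedure for numeric predictors
step_vif_sketch <- function(data, response, predictors, threshold = 10) {
  while (length(predictors) > 1) {
    v <- vif_numeric(data, predictors)
    bad <- names(v)[v > threshold]
    if (length(bad) == 0) break                # all predictors meet the threshold
    # Regress the response on each offending predictor, one at a time
    adj_r2 <- sapply(bad, function(p) {
      summary(lm(reformulate(p, response = response), data = data))$adj.r.squared
    })
    # Drop the offending predictor with the lowest adjusted R^2
    predictors <- setdiff(predictors, bad[which.min(adj_r2)])
  }
  lm(reformulate(predictors, response = response), data = data)
}

# Simulated data: x1 and x2 are nearly collinear, x3 is independent
set.seed(1)
d <- data.frame(x1 = rnorm(100), x3 = rnorm(100))
d$x2 <- d$x1 + rnorm(100, sd = 0.05)
d$y <- d$x1 + d$x3 + rnorm(100)

fit_low <- step_vif_sketch(d, "y", c("x1", "x2", "x3"), threshold = 10)
print(labels(terms(fit_low)))
```

When only one predictor exceeds the threshold, which.min() selects it trivially, which matches the single-offender case described above.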
Fox, J. and Monette, G. (1992) Generalized collinearity diagnostics. JASA, 87, 178–183.
Fox, J. (2008) Applied Regression Analysis and Generalized Linear Models, Second Edition. Sage.
Fox, J. and Weisberg, S. (2011) An R Companion to Applied Regression, Second Edition. Thousand Oaks: Sage.
Hair, J. F., Black, B., Babin, B. and Anderson, R. E. (2010) Multivariate data analysis. New Jersey: Pearson Prentice Hall.
Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth edition. Springer.
Samuel-Rosa, A., Heuvelink, G. B. M., de Mattos Vasques, G. and dos Anjos, L. H. C. (2015) Do more detailed environmental covariates deliver more accurate soil maps? Geoderma, 243–244, 214–227. doi: 10.1016/j.geoderma.2014.12.017.
# NOT RUN {
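if (require(car)) {
  # Duncan data become available once car is loaded; 'type' is categorical
  fit <- lm(prestige ~ income + education + type, data = Duncan)
  # Drop predictors until all meet the threshold, printing each iteration
  fit <- stepVIF(fit, threshold = 10, verbose = TRUE)
}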
# }