bestglm: Best Subset GLM using Information Criterion or Cross-Validation

Description

Best subset selection using 'leaps' algorithm (Furnival and Wilson, 1974) or complete enumeration (Morgan and Tatar, 1972). Complete enumeration is used for the non-Gaussian and for the case where the input matrix contains factor variables with more than 2 levels. The best fit may be found using the information criterion IC: AIC, BIC, EBIC, or BICq. Alternatively, with IC=`CV' various types of cross-validation may be used.

Usage

bestglm(Xy, family = gaussian, IC = "BIC", t = "default", 
 CVArgs = "default", qLevel = 0.99, TopModels = 5, 
 method = "exhaustive", intercept = TRUE, weights = NULL, 
 nvmax = "default", RequireFullEnumerationQ = FALSE, ...)

Arguments

Dataframe containing the design matrix X and the output variable y. All columns must be named.

family

One of the glm distribution functions. The glm function is not used in the Gaussian case. Instead for efficiency either 'leaps' is used or when factor variables are present with more than 2 levels, 'lm' may be used.

Information criteria to use: "AIC", "BIC", "BICg", "BICq", "LOOCV", "CV".

adjustable parameter for BICg, BICq or CV. For BICg, default is g=t=1. For BICq, default is q=t=0.25. For CV, default the delete-d method with d=ceil(n(1-1/(log n - 1))) and REP=t=1000. The default value of the parameter may be changed by changing t.

CVArgs

Used when IC is set to `CV`. The default is use the delete-d algorithm with d=ceil(n(1-1/(log n - 1))) and t=100 repetitions. Note that the number of repetitions can be changed using t. More generally, CVArgs is a list with 3 named components: Method, K, REP, where Method is one of \"HTF\", \"DH\", \"d\" corresponding to using the functions CVHTM (Hastie et al., 2009, K-fold CV), CVDH (adjusted K-fold CV, Davison and Hartigan, 1997) and CVd (delete-d CV with random subsamples, Shao, 1997).

qLevel

the alpha level for determining interval for best q. Larger alpha's result in larger intervals.

TopModels

Finds the best TopModels models.

method

Method used in leaps algorithm for searching for the best subset.

intercept

Default TRUE means the intercept term is always included. If set to FALSE, no intercept term is included. If you want only include the intercept term when it is signficant then set IncludeInterceptQ=FALSE and include a column of 1's in the design matrix.

weights

nvmax

maximum number of independent variables allowed. By default, all variables

RequireFullEnumerationQ

Use exhaustive search algorithm instead of 'leaps'

…

Optional arguments which are passed to lm or glm

Value

A list with class attribute 'bestglm' and named components:

BestModel

An lm-object representing the best fitted regression.

Title

A brief title describing the algorithm used: CV(K=K), CVadj(K=K), CVd(d=K). The range of q for an equivalent BICq model is given.

Subsets

The best subsets of size, k=0,1,...,p are indicated as well the value of the log-likelihood and information criterion for each best subset. In the case of categorical variables with more than 2 levels, the degrees of freedom are also shown.

qTable

Table showing range of q for choosing each possible subset size. Assuming intercept=TRUE, k=1 corresponds to model with only an intercept term and k=p+1, where p is the number of input variables, corresponds to including all variables.

Bestq

Optimal q

ModelReport

A list with components: NullModel, LEAPSQ, glmQ, gaussianQ, NumDF, CategoricalQ, Bestk.

BestModels

Variables in the TopModels best list

Methods function 'print.bestglm' and 'summary.bestglm' are provided.

Details

In the Gaussian case, the loglikelihood may be written \(logL = -(n/2) log (RSS/n)\), where RSS is the residual sum-of-squares and n is the number of observations. When the function 'glm' is used, the log-likelihood, logL, is obtained using 'logLik'. The penalty for EBIC and BICq depends on the tuning parameter argument, t. The argument t also controls the number of replications used when the delete-d CV is used as default. In this case, the parameter d is chosen using the formula recommended by Shao (1997). See CVd for more details.

In the binomial GLM, nonlogistic, case the last two columns of Xy are the counts of 'success' and 'failures'.

Cross-validation may also be used to select the best subset. When cross-validation is used, the best models of size k according to the log-likelihood are compared for k=0,1,...,p, where p is the number of inputs. Cross-validation is not available when there are categorical variables since in this case it is likely that the training sample may not contain all levels and in this case we can't predict the response in the validation sample. In the case of GLM, the \"DH\" method for CV is not available.

Usually it is a good idea to keep the intercept term even if it is not significant. See discussion in vignette.

Cross-validation is not available for models with no intercept term or when force.in is non-null or when nvmax is set to less than the full number of independent variables.

Please see the package vignette for more details and examples.

References

Xu, C. and McLeod, A.I. (2009). Bayesian Information Criterion with Bernouilli Prior.

Chen, J. and Chen, Z. (2008). Extended Bayesian Information Criteria for Model Selection with Large Model Space. Biometrika 2008 95: 759-771.

Furnival, G.M. and Wilson, R. W. (1974). Regressions by Leaps and Bounds. Technometrics, 16, 499--511.

Morgan, J. A. and Tatar, J. F. (1972). Calculation of the Residual Sum of Squares for All Possible Regressions. Technometrics 14, 317-325.

Miller, A. J. (2002), Subset Selection in Regression, 2nd Ed. London, Chapman and Hall.

Shao, Jun (1997). An Asymptotic Theory for Linear Model Selection. Statistica Sinica 7, 221-264.

Examples

Run this code

# NOT RUN {
#Example 1. 
#White noise test.
set.seed(123321123)
p<-25   #number of inputs
n<-100  #number of observations
X<-matrix(rnorm(n*p), ncol=p)
y<-rnorm(n)
Xy<-as.data.frame(cbind(X,y))
names(Xy)<-c(paste("X",1:p,sep=""),"y")
bestAIC <- bestglm(Xy, IC="AIC")
bestBIC <- bestglm(Xy, IC="BIC")
bestEBIC <- bestglm(Xy, IC="BICg")
bestBICq <- bestglm(Xy, IC="BICq")
NAIC <- length(coef(bestAIC$BestModel))-1
NBIC <- length(coef(bestBIC$BestModel))-1
NEBIC <- length(coef(bestEBIC$BestModel))-1
NBICq <- length(coef(bestBICq$BestModel))-1
ans<-c(NAIC, NBIC, NEBIC, NBICq)
names(ans)<-c("AIC", "BIC", "BICg", "BICq")
ans
# AIC  BIC EBIC BICq 
#   3    1    0    0 

#Example 2. bestglm with BICq
#Find best model. Default is BICq with q=0.25
data(znuclear) #standardized data. 
#Rest of examples assume this dataset is loaded.
out<-bestglm(znuclear, IC="BICq")
out
#The optimal range for q
out$Bestq
#The possible models that can be chosen
out$qTable
#The best models for each subset size
out$Subsets
#The overall best models
out$BestModels
#
#Example 3. Normal probability plot, residuals, best model
ans<-bestglm(znuclear, IC="BICq")
e<-resid(ans$BestModel)
qqnorm(e, ylab="residuals, best model")
#
#To save time, none of the remaining examples are run
# }
# NOT RUN {
#Example 4. bestglm, using EBIC, g=1
bestglm(znuclear, IC="BICg")
#EBIC with g=0.5
bestglm(znuclear, IC="BICg", t=0.5)
#
#Example 5. bestglm, CV
data(zprostate)
train<-(zprostate[zprostate[,10],])[,-10]
#the default CV method takes too long, set t=10 to do only
# 10 replications instead of the recommended 1000
bestglm(train, IC="CV", t=10)
bestglm(train, IC="CV", CVArgs=list(Method="HTF", K=10, REP=1))
#Compare with DH Algorithm. Normally set REP=100 is recommended.
bestglm(train, IC="CV", CVArgs=list(Method="DH", K=10, REP=1))
#Compare LOOCV
bestglm(train, IC="LOOCV")
#
#Example 6. Optimal q for manpower dataset
data(manpower)
out<-bestglm(manpower)
out$Bestq
#
#Example 7. Factors with more than 2 levels
data(AirQuality)
bestglm(AirQuality)
#
#Example 8. Logistic regression
data(SAheart)
bestglm(SAheart, IC="BIC", family=binomial)
#BIC agrees with backward stepwise approach
out<-glm(chd~., data=SAheart, family=binomial)
step(out, k=log(nrow(SAheart)))
#but BICq with q=0.25
bestglm(SAheart, IC="BICq", t=0.25, family=binomial)
#
#Cross-validation with glm
#make reproducible results
set.seed(33997711)
#takes about 15 seconds and selects 5 variables
bestglm(SAheart, IC="CV", family=binomial)
#about 6 seconds and selects 2 variables
bestglm(SAheart, IC="CV", CVArgs=list(Method="HTF", K=10, REP=1), family=binomial)
#Will produce an error -- NA
\dontrun{bestglm(SAheart, IC="CV", CVArgs=list(Method="DH", K=10, REP=1), family=binomial)}
\dontrun{bestglm(SAheart, IC="LOOCV", family=binomial)}
#
#Example 9. Model with no intercept term
X<-matrix(rnorm(200*3), ncol=3)
b<-c(0, 1.5, 0)
y<-X%*%b + rnorm(40)
Xy<-data.frame(as.matrix.data.frame(X), y=y)
bestglm(Xy, intercept=FALSE) 
# }
# NOT RUN {
# }

Run the code above in your browser using DataLab