qut: Fit a low dimensional GLM or Square-root lasso using the Quantile Universal Threshold

Description

Variable selection with GLM-lasso or Square-root lasso choosing the penalty parameter \(\lambda\) with the Quantile Universal Threshold. The procedure goes towards sparse estimation of the coefficients for good selection of the important predictors.

Usage

qut(y,X,fit,family=gaussian,alpha.level=0.05,M=1000,qut.standardize=TRUE,
intercept=TRUE,offset=NULL,bootstrap=TRUE,sigma=ifelse(n>2*p,'ols','qut'),beta0='iterglm',
estimator='unbiased',type=c('glmnet','lars','flare'),lambda.seq=0,penalty.factor=rep(1,p),
lambda.min.ratio=ifelse(n

Arguments

response variable. Quantitative for family=gaussian, or family=poisson (non-negative counts). For family=binomial should be a factor with two levels.

input matrix, of dimension n x p; each row is an observation vector.

fit

a user supplied glmnet or lars object. Typical usage is to leave it empty so that the program computes the regularization path using the algorithm selected in type. WARNING: use with care, if supplied, object options must match with user supplied options.

family

response type (see above). Default is gaussian.

alpha.level

level, such that quantile \(\tau=(1-\)alpha.level). Default is 0.05.

number of Monte Carlo Simulations to estimate the distribution \(\Lambda\). Default is 1000.

qut.standardize

standardize matrix X with a quantile-based standardization. Default is TRUE. It is not used for sqrt-lasso.

intercept

should intercept(s) be fitted (default=TRUE) or set to zero (FALSE).

offset

a vector of length n that is included in the linear predictor. Useful for the poisson family (e.g. log of exposure time), or for refining a model by starting at a current fit. Default is NULL.

bootstrap

set TRUE if it is desired to bootstrap matrix X when computing the Quantile Universal Threshold (Random scenario). Default is TRUE.

sigma

standard deviation of the Gaussian errors. Used only if family=gaussian. When sigma = 'qut', it is estimated based on the Quantile Universal Threshold (default if n <= 2p); when sigma = 'rcv', it is estimated using Refitted Cross Validation in Fan et al. 2012; and when sigma = 'cv', it is estimated using cross validation as in Reid et al. 2013. If sigma is a positive real number, then that value is used for the standard deviation. If n>p and sigma='ols' it is estimated using the ordinary least squeares estimator (default if n>2p)

beta0

coefficients of the unpenalized covariates for generating the null data for the Quantile Universal Threshold. When is 'iterglm' (Default) or 'iter', it is estimated using one step iteration of the entire procedure with maximum likelihood estimation or the lasso estimation, respectively. If 'noiter' then it is estimated without iterating. If it is desired to set beta0 in advance, then it should be a vector of size the number of unpenalized covariates including the intercept if intercept=TRUE, in the same order. If there are not unpenalized covariates and intercept=TRUE, then it must be a real number.

estimator

type of estimation of sigma when sigma = 'qut'. It can be equal to 'unbiased' (standard unbiased formula), or 'mle' (maximum likelihood formula).

type

algorithm for solving the optimization problem. It can be lars (type='lars') or glmnet (type='glmnet') for GLM-lasso, or flare (type='flare') for Square-root lasso. For GLM-lasso, if family is not gaussian, penalty.factor is different from default, or offset different from NULL, glmnet will be always used. Default is 'glmnet'.

lambda.seq

preset lambda sequence when type = 'glmnet'. If lambda.seq<2 the sequence of lambdas goes decreasing from lambda.max to lambda.qut. If lambda.seq= 0, lambda sequence is equispaced. If lambda.seq= 1, lambda sequence is equispaced in the log scale. Use lambda.seq=2 for glmnet default options. Default is 0.

penalty.factor

separate penalty factors can be applied to each coefficient. This is a number that multiplies lambda to allow differential shrinkage. Can be 0 for some variables, which implies no shrinkage, and that variable is always included in the model. Default is 1 for all variables (and implicitly infinity for variables listed in exclude). Note: the penalty factors are internally rescaled to sum to n, and the lambda sequence will reflect this change.

lambda.min.ratio

smallest value for lambda, as a fraction of lambda.max. As in glmnet.

nlambda

the number of lambda. As in glmnet. Default is 100.

lambda

a user supplied lambda sequence. As in glmnet. Not used when type='flare'.

…

glmnet or lars options.

Value

lambda

value of the Quantile Universal Threshold.

fit

object fitted by glmnet or lars.

beta

coefficients obtained with the Quantile Universal Threshold.

betaglm

coefficients obtained fitting GLM with the non zero coefficients in beta.

beta0

estimated value of the intercept when family is not gaussian.

family

response type

sigma

standard deviation estimate of the errors (when family=gaussian)

scale.factor

scale factor used for standardizing \(X\).

References

C. Giacobino, J. Diaz, S. Sardy, N. Hengartner. Quantile universal threshold for model selection. 2016 Jianqing Fan, Shaojun Guo and Ning Hao. Variance estimation using refitted cross-validation in ultrahigh dimensional regression. Journal of the Royal Statistical Society: Series B. 2012 Stephen Reid, Robert Tibshirani, and Jerome Friedman. A Study of Error Variance Estimation in Lasso Regression. 2013

Examples

Run this code

# NOT RUN {
set.seed(1234)
X=matrix(rnorm(50*500),50,500)
beta=c(rep(10,5),rep(0,500-5))
y=X %*% beta+rnorm(50)

outqut=qut(y,X,type='glmnet',family=gaussian,sigma=1) #Fitting with qut
betaqut=outqut$beta[-1]

outcv=cv.glmnet(X,y,family='gaussian') #fitting with Cross-Validation
betacv=coef(outcv$glmnet.fit,s=outcv$lambda.min)[-1]

results=rbind( 	c(sum(betaqut[1:5]!=0),sum(betaqut[-(1:5)]!=0)),
				c(sum( betacv[1:5]!=0), sum(betacv[-(1:5)]!=0)) )
colnames(results)=c('True Positive','False Positive')
rownames(results)=c('qut','cv')

print(results)

# }