Variable selection with GLM-lasso or Square-root lasso choosing the penalty parameter \(\lambda\) with the Quantile Universal Threshold. The procedure goes towards sparse estimation of the coefficients for good selection of the important predictors.
qut(y,X,fit,family=gaussian,alpha.level=0.05,M=1000,qut.standardize=TRUE,
intercept=TRUE,offset=NULL,bootstrap=TRUE,sigma=ifelse(n>2*p,'ols','qut'),beta0='iterglm',
estimator='unbiased',type=c('glmnet','lars','flare'),lambda.seq=0,penalty.factor=rep(1,p),
lambda.min.ratio=ifelse(n
response variable. Quantitative for family=gaussian
, or family=poisson
(non-negative counts). For family=binomial
should be a factor with two levels.
input matrix, of dimension n x p; each row is an observation vector.
a user supplied glmnet
or lars object. Typical usage is to leave it empty so that the program computes the regularization path using the algorithm selected in type
. WARNING: use with care, if supplied, object options must match with user supplied options.
response type (see above). Default is gaussian
.
level, such that quantile \(\tau=(1-\)alpha.level
). Default is 0.05.
number of Monte Carlo Simulations to estimate the distribution \(\Lambda\). Default is 1000.
standardize matrix X with a quantile-based standardization. Default is TRUE. It is not used for sqrt-lasso.
should intercept(s) be fitted (default=TRUE) or set to zero (FALSE).
a vector of length n
that is included in the linear predictor. Useful for the poisson
family (e.g. log of exposure time), or for refining a model by starting at a current fit. Default is NULL.
set TRUE if it is desired to bootstrap matrix X when computing the Quantile Universal Threshold (Random scenario). Default is TRUE.
standard deviation of the Gaussian errors. Used only if family=gaussian
. When sigma = 'qut', it is estimated based on the Quantile Universal Threshold (default if n <= 2p
); when sigma = 'rcv', it is estimated using Refitted Cross Validation in Fan et al. 2012; and when sigma = 'cv', it is estimated using cross validation as in Reid et al. 2013. If sigma
is a positive real number, then that value is used for the standard deviation.
If n>p
and sigma='ols' it is estimated using the ordinary least squeares estimator (default if n>2p
)
coefficients of the unpenalized covariates for generating the null data for the Quantile Universal Threshold. When is 'iterglm' (Default) or 'iter', it is estimated using one step iteration of the entire procedure with maximum likelihood estimation or the lasso estimation, respectively. If 'noiter' then it is estimated without iterating.
If it is desired to set beta0
in advance, then it should be a vector of size the number of unpenalized covariates including the intercept if intercept=TRUE
, in the same order. If there are not unpenalized covariates and intercept=TRUE
, then it must be a real number.
type of estimation of sigma when sigma
= 'qut'. It can be equal to 'unbiased' (standard unbiased formula), or 'mle' (maximum likelihood formula).
algorithm for solving the optimization problem. It can be lars
(type
='lars') or glmnet
(type
='glmnet') for GLM-lasso, or flare
(type
='flare') for Square-root lasso. For GLM-lasso, if family is not gaussian, penalty.factor is different from default, or offset different from NULL, glmnet will be always used. Default is 'glmnet'.
preset lambda sequence when type = 'glmnet'. If lambda.seq
<2 the sequence of lambdas goes decreasing from lambda.max
to lambda.qut
. If lambda.seq
= 0, lambda
sequence is equispaced. If lambda.seq
= 1, lambda
sequence is equispaced in the log scale. Use lambda.seq
=2 for glmnet
default options. Default is 0.
separate penalty factors can be applied to each coefficient. This is a number that multiplies lambda to allow differential shrinkage. Can be 0 for some variables, which implies no shrinkage, and that variable is always included in the model. Default is 1 for all variables (and implicitly infinity for variables listed in exclude). Note: the penalty factors are internally rescaled to sum to n, and the lambda sequence will reflect this change.
smallest value for lambda, as a fraction of lambda.max
. As in glmnet
.
the number of lambda
. As in glmnet
. Default is 100.
a user supplied lambda
sequence. As in glmnet
. Not used when type
='flare'.
glmnet
or lars
options.
value of the Quantile Universal Threshold.
object fitted by glmnet
or lars
.
coefficients obtained with the Quantile Universal Threshold.
coefficients obtained fitting GLM with the non zero coefficients in beta
.
estimated value of the intercept when family is not gaussian
.
response type
standard deviation estimate of the errors (when family=gaussian
)
scale factor used for standardizing \(X\).
C. Giacobino, J. Diaz, S. Sardy, N. Hengartner. Quantile universal threshold for model selection. 2016 Jianqing Fan, Shaojun Guo and Ning Hao. Variance estimation using refitted cross-validation in ultrahigh dimensional regression. Journal of the Royal Statistical Society: Series B. 2012 Stephen Reid, Robert Tibshirani, and Jerome Friedman. A Study of Error Variance Estimation in Lasso Regression. 2013
# NOT RUN {
set.seed(1234)
X=matrix(rnorm(50*500),50,500)
beta=c(rep(10,5),rep(0,500-5))
y=X %*% beta+rnorm(50)
outqut=qut(y,X,type='glmnet',family=gaussian,sigma=1) #Fitting with qut
betaqut=outqut$beta[-1]
outcv=cv.glmnet(X,y,family='gaussian') #fitting with Cross-Validation
betacv=coef(outcv$glmnet.fit,s=outcv$lambda.min)[-1]
results=rbind( c(sum(betaqut[1:5]!=0),sum(betaqut[-(1:5)]!=0)),
c(sum( betacv[1:5]!=0), sum(betacv[-(1:5)]!=0)) )
colnames(results)=c('True Positive','False Positive')
rownames(results)=c('qut','cv')
print(results)
# }
Run the code above in your browser using DataLab