This function fits a sparse linear model between a scalar response and a vector of scalar covariates. It employs a penalised least-squares regularisation procedure, with either (group)SCAD or (group)LASSO penalties. The method utilises an objective criterion (criterion) to select the optimal regularisation parameter (lambda.opt).
lm.pels.fit(z, y, lambda.min = NULL, lambda.min.h = NULL, lambda.min.l = NULL,
factor.pn = 1, nlambda = 100, lambda.seq = NULL, vn = ncol(z), nfolds = 10,
seed = 123, criterion = "GCV", penalty = "grSCAD", max.iter = 1000)
The matched call.
Estimated scalar response.
Differences between y and the fitted.values.
Estimate of \(\beta_0\) when the optimal penalisation parameter lambda.opt and vn.opt are used.
Indexes of the non-zero \(\hat{\beta_{j}}\).
Selected value of lambda.
Value of the criterion function considered to select lambda.opt and vn.opt.
Selected value of vn.
Matrix containing the observations of the covariates collected by row.
Vector containing the scalar response.
The smallest value for lambda (i.e., the lower endpoint of the sequence in which lambda.opt is selected), as a fraction of lambda.max.
The default is lambda.min.l if the sample size is larger than factor.pn times the number of linear covariates, and lambda.min.h otherwise.
The lower endpoint of the sequence in which lambda.opt is selected if the sample size is smaller than factor.pn times the number of linear covariates. The default is 0.05.
The lower endpoint of the sequence in which lambda.opt is selected if the sample size is larger than factor.pn times the number of linear covariates. The default is 0.0001.
Positive integer used to set lambda.min. The default value is 1.
Positive integer indicating the number of values in the sequence from which lambda.opt is selected. The default is 100.
Sequence of values in which lambda.opt is selected. If lambda.seq=NULL, then the program builds the sequence automatically using lambda.min and nlambda.
Positive integer or vector of positive integers indicating the number of groups of consecutive variables to be penalised together. The default value is vn=ncol(z), resulting in the individual penalisation of each scalar covariate.
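As an illustrative sketch (base R only; the group sizes below are hypothetical, not defaults of this function), a vector vn of group sizes maps consecutive covariates to penalisation groups in the style the underlying grpreg machinery uses, where covariates sharing a label are penalised together:

```r
# Hypothetical example: seven covariates split into three groups of
# consecutive variables, of sizes 2, 3 and 2.
vn <- c(2, 3, 2)

# Expand the group sizes into one group label per covariate.
groups <- rep(seq_along(vn), times = vn)

# groups is now 1 1 2 2 2 3 3: covariates with the same label form one
# penalisation group; vn = ncol(z) instead penalises each covariate alone.
```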
Number of cross-validation folds (used when criterion="k-fold-CV"). Default is 10.
Seed for the random number generator, to ensure reproducible results (used when criterion="k-fold-CV"). The default is 123.
The criterion used to select the regularisation parameter lambda.opt (and vn.opt if needed). Options include "GCV", "BIC", "AIC" or "k-fold-CV". The default setting is "GCV".
The penalty function applied in the penalised least-squares procedure. Currently, only "grLasso" and "grSCAD" are implemented. The default is "grSCAD".
Maximum number of iterations allowed across the entire path. The default value is 1000.
German Aneiros Perez german.aneiros@udc.es
Silvia Novo Diaz snovo@est-econ.uc3m.es
The sparse linear model (SLM) is given by the expression: $$ Y_i=Z_{i1}\beta_{01}+\dots+Z_{ip_n}\beta_{0p_n}+\varepsilon_i\ \ \ i=1,\dots,n, $$ where \(Y_i\) denotes a scalar response and \(Z_{i1},\dots,Z_{ip_n}\) are real covariates. In this equation, \(\mathbf{\beta}_0=(\beta_{01},\dots,\beta_{0p_n})^{\top}\) is a vector of unknown real parameters and \(\varepsilon_i\) represents the random error.
In this function, the SLM is fitted using a penalised least-squares (PeLS) approach by minimising
$$
\mathcal{Q}\left(\mathbf{\beta}\right)=\frac{1}{2}\left(\mathbf{Y}-\mathbf{Z}\mathbf{\beta}\right)^{\top}\left(\mathbf{Y}-\mathbf{Z}\mathbf{\beta}\right)+n\sum_{j=1}^{p_n}\mathcal{P}_{\lambda_{j_n}}\left(|\beta_j|\right), \quad (1)
$$
where \(\mathbf{\beta}=(\beta_1,\ldots,\beta_{p_n})^{\top}\), \(\mathcal{P}_{\lambda_{j_n}}\left(\cdot\right)\) is a penalty function (specified in the argument penalty) and \(\lambda_{j_n} > 0\) is a tuning parameter.
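For reference, the two implemented penalty families can be written explicitly (notation from Fan and Li, 2001; the SCAD shape constant \(a\) is not an argument of this function):

```latex
% Lasso penalty (used by "grLasso"):
\mathcal{P}_{\lambda}\left(|\beta|\right)=\lambda|\beta|.
% SCAD penalty (used by "grSCAD"), defined through its derivative:
\mathcal{P}'_{\lambda}\left(\beta\right)=\lambda\left\{I\left(\beta\le\lambda\right)
  +\frac{\left(a\lambda-\beta\right)_{+}}{(a-1)\lambda}\,I\left(\beta>\lambda\right)\right\},
  \quad \beta>0,\ a>2,
```

with \(a=3.7\) recommended by Fan and Li (2001). Unlike the Lasso, the SCAD penalty tapers off for large coefficients, which reduces the bias of the non-zero estimates.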
To reduce the number of tuning parameters, \(\lambda_j\), to be selected for each sample, we consider \(\lambda_j = \lambda \widehat{\sigma}_{\beta_{0,j,OLS}}\), where \(\beta_{0,j,OLS}\) denotes the OLS estimate of \(\beta_{0,j}\) and \(\widehat{\sigma}_{\beta_{0,j,OLS}}\) is the estimated standard deviation. The parameter \(\lambda\) is selected using the objective criterion specified in the argument criterion.
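The rescaling \(\lambda_j = \lambda \widehat{\sigma}_{\beta_{0,j,OLS}}\) can be sketched in base R (simulated data and the value of \(\lambda\) below are made up for illustration; the function computes this internally):

```r
# Simulate a small design with three covariates (illustrative data only).
set.seed(123)
n <- 100; p <- 3
z <- matrix(rnorm(n * p), n, p)
beta0 <- c(2, 0, -1)
y <- drop(z %*% beta0) + rnorm(n)

# OLS fit without intercept; extract the estimated standard errors.
fit.ols <- lm(y ~ z - 1)
se.ols  <- summary(fit.ols)$coefficients[, "Std. Error"]

# A single global tuning value lambda yields one tuning parameter per
# coefficient, scaled by the variability of its OLS estimate.
lambda   <- 0.5
lambda.j <- lambda * se.ols
```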
For further details on the estimation procedure of the SLM, see, e.g., Fan and Li (2001). The PeLS objective function is minimised using the R function grpreg of the package grpreg (Breheny and Huang, 2015).
Remark: It should be noted that setting lambda.seq to \(0\) yields the non-penalised estimate of the model, i.e. the OLS estimate. Using lambda.seq with values \(\not=0\) is advisable when the presence of irrelevant variables is suspected.
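The zero-penalty case can be checked directly in base R (simulated data below are made up for illustration): with the penalty term in (1) removed, minimising the criterion reduces to ordinary least squares, so the solution of the normal equations coincides with lm().

```r
# Simulate a small design with two covariates (illustrative data only).
set.seed(1)
n <- 50
z <- matrix(rnorm(n * 2), n, 2)
y <- drop(z %*% c(1, -2)) + rnorm(n)

# With a zero penalty, (1) is minimised by the normal-equations solution...
beta.ne  <- drop(solve(crossprod(z), crossprod(z, y)))

# ...which matches the OLS fit returned by lm() without intercept.
beta.ols <- unname(coef(lm(y ~ z - 1)))
```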
Breheny, P., and Huang, J. (2015) Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors. Statistics and Computing, 25, 173--187, doi:10.1007/s11222-013-9424-2.
Fan, J., and Li, R. (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348--1360, doi:10.1198/016214501753382273.
See also PVS.fit
.
data("Tecator")
y <- Tecator$fat
z1 <- Tecator$protein
z2 <- Tecator$moisture
# Quadratic, cubic and interaction effects of the scalar covariates.
z.com <- cbind(z1, z2, z1^2, z2^2, z1^3, z2^3, z1*z2)
train <- 1:160
# LM fit
ptm <- proc.time()
fit <- lm.pels.fit(z = z.com[train, ], y = y[train], lambda.min.h = 0.02,
                   lambda.min.l = 0.01, factor.pn = 2, max.iter = 5000,
                   criterion = "BIC")
proc.time() - ptm
# Results
fit
names(fit)