grpregOverlap: Fit penalized regression models with overlapping grouped variables

Description

Fit the regularization paths of linear, logistic, Poisson, or Cox models with overlapping grouped covariates based on the latent group lasso approach (Jacob et al., 2009; Obozinski et al., 2011). Latent group MCP/SCAD, as well as bi-level selection methods, namely the group exponential lasso (Breheny, 2015) and the composite MCP (Huang et al., 2012), are also available.

Usage

grpregOverlap(X, y, group,
              penalty = c("grLasso", "grMCP", "grSCAD", "gel", "cMCP", "gLasso", "gMCP"),
              family = c("gaussian", "binomial", "poisson", "cox"),
              nlambda = 100, lambda,
              lambda.min = {if (nrow(X) > ncol(X)) 1e-4 else .05},
              alpha = 1, eps = .001, max.iter = 1000,
              dfmax = ncol(X), gmax = length(group),
              gamma = ifelse(penalty == "grSCAD", 4, 3), tau = 1/3,
              group.multiplier, returnX = FALSE, returnOverlap = FALSE,
              warn = TRUE, ...)

Arguments

X

The design matrix, without an intercept. grpregOverlap calls grpreg, which standardizes the data and includes an intercept by default.

y

The response vector, or a matrix in the case of multitask learning. For survival analysis, y is the time-to-event outcome: a two-column matrix or Surv object. The first column is the time on study (follow-up time); the second column is a binary variable with 1 indicating that the event has occurred and 0 indicating (right) censoring. See grpreg and grpsurv for more details.

group

Unlike in grpreg, group here must be a list of vectors, each containing the integer indices or character names of the variables in that group. Variables that do not belong to any group will be discarded.
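As a rough illustration of the list format described above, the following base-R sketch builds the same overlapping groups once with integer indices and once with column names, and lists the variables that belong to no group. The matrix, its column names (V1 to V6), and the group names are all made up for this example.

```r
# Hypothetical 6-column design matrix with named columns.
set.seed(1)
X <- matrix(rnorm(60), ncol = 6,
            dimnames = list(NULL, paste0("V", 1:6)))

# Groups given as integer column indices; variables 1 and 2 each
# appear in two groups, so the groups overlap.
group.idx <- list(gr1 = c(1, 2, 3), gr2 = c(1, 4), gr3 = c(2, 5))

# The same groups given as character column names.
group.chr <- lapply(group.idx, function(g) colnames(X)[g])

# Variables that belong to no group; per the description above,
# these would be discarded.
unused <- setdiff(seq_len(ncol(X)), unlist(group.idx))
unused
```

Here only column 6 is left out of every group.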

penalty

The penalty to be applied to the model. Specify grLasso, grMCP, or grSCAD for group selection. Or specify gel or cMCP for bi-level selection, i.e., selecting important groups as well as important variables in those groups. See grpreg for more details.

family

One of "gaussian", "binomial", "poisson", or "cox", depending on the response. If family is missing, it is set to "gaussian". Specify family = "cox" for survival analysis (Cox models).

nlambda

The number of lambda values. Default is 100.

lambda

A user-supplied sequence of lambda values. Typically, this is left unspecified, and the function automatically computes a grid of lambda values that is equally spaced on the log scale over the relevant range of lambda values.

lambda.min

The smallest value for lambda, as a fraction of lambda.max. Default is .0001 if the number of observations is larger than the number of covariates and .05 otherwise.
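A minimal sketch of how such a grid could be built, assuming a decreasing sequence from lambda.max down to lambda.min * lambda.max. The value of lambda.max below is hypothetical (the function derives it from the data), and this is not necessarily the exact internal computation.

```r
nlambda    <- 100
lambda.max <- 2.5    # hypothetical value; normally computed from the data
lambda.min <- 1e-4   # fraction of lambda.max (the default when n > p)

# Equally spaced on the log scale, from lambda.max down to
# lambda.min * lambda.max.
lambda <- exp(seq(log(lambda.max), log(lambda.min * lambda.max),
                  length.out = nlambda))

# Consecutive ratios are constant, which is what uniform spacing
# on the log scale means.
ratios <- lambda[-1] / lambda[-nlambda]
range(lambda)
```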

alpha

As in grpreg, an L2 (ridge) penalty may be applied along with the group penalty. alpha controls the relative weight of the regularization parameters of these two penalties. The regularization parameter of the group penalty is lambda*alpha, while that of the ridge penalty is lambda*(1-alpha). Default is 1: no L2 penalty.
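The split between the two penalties is simple arithmetic. The values of lambda and alpha below are arbitrary and chosen only for illustration.

```r
lambda <- 0.2    # arbitrary illustrative value
alpha  <- 0.75   # arbitrary illustrative value

lambda.group <- lambda * alpha        # weight on the group penalty
lambda.ridge <- lambda * (1 - alpha)  # weight on the ridge penalty

# The two parts always add back up to lambda.
c(group = lambda.group, ridge = lambda.ridge)
```

With alpha = 1 (the default) the ridge part is zero.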

eps

Convergence threshold. The algorithm iterates until the change (on the standardized scale) in any coefficient is less than eps. Default is .001.

max.iter

The maximum number of iterations. Default is 1000. See grpreg for more details.

dfmax

Limit on the number of parameters allowed to be nonzero. If this limit is exceeded, the algorithm will exit early from the regularization path. Default is the total number of covariates.

gmax

Limit on the number of groups allowed to have nonzero elements. If this limit is exceeded, the algorithm will exit early from the regularization path. Default is the total number of groups.

gamma

Tuning parameter of the group MCP/SCAD penalty; defaults to 4 for grSCAD and 3 otherwise.

tau

Tuning parameter for the group exponential lasso; defaults to 1/3.

group.multiplier

A vector of values representing multiplicative factors by which each group's penalty is to be multiplied. Often, this is a function (such as the square root) of the number of predictors in each group. If this is not specified by the user, the internal code will, by default, use the square root of group size for the group selection methods, and a vector of 1's (i.e., no adjustment for group size) for bi-level selection.
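The two defaults described above can be written out by hand as follows. This only mirrors the description in base R; the group list is the one used in the Examples section, and the variable names gm.group and gm.bilevel are made up here.

```r
group <- list(gr1 = c(1, 2, 3), gr2 = c(1, 4), gr3 = c(2, 4, 5),
              gr4 = c(3, 5), gr5 = c(6))

# Group selection methods: square root of each group's size.
gm.group <- sqrt(lengths(group))

# Bi-level selection methods: a vector of 1's (no size adjustment).
gm.bilevel <- rep(1, length(group))

gm.group
```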

returnX

Return the new expanded design matrix? Default is FALSE. Note the storage size of this new matrix can be very large.

returnOverlap

Return the matrix of overlaps? Default is FALSE. It is a square matrix $C$ such that $C[i, j]$ is the number of variables shared by groups i and j. The diagonal value $C[i, i]$ is therefore the number of variables in group i.
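To make the definition of $C$ concrete, here is a small base-R sketch that computes such a matrix directly from a group list. The helper overlap_matrix is hypothetical and not part of the package; it only follows the definition given above.

```r
group <- list(gr1 = c(1, 2, 3), gr2 = c(1, 4), gr3 = c(2, 4, 5),
              gr4 = c(3, 5), gr5 = c(6))

# Hypothetical helper: C[i, j] counts the variables that
# group i and group j have in common.
overlap_matrix <- function(group) {
  k <- length(group)
  C <- matrix(0L, k, k, dimnames = list(names(group), names(group)))
  for (i in seq_len(k)) {
    for (j in seq_len(k)) {
      C[i, j] <- length(intersect(group[[i]], group[[j]]))
    }
  }
  C
}

C <- overlap_matrix(group)
C
```

The diagonal holds the group sizes (3, 2, 3, 2, 1), and the matrix is symmetric by construction.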

warn

Should the function give a warning if it fails to converge? Default is TRUE. See grpreg for more details.

...

Not used currently.

Value

An object with S3 class "grpregOverlap" or "grpsurvOverlap" (for Cox models), which inherits from "grpreg", with the following variables.

Details

The latent group lasso approach extends the group lasso to group variable selection with overlaps. The latent group lasso penalty is formulated so that it is equivalent to a classical non-overlapping group lasso problem in a new space, which is expanded by duplicating the columns of overlapped variables. For technical details, see Jacob et al. (2009) and Obozinski et al. (2011).

grpregOverlap takes the input design matrix X and the grouping information group, and expands X to the new, non-overlapping space. It then calls grpreg for model fitting based on the group descent algorithm. Unlike in grpreg, the interface for the group bridge-penalized method is not implemented.
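The expansion step can be sketched in base R as follows. This is an illustrative stand-in, not the package's own code: the helper expand_sketch is hypothetical, it assumes group holds integer column indices, and it lays out the duplicated columns group by group.

```r
set.seed(1)
X <- matrix(rnorm(6 * 10), ncol = 6)
group <- list(gr1 = c(1, 2, 3), gr2 = c(1, 4), gr3 = c(2, 4, 5),
              gr4 = c(3, 5), gr5 = c(6))

# Hypothetical helper: duplicate the columns of overlapped variables
# so that every group gets its own copy.
expand_sketch <- function(X, group) {
  idx <- unlist(group, use.names = FALSE)   # column indices, with repeats
  X.latent <- X[, idx, drop = FALSE]        # one column per (group, variable) pair
  # Non-overlapping group labels for the expanded columns.
  grp.vec <- rep(seq_along(group), times = lengths(group))
  list(X.latent = X.latent, grp.vec = grp.vec)
}

ex <- expand_sketch(X, group)
dim(ex$X.latent)
ex$grp.vec
```

The expanded matrix has one column per group membership (3 + 2 + 3 + 2 + 1 = 11 here), and the label vector defines an ordinary non-overlapping grouping of those columns.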

The expanded design matrix is named X.latent. It is returned in the fitted object, provided returnX is TRUE. The latent coefficient (or norm) vector corresponds to the columns of X.latent. Note that when constructing X.latent, the columns of X corresponding to variables not included in group are removed automatically.
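The relationship between latent coefficients and coefficients on the original variables can be illustrated with the values used in the Examples section. The sketch below assumes that the coefficient of an original variable is the sum of its latent copies, which is how beta.T relates to beta.latent.T in that example; the group-by-group column ordering is also an assumption.

```r
group <- list(gr1 = c(1, 2, 3), gr2 = c(1, 4), gr3 = c(2, 4, 5),
              gr4 = c(3, 5), gr5 = c(6))

# Latent coefficients, one per column of the expanded matrix,
# laid out group by group.
beta.latent <- c(5, 5, 5, 0, 0, 0, 0, 0, 5, 5, 0)

# Original variable index of each latent column.
idx <- unlist(group, use.names = FALSE)

# Sum the latent coefficients that belong to the same original variable.
beta <- as.vector(tapply(beta.latent, factor(idx, levels = 1:6), sum))
beta
```

Variable 3 appears in gr1 and gr4, so its two latent coefficients (5 and 5) add up to 10, giving c(5, 5, 10, 0, 5, 0).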

For a more detailed explanation of the penalties and algorithm, see grpreg.

References

Zeng, Y., and Breheny, P. (2016). Overlapping Group Logistic Regression with Applications to Genetic Pathway Selection. Cancer Informatics, 15, 179-187. http://doi.org/10.4137/CIN.S40043.
Jacob, L., Obozinski, G., and Vert, J. P. (2009, June). Group lasso with overlap and graph lasso. In Proceedings of the 26th annual international conference on machine learning, ACM: 433-440. http://www.machinelearning.org/archive/icml2009/papers/471.pdf
Obozinski, G., Jacob, L., and Vert, J. P. (2011). Group lasso with overlaps: the latent group lasso approach. http://arxiv.org/abs/1110.0413.
Breheny, P. and Huang, J. (2009). Penalized methods for bi-level variable selection. Statistics and Its Interface, 2: 369-380. http://myweb.uiowa.edu/pbreheny/publications/Breheny2009.pdf
Huang, J., Breheny, P. and Ma, S. (2012). A selective review of group selection in high dimensional models. Statistical Science, 27: 481-499. http://myweb.uiowa.edu/pbreheny/publications/Huang2012.pdf
Breheny, P. and Huang, J. (2015). Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors. Statistics and Computing, 25: 173-187. http://myweb.uiowa.edu/pbreheny/publications/group-computing.pdf
Breheny, P. (2014). R package 'grpreg'. https://CRAN.R-project.org/package=grpreg/grpreg.pdf

Examples

## linear regression, a simulation demo.
set.seed(123)
group <- list(gr1 = c(1, 2, 3), gr2 = c(1, 4), gr3 = c(2, 4, 5), 
              gr4 = c(3, 5), gr5 = c(6))
beta.latent.T <- c(5, 5, 5, 0, 0, 0, 0, 0, 5, 5, 0) # true latent coefficients.
# beta.T <- c(5, 5, 10, 0, 5, 0), true variables: 1, 2, 3, 5; true groups: 1, 4.
X <- matrix(rnorm(n = 6*100), ncol = 6)  
X.latent <- expandX(X, group)
y <- X.latent %*% beta.latent.T + rnorm(100)

fit <- grpregOverlap(X, y, group, penalty = 'grLasso')
# fit <- grpregOverlap(X, y, group, penalty = 'grMCP')
# fit <- grpregOverlap(X, y, group, penalty = 'grSCAD')
head(coef(fit, latent = TRUE)) # compare to beta.latent.T
plot(fit, latent = TRUE) 
head(coef(fit, latent = FALSE)) # compare to beta.T
plot(fit, latent = FALSE)

cvfit <- cv.grpregOverlap(X, y, group, penalty = 'grMCP')
plot(cvfit)
head(coef(cvfit))
summary(cvfit)

## logistic regression, real data, pathway selection
data(pathway.dat)
X <- pathway.dat$expression
group <- pathway.dat$pathways
y <- pathway.dat$mutation
fit <- grpregOverlap(X, y, group, penalty = 'grLasso', family = 'binomial')
plot(fit)
str(select(fit))
str(select(fit,criterion="AIC",df="active"))

## Not run: 
# cvfit <- cv.grpregOverlap(X, y, group, penalty = 'grLasso', family = 'binomial')
# coef(cvfit)
# predict(cvfit, X, type='response')
# predict(cvfit, X, type = 'class')
# plot(cvfit)
# plot(cvfit, type = 'all')
# summary(cvfit)
## End(Not run)
