Cross validation for Orthogonalizing EM
cv.oem(
x,
y,
penalty = c("elastic.net", "lasso", "ols", "mcp", "scad", "mcp.net", "scad.net",
"grp.lasso", "grp.lasso.net", "grp.mcp", "grp.scad", "grp.mcp.net", "grp.scad.net",
"sparse.grp.lasso"),
weights = numeric(0),
lambda = NULL,
type.measure = c("mse", "deviance", "class", "auc", "mae"),
nfolds = 10,
foldid = NULL,
grouped = TRUE,
keep = FALSE,
parallel = FALSE,
ncores = -1,
...
)
Value: an object with S3 class "cv.oem".
Arguments:

x: input matrix of dimension n x p, or a CsparseMatrix object from the Matrix
package (sparse matrices not yet implemented). Each row is an observation and
each column corresponds to a covariate. The cv.oem() function is optimized for
n >> p settings and may be very slow when p > n, so please use other packages
such as glmnet, ncvreg, grpreg, or gglasso when p > n or p is approximately n.
y: numeric response vector of length nobs.
penalty: specification of the penalty type, in lowercase letters. Choices
include "lasso", "ols" (ordinary least squares, no penalty), "elastic.net",
"scad", "mcp", "grp.lasso", and the other penalties listed in the usage above.
weights: observation weights. Defaults to 1 for each observation (setting the weight vector to length 0 defaults all weights to 1).
lambda: a user-supplied lambda sequence. By default, the program computes its own lambda sequence based on nlambda and lambda.min.ratio; supplying a value for lambda overrides this.
type.measure: measure to evaluate for cross-validation. The default,
type.measure = "deviance", uses squared error for Gaussian models (equivalent
to type.measure = "mse" there) and deviance for logistic regression.
type.measure = "class" applies to binomial models only, and
type.measure = "auc" is for two-class logistic regression only.
type.measure = "mse" or type.measure = "mae" (mean absolute error) can be used
by all models; they measure the deviation of the fitted mean from the response.
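A minimal sketch of selecting an alternative CV criterion (assuming the oem package is installed; the simulated data here are illustrative, not from this page):

```r
# Cross-validate a lasso fit using mean absolute error
# instead of the default deviance/MSE criterion.
library(oem)
set.seed(42)
x <- matrix(rnorm(500 * 10), 500, 10)
y <- rnorm(500) + x[, 1]
cvfit <- cv.oem(x, y, penalty = "lasso", type.measure = "mae")
cvfit$lambda.min  # lambda value minimizing the CV error for each model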
nfolds: number of folds for cross-validation. The default is 10; the smallest value allowed is 3.
foldid: an optional vector of values between 1 and nfolds specifying the fold to which each observation belongs.
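A hedged sketch of supplying a custom foldid (assuming the oem package is installed), which lets the same fold assignment be reused so CV errors are comparable across calls:

```r
library(oem)
set.seed(1)
n <- 1000; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- rnorm(n) + x[, 1]
fid <- sample(rep(1:5, length.out = n))  # fixed 5-fold assignment
fit1 <- cv.oem(x, y, penalty = "lasso", foldid = fid)
fit2 <- cv.oem(x, y, penalty = "mcp", foldid = fid)  # same folds as fit1
```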
grouped: as in glmnet, this is an experimental argument with default TRUE, and
can be ignored by most users. For all models, it means that nfolds separate
statistics are computed and their mean and estimated standard error are used to
describe the CV curve. If grouped = FALSE, an error matrix is built up at the
observation level from the predictions of the nfolds fits and then summarized
(does not apply to type.measure = "auc").
keep: if keep = TRUE, a prevalidated list of arrays is returned containing
fitted values for each observation and each value of lambda for each model.
Each of these fits is computed with that observation's fold omitted. The foldid
vector is also returned. The default is keep = FALSE.
parallel: if TRUE, use parallel foreach to fit each fold. A parallel backend must be registered beforehand, for example with doMC.
ncores: number of cores to use. If parallel = TRUE, ncores is automatically set to 1 to prevent conflicts.
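A sketch of setting up a parallel backend (assuming the oem and doParallel packages are installed; the data here are illustrative):

```r
library(oem)
library(doParallel)
registerDoParallel(cores = 2)  # register a backend before calling cv.oem
x <- matrix(rnorm(2000 * 10), 2000, 10)
y <- rnorm(2000) + x[, 1]
cvfit <- cv.oem(x, y, penalty = "lasso", parallel = TRUE)
stopImplicitCluster()  # release the workers when done
```

doMC::registerDoMC() would serve the same role on Unix-like systems.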
...: other parameters to be passed to the "oem" function.
References:

Huling, J.D. and Chien, P. (2022). Fast Penalized Regression and Cross Validation for Tall Data with the oem Package. Journal of Statistical Software, 104(6), 1-24. doi:10.18637/jss.v104.i06
Examples:

set.seed(123)
n.obs <- 1e4
n.vars <- 100
true.beta <- c(runif(15, -0.25, 0.25), rep(0, n.vars - 15))
x <- matrix(rnorm(n.obs * n.vars), n.obs, n.vars)
y <- rnorm(n.obs, sd = 3) + x %*% true.beta
fit <- cv.oem(x = x, y = y,
penalty = c("lasso", "grp.lasso"),
groups = rep(1:20, each = 5))
layout(matrix(1:2, ncol = 2))
plot(fit)
plot(fit, which.model = 2)
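As a follow-up sketch, coefficients and predictions at the CV-selected lambda might be extracted as below (assuming cv.oem objects support glmnet-style coef/predict methods with an s argument; these signatures are an assumption, not confirmed by this page):

```r
# Hypothetical usage, continuing from the fit above:
cf <- coef(fit, which.model = 2, s = "lambda.min")        # grp.lasso coefficients
preds <- predict(fit, newx = x[1:5, ], which.model = 1,
                 s = "lambda.min")                        # lasso predictions
```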