Cross validation for Orthogonalizing EM
cv.oem(
x,
y,
penalty = c("elastic.net", "lasso", "ols", "mcp", "scad", "mcp.net", "scad.net",
"grp.lasso", "grp.lasso.net", "grp.mcp", "grp.scad", "grp.mcp.net", "grp.scad.net",
"sparse.grp.lasso"),
weights = numeric(0),
lambda = NULL,
type.measure = c("mse", "deviance", "class", "auc", "mae"),
nfolds = 10,
foldid = NULL,
grouped = TRUE,
keep = FALSE,
parallel = FALSE,
ncores = -1,
...
)
Value: an object with S3 class "cv.oem".
Arguments:

x: input matrix of dimension n x p, or a CsparseMatrix object from the Matrix
package (sparse matrices not yet implemented). Each row is an observation and
each column corresponds to a covariate. The cv.oem() function is optimized for
n >> p settings and may be very slow when p > n, so please use other packages
such as glmnet, ncvreg, grpreg, or gglasso when p > n or p is approximately n.
y: numeric response vector of length nobs.
penalty: specification of the penalty type, in lowercase letters. Choices
include "lasso", "ols" (ordinary least squares, no penalty), "elastic.net",
"scad", "mcp", "grp.lasso", and the other penalties listed in the usage above.
weights: observation weights. Defaults to 1 for each observation (setting the weight vector to length 0 defaults all weights to 1).
lambda: a user-supplied lambda sequence. By default, the program computes its own lambda sequence based on nlambda and lambda.min.ratio; supplying a value for lambda overrides this.
type.measure: measure to evaluate for cross-validation. The default,
type.measure = "deviance", uses squared error for Gaussian models (equivalent
to type.measure = "mse" there) and deviance for logistic regression.
type.measure = "class" applies to binomial models only, and
type.measure = "auc" is for two-class logistic regression only.
type.measure = "mse" or type.measure = "mae" (mean absolute error) can be used
by all models; they measure the deviation of the fitted mean from the response.
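A minimal sketch of selecting an alternative CV criterion (assuming the oem package is installed; the simulated data here are illustrative, not from this page):

```r
# Cross-validate a lasso fit using mean absolute error
# instead of the default deviance/MSE criterion.
library(oem)
set.seed(42)
x <- matrix(rnorm(500 * 10), 500, 10)
y <- rnorm(500) + x[, 1]
cvfit <- cv.oem(x, y, penalty = "lasso", type.measure = "mae")
cvfit$lambda.min  # lambda value minimizing the CV error for each model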
nfolds: number of folds for cross-validation. The default is 10; the smallest value allowed is 3.
foldid: an optional vector of values between 1 and nfolds specifying the fold to which each observation belongs.
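A hedged sketch of supplying a custom foldid (assuming the oem package is installed), which lets the same fold assignment be reused so CV errors are comparable across calls:

```r
library(oem)
set.seed(1)
n <- 1000; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- rnorm(n) + x[, 1]
fid <- sample(rep(1:5, length.out = n))  # fixed 5-fold assignment
fit1 <- cv.oem(x, y, penalty = "lasso", foldid = fid)
fit2 <- cv.oem(x, y, penalty = "mcp", foldid = fid)  # same folds as fit1
```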
grouped: as in glmnet, this is an experimental argument with default TRUE, and
can be ignored by most users. For all models, it means that nfolds separate
statistics are computed and their mean and estimated standard error are used to
describe the CV curve. If grouped = FALSE, an error matrix is built up at the
observation level from the predictions of the nfolds fits and then summarized
(does not apply to type.measure = "auc").
keep: if keep = TRUE, a prevalidated list of arrays is returned containing
fitted values for each observation and each value of lambda for each model.
Each of these fits is computed with that observation's fold omitted. The foldid
vector is also returned. The default is keep = FALSE.
parallel: if TRUE, use parallel foreach to fit each fold. A parallel backend must be registered beforehand, for example with doMC.
ncores: number of cores to use. If parallel = TRUE, ncores is automatically set to 1 to prevent conflicts.
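A sketch of setting up a parallel backend (assuming the oem and doParallel packages are installed; the data here are illustrative):

```r
library(oem)
library(doParallel)
registerDoParallel(cores = 2)  # register a backend before calling cv.oem
x <- matrix(rnorm(2000 * 10), 2000, 10)
y <- rnorm(2000) + x[, 1]
cvfit <- cv.oem(x, y, penalty = "lasso", parallel = TRUE)
stopImplicitCluster()  # release the workers when done
```

doMC::registerDoMC() would serve the same role on Unix-like systems.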
...: other parameters to be passed to the "oem" function.
References:

Huling, J.D. and Chien, P. (2022). Fast Penalized Regression and Cross Validation for Tall Data with the oem Package. Journal of Statistical Software, 104(6), 1-24. doi:10.18637/jss.v104.i06
Examples:

set.seed(123)
n.obs <- 1e4
n.vars <- 100
true.beta <- c(runif(15, -0.25, 0.25), rep(0, n.vars - 15))
x <- matrix(rnorm(n.obs * n.vars), n.obs, n.vars)
y <- rnorm(n.obs, sd = 3) + x %*% true.beta
fit <- cv.oem(x = x, y = y,
penalty = c("lasso", "grp.lasso"),
groups = rep(1:20, each = 5))
layout(matrix(1:2, ncol = 2))
plot(fit)
plot(fit, which.model = 2)
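As a follow-up sketch, coefficients and predictions at the CV-selected lambda might be extracted as below (assuming cv.oem objects support glmnet-style coef/predict methods with an s argument; these signatures are an assumption, not confirmed by this page):

```r
# Hypothetical usage, continuing from the fit above:
cf <- coef(fit, which.model = 2, s = "lambda.min")        # grp.lasso coefficients
preds <- predict(fit, newx = x[1:5, ], which.model = 1,
                 s = "lambda.min")                        # lasso predictions
```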