Does k-fold cross-validation for sail and determines the optimal tuning parameter \(\lambda\).
cv.sail(x, y, e, ..., weights, lambda = NULL, type.measure = c("mse",
"deviance", "class", "auc", "mae"), nfolds = 10, foldid,
grouped = TRUE, keep = FALSE, parallel = FALSE)
x: input matrix of dimension n x p, where n is the number of subjects and
p is the number of X variables. Each row is an observation vector. Can be a
high-dimensional (n < p) matrix. Can be a user-defined design matrix of main
effects only (without intercept) if expand=FALSE.
y: response variable. For family="gaussian", should be a 1-column matrix or
numeric vector. For family="binomial", should be a 1-column matrix or numeric
vector with -1 for failure and 1 for success.
e: exposure or environment vector. Must be a numeric vector. Factors must be converted to numeric.
...: other arguments that can be passed to sail.
weights: observation weights. Default is 1 for each observation. Currently NOT IMPLEMENTED.
lambda: optional user-supplied lambda sequence; default is NULL, and sail
chooses its own sequence.
type.measure: loss to use for cross-validation. Currently only 3 options are
implemented. The default is type.measure="deviance", which uses squared error
for gaussian models (and is equivalent to type.measure="mse" there).
type.measure="mae" (mean absolute error) can also be used, which measures the
absolute deviation from the fitted mean to the response (\(|y-\hat{y}|\)).
nfolds: number of folds. Although nfolds can be as large as the sample size
(leave-one-out CV), this is not recommended for large datasets. The smallest
allowable value is nfolds=3. Default: 10.
foldid: an optional vector of values between 1 and nfolds identifying which
fold each observation is in. If supplied, nfolds can be missing. Often used
when tuning the second tuning parameter (\(\alpha\)) as well (see details).
grouped: this is an experimental argument, with default TRUE, and can be
ignored by most users. It refers to computing nfolds separate statistics, and
then using their mean and estimated standard error to describe the CV curve.
If grouped=FALSE, an error matrix is built up at the observation level from
the predictions from the nfolds fits, and then summarized (does not apply to
type.measure="auc"). Default: TRUE.
keep: if keep=TRUE, a prevalidated array is returned containing fitted values
for each observation and each value of lambda. This means these fits are
computed with this observation and the rest of its fold omitted. The foldid
vector is also returned. Default: FALSE.
parallel: if TRUE, use parallel foreach to fit each fold. A parallel backend
must be registered beforehand using the registerDoParallel function from the
doParallel package. See the example below for details. Default: FALSE.
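The fold-assignment vector described above (the foldid argument in the usage) expects one fold label per observation. A minimal base R sketch of building such a vector with roughly balanced folds follows. The sample size n, the seed and the variable names are illustrative assumptions, not values taken from sail.

```r
# Illustrative values only: n and nfolds are assumptions for this sketch.
n <- 100
nfolds <- 10

set.seed(123)  # fix the random fold assignment so it can be reused

# Repeat the labels 1..nfolds up to length n, then shuffle them.
foldid <- sample(rep(seq_len(nfolds), length.out = n))

# Each observation now has exactly one fold label between 1 and nfolds.
table(foldid)
```

A vector built this way can be passed as foldid = foldid, in which case nfolds can be left out.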
An object of class "cv.sail" is returned, which is a list with the
ingredients of the cross-validation fit.
lambda: the values of converged lambda used in the fits.
cvm: the mean cross-validated error, a vector of length length(lambda).
cvsd: estimate of standard error of cvm.
cvup: upper curve = cvm+cvsd.
cvlo: lower curve = cvm-cvsd.
nzero: number of non-zero coefficients at each lambda. This is the sum of the
total non-zero main effects and interactions. Note that when expand=TRUE, we
only count a variable once in the calculation of nzero, i.e., if a variable is
expanded to three columns, then this is only counted once even though all
three coefficients are estimated to be non-zero.
name: a text string indicating type of measure (for plotting purposes).
sail.fit: a fitted sail object for the full data.
lambda.min: value of lambda that gives minimum cvm.
lambda.1se: largest value of lambda such that error is within 1 standard
error of the minimum.
fit.preval: if keep=TRUE, this is the array of prevalidated fits. Some entries
can be NA, if that and subsequent values of lambda are not reached for that
fold.
foldid: if keep=TRUE, the fold assignments used.
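To make the link between the mean cross-validated error, its standard error and the two selected lambda values concrete, the sketch below applies the one-standard-error rule to made-up numbers. The vectors are invented for illustration, and the rule follows the descriptions above (which mirror the glmnet convention); it is not copied from the sail source code.

```r
# Made-up CV summary for five lambda values (largest to smallest).
lambda <- c(1, 0.5, 0.25, 0.125, 0.0625)
cvm    <- c(5.0, 3.0, 2.1, 2.0, 2.2)    # mean cross-validated error
cvsd   <- c(0.4, 0.3, 0.2, 0.15, 0.2)   # standard error of cvm

cvup <- cvm + cvsd   # upper curve
cvlo <- cvm - cvsd   # lower curve

# lambda.min: the lambda with the smallest mean CV error.
idx_min    <- which.min(cvm)
lambda_min <- lambda[idx_min]

# lambda.1se: the largest lambda whose error is within one standard
# error of the minimum.
threshold  <- cvm[idx_min] + cvsd[idx_min]
lambda_1se <- max(lambda[cvm <= threshold])

c(lambda.min = lambda_min, lambda.1se = lambda_1se)
```

With these toy numbers the minimum sits at lambda = 0.125, while the one-standard-error rule picks the larger, more heavily penalized lambda = 0.25.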
The function runs sail nfolds+1 times: the first to get the lambda sequence,
and the remainder to compute the fit with each of the folds omitted. Note that
a new lambda sequence is computed for each of the folds, and then we use the
predict method to get the solution path at each value of the original lambda
sequence. The error is accumulated, and the average error and standard
deviation over the folds are computed. Note that cv.sail does NOT search for
values of alpha. A specific value should be supplied, else alpha=0.5 is
assumed by default. If users would like to cross-validate alpha as well, they
should call cv.sail with a pre-computed vector foldid, and then use this same
fold vector in separate calls to cv.sail with different values of alpha. Note
also that the results of cv.sail are random, since the folds are selected at
random. Users can reduce this randomness by running cv.sail many times and
averaging the error curves.
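The advice on cross-validating alpha can be sketched as follows. This is a hedged illustration, not a recommended recipe: it assumes the sail package and its bundled sailsim data are installed, that alpha is forwarded to sail through the ... argument, and the grid of alpha values, the seed and the small nlambda, maxit and dfmax settings (borrowed from the package example) are arbitrary choices made to keep the run short.

```r
# Hypothetical grid of alpha values to compare; not prescribed by sail.
alphas <- c(0.2, 0.5, 0.8)
nfolds <- 3

# Only run the fits when the sail package is available.
if (requireNamespace("sail", quietly = TRUE)) {
  data("sailsim", package = "sail")
  f.basis <- function(i) splines::bs(i, degree = 3)

  # One fold vector, reused for every alpha so the error curves are comparable.
  set.seed(1)
  foldid <- sample(rep(seq_len(nfolds), length.out = nrow(sailsim$x)))

  fits <- lapply(alphas, function(a) {
    sail::cv.sail(x = sailsim$x, y = sailsim$y, e = sailsim$e,
                  alpha = a, foldid = foldid,
                  basis = f.basis, nlambda = 10, maxit = 25, dfmax = 5)
  })

  # Compare the smallest mean CV error reached under each alpha.
  min_cvm    <- vapply(fits, function(f) min(f$cvm, na.rm = TRUE), numeric(1))
  best_alpha <- alphas[which.min(min_cvm)]
}
```

Reusing the same foldid across calls means differences in the error curves reflect the change in alpha and not a different random split.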
Jerome Friedman, Trevor Hastie, Robert Tibshirani (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1-22. http://www.jstatsoft.org/v33/i01/.
Bhatnagar SR, Yang Y, Greenwood CMT. Sparse additive interaction models with the strong heredity property (2018+). Preprint.
# NOT RUN {
library(sail)
# cubic B-spline basis used to expand each column of x
f.basis <- function(i) splines::bs(i, degree = 3)
data("sailsim")
# Parallel
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)
cvfit <- cv.sail(x = sailsim$x, y = sailsim$y, e = sailsim$e,
                 parallel = TRUE, nlambda = 10,
                 maxit = 25, basis = f.basis,
                 nfolds = 3, dfmax = 5)
stopCluster(cl)
# plot cross validated curve
plot(cvfit)
# solution at lambda.min
coef(cvfit, s = "lambda.min")
# solution at lambda.1se
coef(cvfit, s = "lambda.1se")
# non-zero coefficients at lambda.min
predict(cvfit, s = "lambda.min", type = "nonzero")
# }