customizedTraining (version 1.2)

cv.customizedGlmnet: cross validation for customizedGlmnet

Description

Does k-fold cross-validation for customizedGlmnet and returns optimal values for G and lambda

Usage

cv.customizedGlmnet(xTrain, yTrain, xTest = NULL, groupid = NULL, Gs = NULL,
    dendrogram = NULL, dendrogramCV = NULL, lambda = NULL,
    nfolds = 10, foldid = NULL, keep = FALSE,
    family = c("gaussian", "binomial", "multinomial"), verbose = FALSE)

Arguments

xTrain

an n-by-p matrix of training covariates

yTrain

a length-n vector of training responses. Numeric for family = "gaussian". Factor or character for family = "binomial" or family = "multinomial"

xTest

an m-by-p matrix of test covariates. May be left NULL, in which case cross-validation predictions are made internally on the training set and no test predictions are returned.

groupid

an optional length-m vector of group memberships for the test set. If specified, customized training subsets are identified using the union of nearest-neighbor sets for each test group, and cross-validation is used only to select the regularization parameter lambda, not the number of clusters G. Either groupid or Gs must be specified.

Gs

a vector of positive integers giving the numbers of clusters over which to perform cross-validation to determine the best number. Ignored if groupid is specified. Either groupid or Gs must be specified.

dendrogram

optional output from hclust on the joint covariate data. Useful to specify in advance if the method is being used several times, to avoid redundant computation

dendrogramCV

optional output from hclust on the training covariate data. Used as the joint clustering result during cross-validation. Useful to specify in advance if the method is being used several times, to avoid redundant computation
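When the same data are clustered repeatedly (for example, when calling cv.customizedGlmnet several times with different settings), the two dendrograms can be computed once up front and passed in through these arguments. A minimal sketch on synthetic data; the dist/hclust settings shown (hclust defaults on Euclidean distances) are an assumption for illustration and may differ from the package's internal choices:

```r
# Precompute both dendrograms once so repeated calls to cv.customizedGlmnet()
# do not re-run the hierarchical clustering each time.
# Assumption: hclust defaults on Euclidean dist(); internal settings may differ.
set.seed(1)
x.train = matrix(rnorm(100 * 10), 100, 10)
y.train = rnorm(100)
x.test  = matrix(rnorm(50 * 10), 50, 10)

joint.dend = hclust(dist(rbind(x.train, x.test)))  # joint train + test data
train.dend = hclust(dist(x.train))                 # training data only, for CV

if (requireNamespace("customizedTraining", quietly = TRUE)) {
    fit = customizedTraining::cv.customizedGlmnet(x.train, y.train, x.test,
        Gs = c(1, 2), dendrogram = joint.dend, dendrogramCV = train.dend,
        family = "gaussian")
}
```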

lambda

a sequence of values to use for the regularization parameter lambda. Recommended to leave as NULL and allow glmnet to choose automatically.

nfolds

number of folds -- default is 10. Ignored if foldid is specified

foldid

an optional length-n vector of fold memberships used for cross-validation

keep

Should fitted values on the training set from cross-validation be included in the output? Default is FALSE.

family

response type: one of "gaussian", "binomial" or "multinomial"

verbose

Should progress be printed to console as folds are evaluated during cross-validation? Default is FALSE.

Value

an object of class cv.customizedGlmnet

call

the call that produced this object

G.min

the number of clusters minimizing CV error. Not returned if groupid is specified

lambda

the sequence of values of the regularization parameter lambda considered

lambda.min

the value of the regularization parameter lambda minimizing CV error

error

a matrix containing the CV error for each G and lambda

fit

a customizedGlmnet object fit using G.min and lambda.min. Only returned if xTest is not NULL.

prediction

a length-m vector of predictions for the test set, using the tuning parameters which minimize cross-validation error. Only returned if xTest is not NULL.

selected

a list of nonzero variables for each customized training set, using G.min and lambda.min. Only returned if xTest is not NULL.

cv.fit

an array containing fitted values on the training set from cross-validation. Only returned if keep is TRUE.

References

Scott Powers, Trevor Hastie and Robert Tibshirani (2015) "Customized training with an application to mass spectrometric imaging of gastric cancer data." Annals of Applied Statistics 9, 4:1709-1725.

See Also

customizedGlmnet, plot.cv.customizedGlmnet, predict.cv.customizedGlmnet

Examples

require(customizedTraining)
require(glmnet)

# Simulate synthetic data

n = m = 150
p = 50
q = 5
K = 3
sigmaC = 10
sigmaX = sigmaY = 1
set.seed(5914)

beta = matrix(0, nrow = p, ncol = K)
for (k in 1:K) beta[sample(1:p, q), k] = 1
c = matrix(rnorm(K*p, 0, sigmaC), K, p)
eta = rnorm(K)
pi = (exp(eta)+1)/sum(exp(eta)+1)
z = t(rmultinom(m + n, 1, pi))
x = crossprod(t(z), c) + matrix(rnorm((m + n)*p, 0, sigmaX), m + n, p)
y = rowSums(z*(crossprod(t(x), beta))) + rnorm(m + n, 0, sigmaY)

x.train = x[1:n, ]
y.train = y[1:n]
x.test = x[n + 1:m, ]
y.test = y[n + 1:m]
foldid = sample(rep(1:10, length = nrow(x.train)))


# Example 1: Use clustering to fit the customized training model to training
# and test data with no predefined test-set blocks

fit1 = cv.customizedGlmnet(x.train, y.train, x.test, Gs = c(1, 2, 3, 5),
    family = "gaussian", foldid = foldid)

# Print the optimal number of groups and value of lambda:
fit1$G.min
fit1$lambda.min

# Print the customized training model fit:
fit1

# Compute test error using the predict function:
mean((y[n + 1:m] - predict(fit1))^2)

# Plot nonzero coefficients by group:
plot(fit1)


# Example 2: If the test set has predefined blocks, use these blocks to define
# the customized training sets, instead of using clustering.
foldid = apply(z == 1, 1, which)[1:n]
group.id = apply(z == 1, 1, which)[n + 1:m]

fit2 = cv.customizedGlmnet(x.train, y.train, x.test, group.id, foldid = foldid)

# Print the optimal value of lambda:
fit2$lambda.min

# Print the customized training model fit:
fit2

# Compute test error using the predict function:
mean((y[n + 1:m] - predict(fit2))^2)

# Plot nonzero coefficients by group:
plot(fit2)


# Example 3: If there is no test set, but the training set is organized into
# blocks, you can do cross-validation with these blocks as the basis for the
# customized training sets.

fit3 = cv.customizedGlmnet(x.train, y.train, foldid = foldid)

# Print the optimal value of lambda:
fit3$lambda.min

# Print the customized training model fit:
fit3

# Compute test error using the predict function:
mean((y[n + 1:m] - predict(fit3))^2)

# Plot nonzero coefficients by group:
plot(fit3)