scglrCrossVal: Function that fits the model and selects the number of components by cross-validation.

Description

Function that fits the model and selects the number of components by cross-validation.

Usage

scglrCrossVal(formula, data, family, K = 1, nfolds = 5,
  type = "mspe", size = NULL, offset = NULL, subset = NULL,
  na.action = na.omit, crit = list(), method = methodSR(),
  mc.cores = 1)

Value

a matrix containing the criterion values for each response (rows) and each number of components (columns).
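
As a minimal sketch of how such a matrix might be read, the snippet below builds a made-up criterion matrix (the object cv, its values and its dimnames are hypothetical, not output of scglrCrossVal) and picks the column with the smallest criterion averaged over responses. It assumes a criterion where lower is better, such as "mspe". Which number of components each column stands for is not stated here, so check the column labels of the actual result before translating a column index into K.

```r
# Hypothetical criterion matrix: 3 responses (rows) x 4 candidate models (columns).
# Values and names are invented for illustration only.
cv <- matrix(
  c(2.0, 1.5, 1.2, 1.3,
    3.0, 2.2, 2.1, 2.4,
    1.8, 1.1, 0.9, 1.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("y1", "y2", "y3"), paste0("col", 1:4))
)

# Average the log-criterion over responses (same summary as in the Examples)
mean.crit <- colMeans(log(cv))

# Index of the column with the smallest average criterion
best.col <- which.min(mean.crit)
```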

Arguments

formula: an object of class "Formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted.
data: the data frame to be modeled.
family: a character vector of length q specifying the distributions of the responses. Bernoulli, binomial, poisson and gaussian are allowed.
K: number of components, default is one.
nfolds: number of folds, default is 5. Although nfolds can be as large as the sample size (leave-one-out CV), it is not recommended for large datasets.
type: loss function to use for cross-validation. Six options are currently available, depending on whether the responses share the same distribution family. If all responses are Bernoulli distributed, prediction performance may be measured through the area under the ROC curve: type = "auc". In any other case, choose one of the following five options: "likelihood", "aic", "aicc", "bic", "mspe".
size: specifies the number of trials of the binomial variables included in the model. An (n*qb) matrix is expected for qb binomial variables.
offset: used for the Poisson dependent variables. A vector, or a matrix with one row per observation and one column per Poisson dependent variable, is expected.
subset: an optional vector specifying a subset of observations to be used in the fitting process.
na.action: a function which indicates what should happen when the data contain NAs. The default is na.omit.
crit: a list of two elements, maxit and tol, giving respectively the maximum number of iterations and the convergence tolerance for the Fisher scoring algorithm. Defaults are 50 and 10e-6 respectively.
method: Regularization criterion type. Object of class "method.SCGLR" built by methodSR for Structural Relevance.
mc.cores: maximum number of cores to use for parallelization (not yet available on Windows, and strongly discouraged in interactive mode).
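
To make the roles of nfolds and type = "mspe" concrete, here is a minimal base R sketch of K-fold cross-validation with a mean squared prediction error. It is not the package's implementation: the simulated data frame d, the use of lm as a stand-in model, and all variable names are assumptions made for illustration only.

```r
# Illustration only: generic K-fold CV with lm() as a stand-in for the fitted model.
set.seed(1)
n <- 50
d <- data.frame(x = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n)

nfolds <- 5
# Assign each observation to one of nfolds folds of (nearly) equal size
folds <- sample(rep(seq_len(nfolds), length.out = n))

mspe <- numeric(nfolds)
for (k in seq_len(nfolds)) {
  train <- d[folds != k, ]   # fit on all folds but the k-th
  test  <- d[folds == k, ]   # predict on the held-out fold
  fit   <- lm(y ~ x, data = train)
  pred  <- predict(fit, newdata = test)
  mspe[k] <- mean((test$y - pred)^2)
}

# Cross-validated criterion: average over folds
cv.mspe <- mean(mspe)
```

Setting nfolds equal to n in this sketch gives leave-one-out cross-validation, which requires n model fits and is why it becomes costly for large datasets.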

References

Bry X., Trottier C., Verron T. and Mortier F. (2013) Supervised Component Generalized Linear Regression using a PLS-extension of the Fisher scoring algorithm. Journal of Multivariate Analysis, 119, 47-60.

Examples

if (FALSE) {
library(SCGLR)

# load sample data
data(genus)

# get variable names from dataset
n <- names(genus)
ny <- n[grep("^gen",n)]    # Y <- names that begin with "gen"
nx <- n[-grep("^gen",n)]   # X <- remaining names

# remove "geology" and "surface" from nx
# as surface is offset and we want to use geology as additional covariate
nx <- nx[!nx %in% c("geology","surface")]

# build multivariate formula
# we also add "lat*lon" as computed covariate
form <- multivariateFormula(ny,c(nx,"I(lat*lon)"),A=c("geology"))

# define family
fam <- rep("poisson",length(ny))

# cross validation
genus.cv <- scglrCrossVal(formula=form, data=genus, family=fam, K=12,
 offset=genus$surface)

# find best K
mean.crit <- colMeans(log(genus.cv))

#plot(mean.crit, type="l")
}
