CMA (version 1.30.0)

tune: Hyperparameter tuning for classifiers

Description

Most classifiers implemented in this package depend on one or even several hyperparameters (see Details) that should be optimized to obtain good (and comparable!) results. As tuning scheme, we propose three-fold cross-validation on each learningset (for fixed selected variables). Note that learningsets usually do not contain the complete dataset, so tuning involves a second level of splitting the dataset. Increasing the number of folds leads to larger inner training sets (and possibly to higher accuracy), but also to higher computing times. For S4 method information, see tune-methods.
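
For orientation, here is a minimal sketch of the two-level splitting described above, using the golub dataset shipped with the package; the object names (golubY, golubX, lset, tuneres) are illustrative only and are reused in the sketches further below:

library(CMA)
data(golub)
golubY <- golub[,1]                 # class labels (first column)
golubX <- as.matrix(golub[,-1])     # gene expression matrix
set.seed(111)
## outer level: five learningsets via cross-validation
lset <- GenerateLearningsets(y = golubY, method = "CV", fold = 5, strat = TRUE)
## inner level: tune() runs three-fold cross-validation (the default)
## within each learningset; here a one-dimensional grid for knnCMA
tuneres <- tune(X = golubX, y = golubY, learningsets = lset,
                classifier = knnCMA, grids = list(k = 1:10))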

Usage

tune(X, y, f, learningsets, genesel, genesellist = list(), nbgene, classifier, fold = 3, strat = FALSE, grids = list(), trace = TRUE, ...)

Arguments

X
Gene expression data. Can be one of the following:
  • A matrix. Rows correspond to observations, columns to variables.
  • A data.frame, when f is not missing (see below).
  • An object of class ExpressionSet.

y
Class labels. Can be one of the following:
  • A numeric vector.
  • A factor.
  • A character string specifying the phenotype variable, if X is an ExpressionSet.
  • missing, if X is a data.frame and a proper formula f is provided.

f
A two-sided formula, if X is a data.frame. The left-hand side corresponds to the class labels, the right-hand side to the variables.
learningsets
An object of class learningsets. May be missing; in that case, the complete dataset is used as the learning set.
genesel
Optional (but usually recommended) object of class genesel containing variable importance information for the argument learningsets.
genesellist
In the case that the argument genesel is missing, this is an argument list passed to GeneSelection. If both genesel and genesellist are missing, no variable selection is performed.
nbgene
Number of best genes to be kept for classification, based on either genesel or the call to GeneSelection using genesellist. If both are missing, this argument is not needed. Note:
  • If the gene selection method was one of "lasso", "elasticnet", "boosting", nbgene will be reset to min(s, nbgene), where s is the number of nonzero coefficients.
  • If the gene selection scheme was "one-vs-all" or "pairwise" in the multiclass case, there are several rankings. The top nbgene genes of each ranking are kept, so the effective number of genes used will sometimes be much larger.

classifier
Name of the classifier function to be used (a function ending in CMA, e.g. knnCMA).
fold
The number of cross-validation folds used within each learningset. Default is 3. Increasing fold will lead to higher computing times.
strat
Should stratified cross-validation according to the class proportions in the complete dataset be used? Default is FALSE.
grids
A named list. The names correspond to the arguments to be tuned, e.g. k (the number of nearest neighbours) for knnCMA, or cost for svmCMA. Each element is a numeric vector defining the grid of candidate values. Several hyperparameters can be tuned simultaneously, though this requires considerably more computing time. By default, grids is an empty list; in that case, a pre-defined grid is used, see Details.
trace
Should progress be traced? Default is TRUE.
...
Further arguments to be passed to classifier; naturally, these must not include any of the arguments being tuned.
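
To illustrate how these arguments interact, the following hedged sketch tunes two hyperparameters of svmCMA at once, with prior gene selection; kernel is not tuned itself but forwarded to the classifier via .... It reuses golubX, golubY and lset from the sketch in the Description:

tunesvm <- tune(X = golubX, y = golubY, learningsets = lset,
                genesellist = list(method = "t.test"),   # variable selection
                nbgene = 100,                            # keep the top 100 genes
                classifier = svmCMA,
                grids = list(cost = c(0.1, 1, 10, 100),  # two-dimensional grid
                             gamma = 2^(-2:2)),
                kernel = "radial")                       # passed on via ...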

Value

An object of class tuningresult.
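
The returned object can be inspected with the methods also shown in the Examples below (continuing the sketch from the Description):

show(tuneres)            # grids and resulting misclassification rates
best(tuneres)            # best hyperparameter value(s) per learningset
plot(tuneres, iter = 3)  # tuning curve for the third learningset
## the result can subsequently be supplied to classification()
## via its tuneres argument (see the classification help page)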

Details

The following default settings are used if the argument grids is an empty list:
gbmCMA
n.trees = c(50, 100, 200, 500, 1000)

compBoostCMA
mstop = c(50, 100, 200, 500, 1000)

LassoCMA
norm.fraction = seq(from=0.1, to=0.9, length=9)

ElasticNetCMA
norm.fraction = seq(from=0.1, to=0.9, length=5), alpha = 2^{-(5:1)}

plrCMA
lambda = 2^{-4:4}

pls_ldaCMA
comp = 1:10

pls_lrCMA
comp = 1:10

pls_rfCMA
comp = 1:10

rfCMA
mtry = ceiling(c(0.1, 0.25, 0.5, 1, 2)*sqrt(ncol(X))), nodesize = c(1,2,3)

knnCMA
k=1:10

pknnCMA
k = 1:10

scdaCMA
delta = c(0.1, 0.25, 0.5, 1, 2, 5)

pnnCMA
sigma = c(2^{-2:2})

nnetCMA
size = 1:5, decay = c(0, 2^{-(4:1)})

svmCMA, kernel = "linear"
cost = c(0.1, 1, 5, 10, 50, 100, 500)

svmCMA, kernel = "radial"
cost = c(0.1, 1, 5, 10, 50, 100, 500), gamma = 2^{-2:2}

svmCMA, kernel = "polynomial"
cost = c(0.1, 1, 5, 10, 50, 100, 500), degree = 2:4
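
For instance, leaving grids empty invokes the pre-defined grid from the table above, while supplying a named list overrides it. A sketch, reusing the objects from the Description:

## default grid: delta = c(0.1, 0.25, 0.5, 1, 2, 5), as tabled above
tunedef <- tune(X = golubX, y = golubY, learningsets = lset,
                classifier = scdaCMA)
## explicit, coarser grid for the same hyperparameter
tunecust <- tune(X = golubX, y = golubY, learningsets = lset,
                 classifier = scdaCMA,
                 grids = list(delta = c(0.5, 1, 2)))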

References

Slawski, M., Daumer, M., Boulesteix, A.-L. (2008). CMA - A comprehensive Bioconductor package for supervised classification with high dimensional data. BMC Bioinformatics 9: 439.

See Also

tuningresult, GeneSelection, classification

Examples

## Not run: 
# ### simple example for a one-dimensional grid, using compBoostCMA.
# ### dataset
# data(golub)
# golubY <- golub[,1]
# golubX <- as.matrix(golub[,-1])
# ### learningsets
# set.seed(111)
# lset <- GenerateLearningsets(y=golubY, method = "CV", fold=5, strat = TRUE)
# ### tuning after gene selection with the t.test
# tuneres <- tune(X = golubX, y = golubY, learningsets = lset,
#               genesellist = list(method = "t.test"),
#               classifier=compBoostCMA, nbgene = 100,
#               grids = list(mstop = c(50, 100, 250, 500, 1000)))
# ### inspect results
# show(tuneres)
# best(tuneres)
# plot(tuneres, iter = 3)
## End(Not run)
