performCrossValidation,KernelMatrix-method: KeBABS Cross Validation

Description

Perform cross validation as k-fold cross validation, Leave-One-Out cross validation (LOOCV) or grouped cross validation (GCV).

Usage

## kbsvm(......, cross=0, noCross=1, .....)
## please use kbsvm for cross validation and do not call the
## performCrossValidation method directly
"performCrossValidation"(object, x, y, sel, model, cross, noCross, groupBy, perfParameters, verbose)

Arguments

object

a kernel matrix or an explicit representation

x

an optional set of sequences

y

a response vector

sel

sample subset for which cross validation should be performed

model

KeBABS model

cross

an integer value k > 0 indicates that k-fold cross validation should be performed. The value -1 selects Leave-One-Out (LOO) cross validation. Default=0

noCross

an integer value larger than 0 is used to specify the number of repetitions for cross validation. This parameter is only relevant if 'cross' is different from 0. Default=1

groupBy

allows a grouping of samples during cross validation. The parameter is only relevant when 'cross' is larger than 1. It is an integer vector or factor with the same length as the number of samples used for training and specifies for each sample to which group it belongs. Samples from the same group are never spread over more than one fold. Grouped cross validation can also be used in grid search for each grid point. Default=NULL

perfParameters

a character vector with one or several values from the set "ACC", "BACC", "MCC", "AUC" and "ALL". "ACC" stands for accuracy, "BACC" for balanced accuracy, "MCC" for Matthews Correlation Coefficient, "AUC" for area under the ROC curve and "ALL" for all four. This parameter defines which performance parameters are collected in cross validation for display purposes. The summary values are computed as the mean of the fold values. AUC computation from pooled decision values requires a calibrated classifier output and is currently not supported. Default=NULL

verbose

boolean value that indicates whether KeBABS should print additional messages showing the internal processing logic in a verbose manner. The default value depends on the R session verbosity option. Default=getOption("verbose")

This parameter is not relevant for cross validation because the method performCrossValidation should not be called directly. Cross validation is performed with the method kbsvm, where the parameters cross and noCross are described.

Value

Cross validation stores its results in the KeBABS model object returned by kbsvm. They can be retrieved with the accessor cvResult.

Details

Overview

Cross validation (CV) provides an estimate of the generalization performance of a model by training repeatedly on different subsets of the data and evaluating the prediction performance on the remaining data not used for training. Depending on the strategy for splitting the data, different variants of cross validation exist. KeBABS implements k-fold cross validation, Leave-One-Out cross validation and Leave-Group-Out cross validation, which is a specific variant of k-fold cross validation. Cross validation is invoked with kbsvm by setting the parameters cross and noCross. It can be used either for a given kernel and specific values of the SVM hyperparameters to compute the cross validation error of a single model, or in conjunction with grid search (see gridSearch) and model selection (see modelSelection) to determine the performance of multiple models.

k-fold Cross Validation and Leave-One-Out Cross Validation (LOOCV)

For k-fold cross validation the data is split into k roughly equal sized subsets called folds. Samples are assigned to the folds randomly. In k successive training runs each fold is held out once, in round-robin manner, for evaluating the prediction performance, while the other k-1 folds together serve as training data. Typical values for the number of folds k are 5 or 10, depending on the number of samples used for CV. For LOOCV the fold size decreases to 1: only a single sample is held out for performance evaluation, so one cross validation run requires as many training runs as there are sequences used for CV.
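The fold assignment described above can be sketched in base R. This is only an illustration of the idea, not the KeBABS implementation: the helper assignFolds and the sample counts are hypothetical names and values chosen for the example.

```r
## Hypothetical helper, not part of KeBABS: assign n samples randomly
## to k folds of roughly equal size.
assignFolds <- function(n, k) {
    stopifnot(k >= 2, k <= n)
    ## repeat the fold labels 1..k until n labels exist, then shuffle them
    labels <- rep(seq_len(k), length.out = n)
    labels[sample.int(n)]
}

set.seed(42)
folds <- assignFolds(n = 50, k = 5)
table(folds)   # fold sizes

## round-robin over the folds: each fold is held out exactly once
for (i in seq_len(5)) {
    testIdx  <- which(folds == i)
    trainIdx <- which(folds != i)
    ## train on trainIdx, predict on testIdx (model fitting omitted here)
    stopifnot(length(intersect(testIdx, trainIdx)) == 0)
}

## LOOCV is the special case k = n: every fold holds exactly one sample
looFolds <- assignFolds(n = 50, k = 50)
```

In practice the split is handled internally by kbsvm through the parameter cross, so a helper like this is not needed when working with KeBABS.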

Grouped Cross Validation (GCV)

For grouped cross validation samples are assigned to groups by the user before running cross validation, e.g. via clustering the sequences. The predefined group assignment is passed to CV with the parameter groupBy in kbsvm. GCV is a special version of k-fold cross validation which respects group boundaries: the samples of one group are never distributed over multiple folds. In this way the group(s) in the test fold do not occur during training, and learning is forced to concentrate on more complex features instead of the simple features that separate the groups. For GCV the parameter cross must be smaller than or equal to the number of groups.
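A minimal base R sketch of a group-respecting fold assignment follows. It assumes a simple round-robin distribution of shuffled groups over the folds; the helper assignGroupedFolds and the toy group vector are hypothetical, and the actual assignment strategy inside KeBABS may differ (for example in how fold sizes are balanced).

```r
## Hypothetical helper, not part of KeBABS: distribute whole groups over
## k folds so that no group is split across folds.
assignGroupedFolds <- function(groupBy, k) {
    groups <- unique(as.character(groupBy))
    ## cross must not exceed the number of groups
    stopifnot(k >= 2, k <= length(groups))
    ## shuffle the groups and deal them to the folds in round-robin order
    shuffled <- groups[sample.int(length(groups))]
    groupFold <- rep(seq_len(k), length.out = length(groups))
    names(groupFold) <- shuffled
    ## every sample inherits the fold of its group
    unname(groupFold[as.character(groupBy)])
}

set.seed(1)
## toy group assignment for 10 samples in 5 groups (made-up values)
groupBy <- c(1, 1, 2, 2, 2, 3, 4, 4, 5, 5)
folds <- assignGroupedFolds(groupBy, k = 3)

## number of distinct folds per group: 1 for every group
foldsPerGroup <- tapply(folds, groupBy, function(f) length(unique(f)))
foldsPerGroup
```

Because whole groups are moved, the folds produced by this sketch can differ in size when the groups do.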

Cross Validation Result

The cross validation error, which is the average of the prediction errors in all held out folds, is used as an estimate for the generalization error of the model associated with the cross validation run. For classification the fraction of incorrectly classified samples and for regression the mean squared error (MSE) is used as prediction error. Multiple cross validation runs can be performed by setting the parameter noCross. The cross validation result can be extracted from the model object returned by cross validation with the cvResult accessor. It contains the mean CV error over all runs, the CV errors of the single runs and the CV error for each fold. The CV result object can be plotted with the method plot, which shows the variation of the CV error over the different runs as a barplot. With the parameter perfParameters in kbsvm the accuracy, the balanced accuracy and the Matthews correlation coefficient can be requested as additional performance parameters to be recorded in the CV result object, which might be of interest especially for unbalanced datasets.
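To make the quantities above concrete, the following base R sketch computes the classification error, accuracy, balanced accuracy and Matthews correlation coefficient for single folds using the standard textbook definitions and then averages them over the folds. The helper foldPerformance, the +1/-1 label coding and the toy predictions are assumptions made for illustration; this is not the KeBABS code.

```r
## Hypothetical helper, not part of KeBABS: performance of one held out
## fold for binary labels coded as +1 / -1 (assumed coding).
foldPerformance <- function(truth, pred) {
    ## confusion matrix counts, converted to double to avoid integer overflow
    tp <- as.numeric(sum(truth ==  1 & pred ==  1))
    tn <- as.numeric(sum(truth == -1 & pred == -1))
    fp <- as.numeric(sum(truth == -1 & pred ==  1))
    fn <- as.numeric(sum(truth ==  1 & pred == -1))
    acc  <- (tp + tn) / length(truth)
    ## balanced accuracy: mean of sensitivity and specificity
    bacc <- (tp / (tp + fn) + tn / (tn + fp)) / 2
    ## MCC is undefined (NaN) if any marginal count is zero
    mcc  <- (tp * tn - fp * fn) /
        sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    c(ERR = 1 - acc, ACC = acc, BACC = bacc, MCC = mcc)
}

## toy predictions for two folds (made-up values for illustration only)
fold1 <- foldPerformance(truth = c(1, 1, -1, -1), pred = c(1, -1, -1, -1))
fold2 <- foldPerformance(truth = c(1, 1, -1, -1), pred = c(1, 1, -1, 1))

## summary values are the means of the fold values
cvSummary <- colMeans(rbind(fold1, fold2))
cvSummary
```

With kbsvm these values do not have to be computed by hand: they are requested through perfParameters and read from the object returned by cvResult.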

References

http://www.bioinf.jku.at/software/kebabs

J. Palme, S. Hochreiter, and U. Bodenhofer (2015) KeBABS: an R package for kernel-based analysis of biological sequences. Bioinformatics (accepted). DOI: 10.1093/bioinformatics/btv176.

Examples

## load transcription factor binding site data
data(TFBS)
enhancerFB
## select a few samples for training - here for demonstration purpose
## normally you would use 70 or 80% of the samples for training and
## the rest for test
## train <- sample(1:length(enhancerFB), length(enhancerFB) * 0.7)
## test <- c(1:length(enhancerFB))[-train]
train <- sample(1:length(enhancerFB), 50)
## create a kernel object for the gappy pair kernel with normalization
gappy <- gappyPairKernel(k=1, m=4)
## show details of kernel object
gappy

## run cross validation with the kernel on C-svc in LiblineaR for cost=10
model <- kbsvm(x=enhancerFB[train], y=yFB[train], kernel=gappy,
               pkg="LiblineaR", svm="C-svc", cost=10, cross=3)

## show cross validation result
cvResult(model)

## Not run: 
# ## perform five cross validation runs
# model <- kbsvm(x=enhancerFB[train], y=yFB[train], kernel=gappy,
#                pkg="LiblineaR", svm="C-svc", cost=10, cross=10, noCross=5)
# 
# ## show cross validation result
# cvResult(model)
# 
# ## plot cross validation result
# plot(cvResult(model))
# 
# 
# ## run Leave-One-Out cross validation
# model <- kbsvm(x=enhancerFB[train], y=yFB[train], kernel=gappy,
#                pkg="LiblineaR", svm="C-svc", cost=10, cross=-1)
# 
# ## show cross validation result
# cvResult(model)
# 
# ## run grouped cross validation with full data
# ## on coiled coil dataset
# ##
# ## In this example the groups were determined through single linkage
# ## clustering of sequence similarities derived from ungapped heptad-specific
# ## pairwise alignment of the sequences. The variable ccgroups contains
# ## the pre-calculated group assignments for the individual sequences.
# data(CCoil)
# ccseq
# head(yCC)
# head(ccgroups)
# gappyK1M6 <- gappyPairKernel(k=1, m=4)
# 
# ## run k-fold CV without groups
# model <- kbsvm(x=ccseq, y=as.numeric(yCC), kernel=gappyK1M6,
#                pkg="LiblineaR", svm="C-svc", cost=10, cross=3, noCross=2,
#                perfObjective="BACC", perfParameters=c("ACC", "BACC"))
# 
# ## show result without groups
# cvResult(model)
# 
# ## run grouped CV
# model <- kbsvm(x=ccseq, y=as.numeric(yCC), kernel=gappyK1M6,
#                pkg="LiblineaR", svm="C-svc", cost=10, cross=3,
#                noCross=2, groupBy=ccgroups, perfObjective="BACC",
#                perfParameters=c("ACC", "BACC"))
# 
# ## show result with groups
# cvResult(model)
# 
# ## For grouped CV the samples in the held out fold are from a group which
# ## is not present in training on the other folds. The similar CV error
# ## with and without groups shows that learning is not just assigning
# ## labels based on similarity within the groups but is focusing on features
# ## that are indicative for the class also in the CV without groups. For the
# ## GCV no information about group membership for the samples in the held
# ## out fold is present in the model. This example should show how GCV
# ## is performed. Because of package size limitations no specific dataset is
# ## available in this package where GCV is necessary.
# ## End(Not run)
