kbsvm,BioVector-method: KeBABS Training Methods

Description

Train an SVM-model with a sequence kernel on biological sequences

Usage

## S3 method for class 'BioVector':
kbsvm(x, y, kernel = NULL, pkg = "auto",
  svm = "C-svc", explicit = "auto", explicitType = "auto",
  featureType = "linear", featureWeights = "auto",
  weightLimit = .Machine$double.eps, classWeights = numeric(0), cross = 0,
  noCross = 1, groupBy = NULL, nestedCross = 0, noNestedCross = 1,
  perfParameters = character(0), perfObjective = "ACC", probModel = FALSE,
  sel = integer(0), features = NULL, showProgress = FALSE,
  showCVTimes = FALSE, runtimeWarning = TRUE,
  verbose = getOption("verbose"), ...)
## S3 method for class 'XStringSet':
kbsvm(x, y, kernel = NULL, pkg = "auto",
  svm = "C-svc", explicit = "auto", explicitType = "auto",
  featureType = "linear", featureWeights = "auto",
  weightLimit = .Machine$double.eps, classWeights = numeric(0), cross = 0,
  noCross = 1, groupBy = NULL, nestedCross = 0, noNestedCross = 1,
  perfParameters = character(0), perfObjective = "ACC", probModel = FALSE,
  sel = integer(0), features = NULL, showProgress = FALSE,
  showCVTimes = FALSE, runtimeWarning = TRUE,
  verbose = getOption("verbose"), ...)
## S3 method for class 'ExplicitRepresentation':
kbsvm(x, y, kernel = NULL, pkg = "auto",
  svm = "C-svc", explicit = "auto", explicitType = "auto",
  featureType = "linear", featureWeights = "auto",
  weightLimit = .Machine$double.eps, classWeights = numeric(0), cross = 0,
  noCross = 1, groupBy = NULL, nestedCross = 0, noNestedCross = 1,
  perfParameters = character(0), perfObjective = "ACC", probModel = FALSE,
  sel = integer(0), showProgress = FALSE, showCVTimes = FALSE,
  runtimeWarning = TRUE, verbose = getOption("verbose"), ...)
## S3 method for class 'KernelMatrix':
kbsvm(x, y, kernel = NULL, pkg = "auto",
  svm = "C-svc", explicit = "no", explicitType = "auto",
  featureType = "linear", featureWeights = "no",
  classWeights = numeric(0), cross = 0, noCross = 1, groupBy = NULL,
  nestedCross = 0, noNestedCross = 1, perfParameters = character(0),
  perfObjective = "ACC", probModel = FALSE, sel = integer(0),
  showProgress = FALSE, showCVTimes = FALSE, runtimeWarning = TRUE,
  verbose = getOption("verbose"), ...)

Arguments

x

multiple biological sequences in the form of a DNAStringSet, RNAStringSet or AAStringSet (or as BioVector). A precomputed kernel matrix (see getKernelMatrix) or a precomputed explicit representation (see getExRep) can be used instead. If they were precomputed with a sequence kernel, this kernel should be specified in the parameter kernel.

y

response vector which contains one value for each sample in 'x'. For classification tasks this can be a character vector, a factor or a numeric vector; for regression tasks it must be a numeric vector. For numeric labels in binary classification the positive class must have the larger value. For factor or character based labels the positive label must be at the first position when sorting the labels in descending order according to the C locale. If the parameter sel is used to perform training with a sample subset, the response vector must have the same length as 'sel'.

kernel

a sequence kernel object or a string kernel from package kernlab. In case of grid search or model selection a list of sequence kernel objects can be passed to training.

pkg

name of the package which contains the SVM implementation to be used for training, e.g. kernlab, e1071 or LiblineaR. For grid search or model selection multiple packages can be passed as a character vector (see also parameter svm below). Default="auto"

svm

name of the SVM used for the classification or regression task, e.g. "C-svc". For gridSearch or model selection multiple SVMs can be passed as character vector. For each entry in this character vector a corresponding entry in the character vector for parameter pkg is required, if multiple SVMs are used in one cross validation or model selection run.

explicit

this parameter controls whether training should be performed with the kernel matrix (see getKernelMatrix) or explicit representation (see getExRep). When the parameter is set to "no" the kernel matrix is used, for "yes" the model is trained from the explicit representation. When set to "auto" KeBABS automatically selects a variant based on runtime heuristics. For training via kernel matrix the dense LIBSVM implementation included in package kebabs is the preferred processing variant. Default="auto"

explicitType

this parameter is only relevant when parameter 'explicit' is different from "no". The values "sparse" and "dense" indicate whether a sparse or dense explicit representation should be used. When the parameter is set to "auto" KeBABS selects a variant. Default="auto"

featureType

when the parameter is set to "linear" single features are used in the analysis (with a linear kernel matrix or a linear kernel applied to the linear explicit representation). When set to "quadratic" the analysis is based on feature pairs. For an SVM from LiblineaR (which does not support kernels) KeBABS generates a quadratic explicit representation. For the other SVMs a polynomial kernel of degree 2 is used for learning via explicit representation. In the case of learning via kernel matrix a quadratic kernel matrix (quadratic here in the sense of a linear kernel matrix with each element taken to the power of 2) is generated. Default="linear"

featureWeights

with the values "no" and "yes" the user can control whether feature weights are calculated as part of the training. When the parameter is set to "auto" KeBABS selects a variant (see below). Default="auto"

weightLimit

the feature weight limit is a single numeric value and allows pruning of feature weights. All feature weights with an absolute value below this limit are set to 0 and are not considered in the model and for further predictions. This parameter is only relevant when featureWeights are calculated in KeBABS during training. Default=.Machine$double.eps

classWeights

a numeric named vector of weights for the different classes, used for asymmetric class sizes. Each element of the vector must be named with one of the class names, but not all class names need to be present. Default=numeric(0), i.e. no class-specific weights (every class has weight 1)

cross

an integer value k > 0 indicates that k-fold cross validation should be performed. A value of -1 is used for Leave-One-Out (LOO) cross validation (see Details). Default=0

noCross

an integer value larger than 0 is used to specify the number of repetitions for cross validation. This parameter is only relevant if 'cross' is different from 0. Default=1

groupBy

allows a grouping of samples during cross validation. The parameter is only relevant when 'cross' is larger than 1. It is an integer vector or factor with the same length as the number of samples used for training and specifies for each sample to which group it belongs. Samples from the same group are never spread over more than one fold. (see crossValidation). Grouped cross validation can also be used in grid search for each grid point. Default=NULL

nestedCross

an integer value k > 0 indicates that a model selection with nested cross validation should be performed with a k-fold outer cross validation. The inner cross validation is defined with the 'cross' parameter (see above). Default=0

noNestedCross

an integer value larger than 0 is used to specify the number of repetitions for the nested cross validation. This parameter is only relevant if 'nestedCross' is larger than 0. Default=1

perfParameters

a character vector with one or several values from the set "ACC", "BACC", "MCC", "AUC" and "ALL". "ACC" stands for accuracy, "BACC" for balanced accuracy, "MCC" for Matthews Correlation Coefficient, "AUC" for area under the ROC curve and "ALL" for all four. This parameter defines which performance parameters are collected in cross validation, grid search and model selection for display purposes. The value "AUC" is currently not supported for multiclass classification. Default=character(0)

perfObjective

a single character string from the set "ACC", "BACC" and "MCC" (see previous parameter). The parameter is only relevant in grid search and model selection and defines which performance measure is used to determine the best performing parameter set. Default="ACC"

probModel

when setting this boolean parameter to TRUE a probability model is determined as part of the training (see below). Default=FALSE

sel

subset of indices into x. When this parameter is present the training is performed for the specified subset of samples only. Default=integer(0)

features

feature subset of the specified kernel in the form of a character vector. When a feature subset is passed to the function all other features in the feature space are not considered for training (see below). A feature subset can only be used when a single kernel object is specified in the 'kernel' parameter. Default=NULL

showProgress

when setting this boolean parameter to TRUE the progress of a cross validation is displayed. The parameter is only relevant for cross validation. Default=FALSE

showCVTimes

when setting this boolean parameter to TRUE the runtimes of the cross validation runs are shown after the cross validation is finished. The parameter is only relevant for cross validation. Default=FALSE

runtimeWarning

when setting this boolean parameter to FALSE a warning for long runtimes will not be shown in case of large feature space dimension or large number of samples. Default=TRUE

verbose

boolean value that indicates whether KeBABS should print additional messages showing the internal processing logic in a verbose manner. The default value depends on the R session verbosity option. Default=getOption("verbose")

...

additional parameters which are passed to SVM training transparently.
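The rule for the positive class given under 'y' can be checked before training. The following base R sketch is only an illustration: positiveLabel is a hypothetical helper that is not part of kebabs, and it assumes that sorting with method = "radix" (which always collates in the C locale) reproduces the ordering described above.

```r
# Hypothetical helper (not part of kebabs): report which label would be
# treated as the positive class under the rule stated for parameter 'y'.
positiveLabel <- function(y) {
  if (is.numeric(y)) {
    # numeric labels: the larger value is the positive class
    return(max(y))
  }
  # character or factor labels: sort in descending order in the C locale;
  # method = "radix" collates in the C locale regardless of session locale
  labels <- unique(as.character(y))
  sort(labels, decreasing = TRUE, method = "radix")[1]
}

positiveLabel(c(-1, 1, 1, -1))
positiveLabel(c("neg", "pos", "neg"))
## uppercase letters sort before lowercase letters in the C locale,
## so "neg" comes first in descending order here
positiveLabel(c("Pos", "neg"))
```

Mixed-case labels are the usual source of surprises with this rule, so it can be worth checking the ordering explicitly when labels come from an external file.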

Value

kbsvm: upon successful completion, the function returns a model of class KBModel. Results for cross validation can be retrieved from this model with the accessor cvResult, results for grid search or model selection with modelSelResult. In case of model selection the results of the outer cross validation loop can be retrieved with the accessor cvResult.

Details

Overview

The kernel-related functionality provided in this package is specifically centered around biological sequences, i.e. DNA-, RNA- or AA-sequences (see also DNAStringSet, RNAStringSet and AAStringSet) and Support Vector Machine (SVM) based methods. Apart from the implementation of the most relevant kernels for sequence analysis (see spectrumKernel, mismatchKernel, gappyPairKernel and motifKernel) KeBABS also provides a framework which allows easy interworking with existing SVM implementations in other R packages. In the current implementation the SVMs provided in the packages kernlab, e1071 and LiblineaR are in focus. Starting with version 1.2.0 KeBABS also contains the dense implementation of LIBSVM which is functionally equivalent to the sparse implementation of LIBSVM in package e1071 but additionally supports dense kernel matrices as the preferred implementation for learning via kernel matrices.

This framework can be considered a "meta-SVM", which provides a simple and unified user interface to these SVMs for classification (binary and multiclass) and regression tasks. The user calls the "meta-SVM" in a classical SVM-like manner by passing sequence data, a sequence kernel with kernel parameters and the SVM which should be used for the learning task together with SVM parameters. KeBABS internally generates the relevant representations (see getKernelMatrix or getExRep) from the sequence data using the specified kernel, adapts parameters and formats to the selected SVM and internally calls the actual SVM implementation in the requested package. KeBABS unifies the result returned from the invoked SVM and returns a unified data structure, the KeBABS model, which also contains the SVM-specific model (see svmModel). The KeBABS model is used in prediction (see predict) to predict the response for new sequence data. On user request the feature weights are computed and stored in the KeBABS model during training (see below). The feature weights are used for the generation of prediction profiles (see getPredictionProfile) which show the importance of sequence positions for a specific learning task.

Training of biological sequences with a sequence kernel

Training is performed via the method kbsvm for classification and regression tasks. The user passes sequence data, the response vector, a sequence kernel object and the requested SVM along with SVM parameters to kbsvm and receives the training results in the form of a KeBABS model object of class KBModel. The accessor svmModel retrieves the SVM-specific model from the KeBABS model object. However, for regular operation a detailed look into the SVM-specific model is usually not necessary.

The standard data formats for sequences in KeBABS are the XStringSet-derived classes DNAStringSet, RNAStringSet and AAStringSet. (When repeat regions are coded as lowercase characters and should be excluded from the analysis, the sequence data can be passed as BioVector, which also supports lowercase characters, instead of XStringSet format. Please note that the classes derived from XStringSet are much more powerful than the BioVector-derived classes and should be used in all cases where lowercase characters are not needed.)

Instead of sequences also a precomputed explicit representation or a precomputed kernel matrix can be used for training. Examples for training with kernel matrix and explicit representation can be found on the help page for the prediction method predict.
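The relationship between the two representations, and the meaning of featureType="quadratic" for kernel matrices, can be illustrated without any sequence data. The following base R sketch uses an invented count matrix in place of a real explicit representation, and the layout of the pairwise features is an assumption made for illustration only; the quadratic explicit representation that KeBABS generates internally may be organized differently.

```r
## toy stand-in for an explicit representation:
## 3 samples (rows) x 4 features (columns), values invented
exRepLinear <- matrix(c(2, 0, 1, 3,
                        1, 1, 0, 2,
                        0, 4, 1, 1),
                      nrow = 3, byrow = TRUE)

## linear kernel matrix: inner products of the feature vectors
kmLinear <- exRepLinear %*% t(exRepLinear)

## "quadratic" kernel matrix in the sense described for featureType:
## each element of the linear kernel matrix taken to the power of 2
kmQuadratic <- kmLinear^2

## explicit representation over all ordered feature pairs
## (illustrative layout, one column per pair)
exRepQuadratic <- t(apply(exRepLinear, 1,
                          function(v) as.vector(outer(v, v))))

## a linear kernel on the pairwise features gives the same matrix
kmFromPairs <- exRepQuadratic %*% t(exRepQuadratic)
all.equal(kmQuadratic, kmFromPairs)
```

The equality holds because the squared inner product of two vectors equals the inner product of their pairwise feature products, which is why both training routes can describe the same model.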

Apart from SVM training kbsvm can also be used for cross validation (see crossValidation and parameters cross and noCross), grid search for SVM- and kernel-parameter values (see gridSearch) and model selection (see modelSelection and parameters nestedCross and noNestedCross).

Package and SVM selection

The user specifies the SVM implementation to be used for a learning task by selecting the package with the pkg parameter and the SVM method in the package with the svm parameter. Currently the packages kernlab, e1071 and LiblineaR are supported. The names for SVM methods vary from package to package and KeBABS provides the following unified names which can be selected across packages. The following table shows the available SVM methods:

SVM name      description
------------  -----------------------------------------------------
C-svc         C classification (with L2 regularization and L1 loss)
l2rl2l-svc    classif. with L2 regularization and L2 loss (dual)
l2rl2lp-svc   classif. with L2 regularization and L2 loss (primal)
l1rl2l-svc    classification with L1 regularization and L2 loss
nu-svc        nu classification
C-bsvc        bound-constraint SVM classification
mc-natC       Crammer, Singer native multiclass
mc-natW       Weston, Watkins native multiclass
one-svc       one class classification
eps-svr       epsilon regression
nu-svr        nu regression
eps-bsvr      bound-constraint svm regression

Pairwise multiclass can be selected for C-svc and nu-svc if the label vector contains more than two classes. For LiblineaR the multiclass implementation is always based on "one against the rest" for all SVMs except for mc-natC which implements native multiclass according to Crammer and Singer. The following table shows which SVM method is available in which package:

SVM name      kernlab  e1071  LiblineaR
------------  -------  -----  ---------
C-svc            x       x        x
l2rl2l-svc       -       -        x
l2rl2lp-svc      -       -        x
l1rl2l-svc       -       -        x
nu-svc           x       x        -
C-bsvc           x       -        -
mc-natC          x       -        x
mc-natW          x       -        -
one-svc          x       x        -
eps-svr          x       x        -
nu-svr           x       x        -
eps-bsvr         x       -        -

SVM parameters

To avoid unnecessary changes of parameter names when switching between SVM implementations in different packages, unified names for identical parameters are available. They are translated by KeBABS to the SVM-specific name. The obvious example is the cost parameter for the C-SVM. It is named C in kernlab and cost in e1071 and LiblineaR. The unified name in KeBABS is cost. If the parameter is passed to kbsvm in a package-specific version it is translated back to the KeBABS name internally. This applies to the following parameters, here shown with their unified names:

parameter name  description
--------------  ------------------------------------------------
cost            cost parameter of C-SVM
nu              nu parameter of nu-SVM
eps             epsilon parameter of eps-SVR and nu-SVR
classWeights    class weights for asymmetrical class size
tolerance       tolerance as termination crit. for optimization
cross           number of folds in k-fold cross validation

Hint: If a tolerance value is specified in kbsvm the same value should be used throughout the complete analysis to make results comparable.

The following table shows the relevance of the SVM parameters cost, nu and eps for the different SVMs:

SVM name      cost  nu  eps
------------  ----  --  ---
C-svc          x    -    -
l2rl2l-svc     x    -    -
l2rl2lp-svc    x    -    -
l1rl2l-svc     x    -    -
nu-svc         -    x    -
C-bsvc         x    -    -
mc-natC        x    -    -
mc-natW        x    -    -
one-svc        x    -    -
eps-svr        -    -    x
nu-svr         -    x    -
eps-bsvr       -    -    x
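The name translation described above for the cost parameter can be pictured as a simple lookup. The sketch below is not the KeBABS implementation: costNameByPackage and toPackageArgs are hypothetical names, and only the mapping stated in the text (C in kernlab, cost in e1071 and LiblineaR) is encoded.

```r
## hypothetical lookup table: package-specific name of the unified
## parameter 'cost', as stated in the text above
costNameByPackage <- c(kernlab = "C", e1071 = "cost", LiblineaR = "cost")

## hypothetical helper: build the argument list for the selected package
toPackageArgs <- function(pkg, cost) {
  args <- list(cost)
  names(args) <- costNameByPackage[[pkg]]
  args
}

names(toPackageArgs("kernlab", cost = 10))
names(toPackageArgs("e1071", cost = 10))
```

In practice the user only passes the unified name (or a package-specific one) to kbsvm and the translation happens internally.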

Hint: Please be aware that identical parameter names between different SVMs do not necessarily mean that their values are also identical between packages; they depend on the actual SVM formulation, which could be different. For example the cost parameter is identical between C-SVMs in packages kernlab, e1071 and LiblineaR but is different from the cost parameter in l2rl2l-svc in LiblineaR because the C-SVM uses a linear loss whereas the l2rl2l-svc uses a quadratic loss.

Feature weights

On user request (see parameter featureWeights) feature weights are computed and stored in the model (for a detailed description see getFeatureWeights). Pruning of feature weights can be achieved with the parameter weightLimit which defines the cutoff for small feature weights not stored in the model.
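The effect of the cutoff can be sketched in base R. The weights below are invented and pruneWeights is a hypothetical helper, not a kebabs function; it only mirrors the rule that weights with an absolute value below weightLimit are dropped.

```r
## invented feature weights, named like dimer features of a
## spectrum kernel with k=2
featureWeightsToy <- c(AA = 0.82, AC = -0.03, AG = 1e-20,
                       AT = -0.61, CA = 0.47)

## hypothetical helper: keep only weights whose absolute value
## reaches the limit
pruneWeights <- function(w, weightLimit = .Machine$double.eps) {
  w[abs(w) >= weightLimit]
}

## default limit removes only numerically negligible weights
pruneWeights(featureWeightsToy)

## a larger limit keeps only the strongest features
pruneWeights(featureWeightsToy, weightLimit = 0.5)
```

A larger limit yields a smaller model, at the price of ignoring weakly weighted features in later predictions.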

Hint: For training with a precomputed kernel matrix feature weights are not available. For multiclass, prediction is currently not performed via feature weights but natively in the SVM.

Cross validation, grid search and model selection

Cross validation can be controlled with the parameters cross and noCross. For details on cross validation see crossValidation. Grid search can be performed by passing multiple SVM parameter values as a vector instead of a single value to kbsvm. Also multiple sequence kernel objects and multiple SVMs can be used for grid search. For details see gridSearch. For model selection nested cross validation is used with the parameters nestedCross and noNestedCross for the outer and cross and noCross for the inner cross validation. For details see modelSelection.

Training with feature subset

After performing feature selection, repeating the learning task with a feature subset can easily be achieved by specifying a feature subset with the parameter features as a character vector. The feature subset must be a subset from the feature space of the sequence kernel passed in the parameter kernel. Grid search and model selection with a feature subset can only be used for a single sequence kernel object in the parameter kernel.

Hint: For normalized kernels all features of the feature space are used for normalization, not just the feature subset. For a normalized motif kernel (see motifKernel) only the features listed in the motif list are part of the feature space. Therefore the motif kernel defined with the same feature subset leads to a different result in the normalized case.

Probability model

SVMs from the packages kernlab and e1071 support the generation of a probability model using Platt scaling (for details see kernlab, predict.ksvm, svm and predict.svm) allowing the computation of class probabilities during prediction. The parameter probModel controls the generation of a probability model during training (see also parameter predictionType in predict).
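To give an idea of what such a probability model does, the following base R sketch applies a fitted sigmoid to SVM decision values. It assumes the LIBSVM-style parameterization with two parameters probA and probB; the parameter values and decision values are invented, and plattProbability is a hypothetical helper rather than a kebabs function.

```r
## hypothetical helper: map decision values to class probabilities
## with a sigmoid of the form 1 / (1 + exp(A * f + B))
plattProbability <- function(decisionValue, probA, probB) {
  1 / (1 + exp(probA * decisionValue + probB))
}

## invented sigmoid parameters; a negative probA means that larger
## decision values give larger probabilities for the positive class
probA <- -2.0
probB <- 0.1

## invented decision values for three samples
decisionValues <- c(-1.5, 0, 1.5)
plattProbability(decisionValues, probA, probB)
```

The sigmoid parameters are fitted during training, which is why the probability model has to be requested when calling kbsvm rather than at prediction time.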

References

http://www.bioinf.jku.at/software/kebabs

J. Palme, S. Hochreiter, and U. Bodenhofer (2015) KeBABS: an R package for kernel-based analysis of biological sequences. Bioinformatics, 31(15):2574-2576. DOI: 10.1093/bioinformatics/btv176.

Examples

## load transcription factor binding site data
data(TFBS)
enhancerFB
## we use 70% of the samples for training and the rest for test
train <- sample(1:length(enhancerFB), length(enhancerFB) * 0.7)
test <- c(1:length(enhancerFB))[-train]
## create the kernel object for dimers without normalization
specK2 <- spectrumKernel(k=2)
## show details of kernel object
specK2

## run training with kernel matrix on e1071 (via the
## dense LIBSVM implementation integrated in kebabs)
model <- kbsvm(x=enhancerFB[train], y=yFB[train], kernel=specK2,
               pkg="e1071", svm="C-svc", C=10, explicit="no")

## show KeBABS model
model
## show class of KeBABS model
class(model)
## show native SVM model contained in KeBABS model
svmModel(model)
## show class of native SVM model
class(svmModel(model))

## examples for package and SVM selection
## now run the same samples with the same kernel on e1071 via
## explicit representation
model <- kbsvm(x=enhancerFB[train], y=yFB[train], kernel=specK2,
               pkg="e1071", svm="C-svc", C=10, explicit="yes")

## show KeBABS model
model
## show native SVM model contained in KeBABS model
svmModel(model)
## show class of native SVM model
class(svmModel(model))

## run the same samples with the same kernel on e1071 with nu-SVM
model <- kbsvm(x=enhancerFB[train], y=yFB[train], kernel=specK2,
               pkg="e1071", svm="nu-svc",nu=0.7, explicit="yes")

## show KeBABS model
model

## training with feature weights
model <- kbsvm(x=enhancerFB[train], y=yFB[train], kernel=specK2,
               pkg="e1071", svm="C-svc", C=10, explicit="yes",
               featureWeights="yes")

## show feature weights
dim(featureWeights(model))
featureWeights(model)[,1:5]

## training without feature weights
model <- kbsvm(x=enhancerFB[train], y=yFB[train], kernel=specK2,
               pkg="e1071", svm="C-svc", C=10, explicit="yes",
               featureWeights="no")

## show feature weights
featureWeights(model)

## pruning of feature weights
model <- kbsvm(x=enhancerFB[train], y=yFB[train], kernel=specK2,
               pkg="e1071", svm="C-svc", C=10, explicit="yes",
               featureWeights="yes", weightLimit=0.5)

dim(featureWeights(model))

## training with precomputed kernel matrix
## feature weights cannot be computed for precomputed kernel matrix
km <- getKernelMatrix(specK2, x=enhancerFB, selx=train)
model <- kbsvm(x=km, y=yFB[train], kernel=specK2,
               pkg="e1071", svm="C-svc", C=10, explicit="no")

## training with precomputed explicit representation
exrep <- getExRep(enhancerFB, sel=train, kernel=specK2)
model <- kbsvm(x=exrep, y=yFB[train], kernel=specK2,
               pkg="e1071", svm="C-svc", C=10, explicit="yes")

## computing of probability model via Platt scaling during training
## in prediction class membership probabilities can be computed
## from this probability model
model <- kbsvm(x=enhancerFB[train], y=yFB[train], kernel=specK2,
               pkg="e1071", svm="C-svc", C=10, explicit="yes",
               probModel=TRUE)

## show parameters of the fitted probability model which are the parameters
## probA and probB for the fitted sigmoid function in case of classification
## and the value sigma of the fitted Laplacian in case of a regression
probabilityModel(model)

## cross validation, grid search and model selection are also performed
## via the kbsvm method. Examples can be found on the respective help pages
## (see Details section)