
mlDNA (version 1.1)

cross_validation: Cross validation method

Description

The ML-based classification model is trained and tested with the N-fold cross validation method.

Usage

cross_validation(seed = 1, method = c("randomForest", "svm", "nnet"), featureMat, positives, negatives, cross = 5, cpus = 1, ...)

Arguments

seed
an integer specifying the random seed used to randomly partition the dataset.
method
a character string specifying the machine learning method. Possible values are "randomForest", "svm" or "nnet".
featureMat
a numeric feature matrix.
positives
a character vector recording positive samples.
negatives
a character vector recording negative samples.
cross
the number of folds for cross validation.
cpus
an integer specifying the number of CPUs used for parallel computing.
...
Further parameters passed on during cross validation; these are the same as the parameters accepted by the classifier function.

Value

A list recording the results from each fold of cross validation. Each element includes the following components (a brief access example follows the list):
positives.train
positive samples used to train prediction model.
negatives.train
negative samples used to train prediction model.
positives.test
positive samples used to test prediction model.
negatives.test
negative samples used to test prediction model.
ml
machine learning method.
classifier
prediction model constructed with the best parameters obtained from training dataset.
positives.train.score
scores of positive samples in training dataset predicted by classifier.
negatives.train.score
scores of negative samples in training dataset predicted by classifier.
positives.test.score
scores of positive samples in testing dataset predicted by classifier.
negatives.test.score
scores of negative samples in testing dataset predicted by classifier.
train.AUC
AUC value of the ML-based classifier on the training dataset.
test.AUC
AUC value of the ML-based classifier on the testing dataset.
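
For example, the components of each fold can be inspected from the returned list as sketched below (a minimal sketch, assuming cvRes holds the list returned by cross_validation, as in the Examples):

foldRes <- cvRes[[1]]                    ## results of the first fold
head(foldRes$positives.train)            ## positive samples used for training
head(foldRes$positives.test.score)       ## predicted scores of positive test samples
foldRes$test.AUC                         ## AUC on the testing dataset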

Details

In machine learning, the cross validation method has been widely used to evaluate the performance of ML-based classification models (classifiers).

For N-fold cross validation, positive and negative samples are randomly partitioned into N groups with approximately equal numbers of samples, and each group is used in turn to test the performance of the ML-based classifier trained with the other N-1 groups of positive and negative samples.
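
As a minimal sketch of the partitioning step (illustrative only, not the package's internal code; positiveSamples is assumed to be a character vector of sample names as in the Examples below):

## split positive samples into 5 folds of approximately equal size
set.seed(1)
cross <- 5
idx <- sample(length(positiveSamples))                    ## shuffle sample order
folds <- split(positiveSamples[idx],
               rep(1:cross, length.out = length(positiveSamples)))
testSet  <- folds[[1]]          ## one fold is held out for testing
trainSet <- unlist(folds[-1])   ## the remaining folds are used for training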

For each round of cross validation, the prediction accuracy of the ML-based classifier is assessed using receiver operating characteristic (ROC) curve analysis. The ROC curve is a two-dimensional plot of the false positive rate (FPR, x-axis) against the true positive rate (TPR, y-axis) at all possible thresholds. The area under the ROC curve (AUC) is used to quantitatively score the prediction accuracy of the ML-based classifier. The AUC value ranges from 0 to 1.0, with a higher AUC value indicating better prediction accuracy of the ML-based classifier.
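
The AUC calculation can be reproduced with the ROCR package; the scores and labels below are made-up illustrative values, not output of mlDNA:

library(ROCR)
scores <- c(0.92, 0.85, 0.71, 0.44, 0.33, 0.15)   ## predicted scores
labels <- c(1, 1, 0, 1, 0, 0)                     ## 1 = positive, 0 = negative
pred <- prediction(scores, labels)
auc  <- performance(pred, "auc")@y.values[[1]]    ## area under the ROC curve
auc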

After all N groups have been used in turn as the testing set, the N sets of (FPR, TPR) pairs are imported into the R package ROCR to visualize the ROC curves. The mean of the N AUC values is then computed as the overall performance of the ML-based classification model.
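
A corresponding sketch for plotting the ROC curve and averaging per-fold AUCs (pred is the ROCR prediction object from the sketch above; aucVec is a hypothetical vector of AUC values from N folds):

perf <- performance(pred, "tpr", "fpr")            ## TPR versus FPR at all thresholds
plot(perf, main = "ROC curve")
abline(0, 1, lty = 2)                              ## diagonal reference line
aucVec <- c(0.91, 0.88, 0.93, 0.90, 0.89)          ## e.g. AUC values of five folds
mean(aucVec)                                       ## overall performance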

Examples


## Not run: 
# 
#    ##generate expression feature matrix
#    sampleVec1 <- c(1, 2, 3, 4, 5, 6)
#    sampleVec2 <- c(1, 2, 3, 4, 5, 6)
#    featureMat <- expFeatureMatrix( expMat1 = ControlExpMat, sampleVec1 = sampleVec1, 
#                                    expMat2 = SaltExpMat, sampleVec2 = sampleVec2, 
#                                    logTransformed = TRUE, base = 2,
#                                features = c("zscore", "foldchange", "cv", "expression"))
# 
#    ##positive samples
#    positiveSamples <- as.character(sampleData$KnownSaltGenes)
#    ##unlabeled samples
#    unlabelSamples <- setdiff( rownames(featureMat), positiveSamples )
#    idx <- sample(length(unlabelSamples))
#    ##randomly selecting a set of unlabeled samples as negative samples
#    negativeSamples <- unlabelSamples[idx[1:length(positiveSamples)]]
# 
#    ##five-fold cross validation
#    seed <- randomSeed() #generate a random seed
#    cvRes <- cross_validation(seed = seed, method = "randomForest", 
#                              featureMat = featureMat, 
#                              positives = positiveSamples, 
#                              negatives = negativeSamples, 
#                              cross = 5, cpus = 1,
#                              ntree = 100 ) ##parameters for random forest algorithm
# 
#    ##get AUC values for five rounds of cross validation
#    aucVec <- rep(0, 5) 
#    for( i in 1:5 ) 
#      aucVec[i] <- cvRes[[i]]$test.AUC
#   
#    
#    ##average AUC values as the final performance of the ML-based classifier
#    mean(aucVec)
# 
#  
# ## End(Not run)
