rModeling (version 0.0.3)

crossValidation: Conduct cross-validation

Description

Conduct cross-validation for a given classification/regression model and output the prediction results collected over the cross-validation loop. The cross-validation can be done in two ways: normal k-fold cross-validation (batch=NULL) or batch-wise cross-validation (batch!=NULL). The latter is particularly useful in the presence of significant intra-group heterogeneity.
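
The difference between the two partitioning schemes can be illustrated with a minimal base-R sketch. It does not call rModeling; the toy batch vector and the helper batches_per_fold are assumptions made up for illustration, and the package's own splitting (see dataSplit) may differ in detail.

```r
# Toy data: 12 samples from 4 batches (e.g., 4 patients, 3 spectra each).
batch <- rep(c("P1", "P2", "P3", "P4"), each = 3)
n <- length(batch)

# Normal k-fold: shuffle the sample indices and deal them into k folds.
set.seed(1)
k <- 3
kfold <- split(sample(n), rep_len(seq_len(k), n))

# Batch-wise: all samples sharing a batch ID land in the same fold,
# so a batch is never split between training and testing data.
bfold <- split(seq_len(n), batch)

# Hypothetical helper: number of distinct batches present in each fold.
batches_per_fold <- function(folds) {
  sapply(folds, function(i) length(unique(batch[i])))
}

print(batches_per_fold(kfold))  # random folds usually mix several batches
print(batches_per_fold(bfold))  # exactly one batch per fold
```

With random folds, spectra of the same patient can end up on both sides of the split; the batch-wise folds avoid this by construction.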

Usage

crossValidation(data, label, batch = NULL, 
                method = lda, pred = predict, classify = TRUE, 
                folds = NULL, nBatch = 0, nFold = 10, 
                verbose = TRUE, seed = NULL, ...)

Arguments

data

a data matrix, with samples saved in rows and features in columns.

label

a vector of response variables (i.e., group/concentration info), must be the same length as the number of samples.

batch

a vector of sample identifiers (e.g., batch/patient ID); must have the same length as the number of samples. Ideally, this should identify the samples at the highest level of the hierarchy (e.g., the patient ID rather than the spectral ID). If missing, a normal k-fold cross-validation will be performed (i.e., the data is split randomly into k folds). Ignored if folds is given.

method

the name of the function to be applied to the training data (can be any model-based procedure, such as classification/regression or even pre-processing). A user-defined function is possible; see fnPcaLda for an example.

pred

the name of the function to be applied to the testing data (e.g., new substances) based on the model built by method. A user-defined function is possible; see predPcaLda for an example.

classify

a boolean value; classify=TRUE means a classification task, otherwise a regression task. It is used by the function predSummary.

folds

a list of indices specifying the sample indices to be used in each fold; can be the output of the function dataSplit. If missing, a data split will be done first before performing cross-validation.

nBatch

an integer, the number of data folds in case of batch-wise cross-validation (if nBatch=0, each batch will be used as one fold). Ignored if folds is given or if batch is missing.

nFold

an integer, the value of k in case of normal k-fold cross-validation. Ignored if folds or batch is given.

verbose

a boolean value, whether to print the logging info.

seed

an integer, if given, will be used as the random seed to split the data in case of k-fold cross-validation. Ignored if batch or folds is given.

...

parameters to be passed to the method

Value

A list with elements

Fold

a list, each element giving the sample indices of a fold

True

a vector of characters, the ground-truth response variables, collected for each fold when it is used as testing data

Pred

a vector of characters, the results from prediction, collected for each fold when it is used as testing data

Summ

a list, the output of function predSummary. A confusion matrix (if classify=TRUE) from confusionMatrix or RMSE (if classify=FALSE) calculated from each fold being predicted.
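
As a rough illustration of the two kinds of summary, the following base-R sketch computes a confusion matrix and an RMSE from made-up true and predicted values. It uses table() rather than confusionMatrix, so the structure of the result is not the same as what predSummary returns; all variable names and values are assumptions.

```r
# Classification: cross-tabulate true against predicted labels.
true_cls <- c("A", "A", "B", "B", "B", "C")
pred_cls <- c("A", "B", "B", "B", "C", "C")
lev <- sort(unique(c(true_cls, pred_cls)))
conf_mat <- table(True = factor(true_cls, levels = lev),
                  Pred = factor(pred_cls, levels = lev))
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)  # 4 of 6 correct

# Regression: root-mean-square error between true and predicted values.
true_val <- c(1.0, 2.0, 3.0, 4.0)
pred_val <- c(1.5, 2.0, 2.5, 4.0)
rmse <- sqrt(mean((true_val - pred_val)^2))      # sqrt(0.125)

print(conf_mat)
print(accuracy)
print(rmse)
```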

Details

The cross-validation is conducted on the data partitions given in folds: each fold is predicted once using the model built on the remaining folds. If folds is missing, a data split will be done first (see dataSplit for details).
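
The loop described above can be sketched in base R as follows. This is not the package's implementation: fit_centroid and pred_centroid are hypothetical stand-ins for method and pred (a simple nearest-centroid classifier), and the built-in iris data replaces real spectra.

```r
# Hypothetical `method`: one centroid (row of feature means) per class.
fit_centroid <- function(x, y) {
  do.call(rbind, lapply(split(seq_along(y), y),
                        function(i) colMeans(x[i, , drop = FALSE])))
}

# Hypothetical `pred`: assign each sample to the nearest centroid.
pred_centroid <- function(cen, x) {
  # Squared Euclidean distances, one column per class.
  d <- matrix(apply(cen, 1, function(m) rowSums(sweep(x, 2, m)^2)),
              nrow = nrow(x))
  rownames(cen)[max.col(-d, ties.method = "first")]
}

x <- as.matrix(iris[, 1:4])
y <- as.character(iris$Species)

set.seed(42)
folds <- split(sample(nrow(x)), rep_len(1:5, nrow(x)))

true_all <- character(0)
pred_all <- character(0)
for (test in folds) {
  # Build the model on all folds except the current one ...
  model <- fit_centroid(x[-test, , drop = FALSE], y[-test])
  # ... and predict the held-out fold exactly once.
  pred_all <- c(pred_all, pred_centroid(model, x[test, , drop = FALSE]))
  true_all <- c(true_all, y[test])
}

acc <- mean(pred_all == true_all)
print(acc)
```

Collecting true_all and pred_all across folds mirrors the True and Pred elements of the returned list.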

The procedures to be performed within the cross-validation are given by the function method, for example, fnPcaLda. A user-defined function is possible, as long as it follows the same structure as fnPcaLda. A two-layer cross-validation (see reference) can be done by using a tuning function as method, such as tunePcaLda (see examples). In this case, the parameters of a classifier are optimized using the training data within tunePcaLda, and the optimal model is tested on the testing data. The parameters of pre-processing can be optimized in a similar way by including the pre-processing steps in the function method.
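
The idea of a two-layer cross-validation can be sketched in base R as below. This is an illustration under stated assumptions, not what tunePcaLda does internally: the classifier is PCA followed by nearest centroid instead of LDA, the data is iris, and every function name here (fit_pca_centroid, pred_pca_centroid, cv_acc, make_folds) is hypothetical.

```r
x <- as.matrix(iris[, 1:4])
y <- as.character(iris$Species)

# Hypothetical model: PCA fitted on training data only, then class centroids
# in the space of the first nPC principal components.
fit_pca_centroid <- function(x, y, nPC) {
  pca <- prcomp(x, center = TRUE, scale. = FALSE)
  sc <- pca$x[, seq_len(nPC), drop = FALSE]
  cen <- do.call(rbind, lapply(split(seq_along(y), y),
                               function(i) colMeans(sc[i, , drop = FALSE])))
  list(pca = pca, nPC = nPC, cen = cen)
}

# Project new samples with the training PCA and pick the nearest centroid.
pred_pca_centroid <- function(model, x) {
  sc <- predict(model$pca, x)[, seq_len(model$nPC), drop = FALSE]
  d <- matrix(apply(model$cen, 1, function(m) rowSums(sweep(sc, 2, m)^2)),
              nrow = nrow(sc))
  rownames(model$cen)[max.col(-d, ties.method = "first")]
}

make_folds <- function(n, k) split(sample(n), rep_len(seq_len(k), n))

# Cross-validated accuracy for a fixed number of components.
cv_acc <- function(x, y, folds, nPC) {
  hits <- 0
  for (test in folds) {
    m <- fit_pca_centroid(x[-test, , drop = FALSE], y[-test], nPC)
    hits <- hits + sum(pred_pca_centroid(m, x[test, , drop = FALSE]) == y[test])
  }
  hits / length(y)
}

set.seed(7)
grid <- 2:4                        # candidate numbers of components
outer_folds <- make_folds(nrow(x), 5)
pred_all <- character(nrow(x))
chosen <- integer(0)

for (test in outer_folds) {
  xtr <- x[-test, , drop = FALSE]
  ytr <- y[-test]
  # Inner layer: tune nPC using the outer training data only.
  inner_folds <- make_folds(nrow(xtr), 3)
  inner_acc <- sapply(grid, function(k) cv_acc(xtr, ytr, inner_folds, k))
  best <- grid[which.max(inner_acc)]
  chosen <- c(chosen, best)
  # Outer layer: refit with the selected nPC and predict the held-out fold.
  model <- fit_pca_centroid(xtr, ytr, best)
  pred_all[test] <- pred_pca_centroid(model, x[test, , drop = FALSE])
}

outer_acc <- mean(pred_all == y)
print(chosen)
print(outer_acc)
```

The held-out fold of the outer loop never enters the inner tuning, which keeps the outer accuracy estimate independent of the parameter selection.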

NOTE: It is recommended to specify the seed for a normal k-fold cross-validation in order to get the same results from repeated runs.
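
The effect of a fixed seed on a random partition can be checked with a short base-R sketch. The helper make_folds is hypothetical and only mimics a random k-fold split; it is not the package's dataSplit.

```r
# Hypothetical helper: random k-fold partition with an optional seed.
make_folds <- function(n, k, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  split(sample(n), rep_len(seq_len(k), n))
}

a <- make_folds(20, 4, seed = 123)
b <- make_folds(20, 4, seed = 123)

# Same seed, same partition, hence the same cross-validation results.
print(identical(a, b))  # TRUE
```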

References

S. Guo, T. Bocklitz, et al., Common mistakes in cross-validating classification models. Analytical Methods 2017, 9 (30): 4410-4417.

See Also

dataSplit

Examples

# NOT RUN {
  data(DATA)
  ### perform batch-wise cross-validation using the function fnPcaLda
  RES3 <- crossValidation(data=DATA$spec
                          ,label=DATA$labels
                          ,batch=DATA$batch
                          ,method=fnPcaLda
                          ,pred=predPcaLda
                          ,folds=NULL 
                          ,nBatch=0
                          ,nFold=3
                          ,verbose=TRUE     
                          ,seed=NULL
                          
                          ### parameters to be passed to fnPcaLda
                          ,center=TRUE
                          ,scale=FALSE
   )


   ### perform a two-layer cross-validation using the function tunePcaLda,
   ### where the number of principal components used for LDA is optimized 
   ### (i.e., internal cross-validation).
   RES4 <- crossValidation(data=DATA$spec
                          ,label=DATA$labels				    
                          ,batch=DATA$batch	
                          ,method=tunePcaLda 
                          ,pred=predPcaLda     
                          ,folds=NULL      
                          ,nBatch=0			    
                          ,nFold=3					
                          ,verbose=TRUE     
                          ,seed=NULL
                          
                          ### parameters to be passed to tunePcaLda
                          ,nPC=2:4
                          ,cv=c('CV', 'BV')[2]
                          ,nPart=0
                          ,optMerit=c('Accuracy', 'Sensitivity')[2]
                          ,center=TRUE
                          ,scale=FALSE
  )
# }