dataSplit: Split a dataset into multiple folds

Description

The whole dataset is split into multiple folds, either randomly (batch=NULL) or according to the batch information (batch is specified). In the random case, the number of folds is defined by nFold. In the batch-wise case, the data belonging to each batch is used as one fold if nBatch=0; otherwise the dataset is split into nBatch folds according to the batch information, so that all data from the same batch falls into exactly one fold.
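The two splitting modes described above can be sketched in base R as follows. This is an illustrative re-implementation under stated assumptions, not the package source: the helper name split_folds_sketch is hypothetical, and the round-robin assignment of batches to folds when nBatch > 0 is an assumption, since the description does not say how batches are grouped.

```r
# Hypothetical helper for illustration only; not the actual dataSplit code.
split_folds_sketch <- function(ixData, batch = NULL, nBatch = 0,
                               nFold = 10, seed = NULL) {
  if (is.null(batch)) {
    # Random k-fold: shuffle the indices, then deal them into nFold groups.
    if (!is.null(seed)) set.seed(seed)
    shuffled <- ixData[sample.int(length(ixData))]
    fold_id <- rep(seq_len(nFold), length.out = length(shuffled))
    return(unname(split(shuffled, fold_id)))
  }
  stopifnot(length(batch) == length(ixData))
  batches <- unique(batch)
  if (nBatch == 0) {
    # One fold per batch.
    group <- match(batch, batches)
  } else {
    # Whole batches are assigned to nBatch folds
    # (round-robin here, which is an assumption).
    fold_of_batch <- rep(seq_len(nBatch), length.out = length(batches))
    group <- fold_of_batch[match(batch, batches)]
  }
  unname(split(ixData, group))
}

ix <- 1:12
patient <- rep(c("A", "B", "C", "D"), each = 3)

folds_random <- split_folds_sketch(ix, nFold = 3, seed = 1)          # 3 random folds
folds_batch  <- split_folds_sketch(ix, batch = patient)              # 4 folds, one per patient
folds_merged <- split_folds_sketch(ix, batch = patient, nBatch = 2)  # 2 folds, two patients each
```

In both batch-wise calls no patient appears in more than one fold, which is the property the batch argument is meant to guarantee.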

Usage

dataSplit(ixData, batch = NULL, 
           nBatch = 0, nFold = 10, 
           verbose = TRUE, seed = NULL)

Arguments

ixData

a vector of integers giving the indices of the spectra.

batch

a vector of sample identifiers (e.g., batch or patient ID) with the same length as ixData. Ideally, this should identify the samples at the highest level of the hierarchy (e.g., the patient ID rather than the spectrum ID). If NULL (the default), the data is split randomly into nFold folds.

nBatch

an integer, the number of folds for batch-wise cross-validation. If nBatch=0, each batch is used as one fold. Ignored if batch is not given.

nFold

an integer, the number of folds for ordinary k-fold cross-validation. Ignored if batch is given.

verbose

a logical value indicating whether to print logging information.

seed

an integer; if given, it is used as the random seed for splitting the data in k-fold cross-validation. Ignored if batch is given.

Value

a list in which each element contains the indices of the samples belonging to one fold.
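A list of this shape is typically consumed by holding out one fold at a time. The sketch below uses a hand-written, hypothetical fold list in place of real dataSplit output; the model fitting step is left as a comment because it depends on the application.

```r
# Hypothetical fold list with the same shape as the documented return value.
folds <- list(c(1L, 2L, 3L), c(4L, 5L, 6L), c(7L, 8L, 9L))
all_ix <- unlist(folds)

for (k in seq_along(folds)) {
  test_ix  <- folds[[k]]                # indices held out in this round
  train_ix <- setdiff(all_ix, test_ix)  # all remaining indices
  # Fit the model on train_ix and evaluate it on test_ix here.
  cat(sprintf("fold %d: %d training, %d test indices\n",
              k, length(train_ix), length(test_ix)))
}
```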

References

S. Guo, T. Bocklitz, et al., Common mistakes in cross-validating classification models. Analytical Methods 2017, 9 (30), 4410-4417.