tdmReadAndSplit: Read and split the task data.

Description

Read the task data using tdmReadDataset, split it into a test part and a training/validation part, and return a TDMdata object.

Usage

tdmReadAndSplit(opts, tdm, nExp = 0, dset = NULL)

Arguments

opts

a list from which we need here the elements

READ.INI: [T] =T: do read and split, =F: return NULL
READ.*: other settings for tdmReadDataset
filename: needed for tdmReadDataset
filetest: needed for tdmReadDataset
TST.testFrac: [0.1] set this fraction of the data aside for testing
TST.COL: string with name for the partitioning column, if tdm$umode is not "SP_T". (If tdm$umode=="SP_T", then TST.COL="tdmSplit" is used.)

tdm

a list from which we need here the elements

mainFile: if not NULL, set working dir to dir(mainFile) before executing tdmReadDataset
umode: [ "RSUB" | "CV" | "TST" | "SP_T" ], how to divide in training/validation data for tuning and test data for the unbiased runs
SPLIT.SEED: if NULL, set the random number generator (RNG) to tdmRandomSeed when constructing dataObj. If not NULL, set the RNG to SPLIT.SEED + nExp --> deterministic test set split
stratified: [NULL] string specifying the column with the response variable for classification. If not NULL, do the split by stratified sampling (at least one record of each class level found in dset[,tdm$stratified] shall appear in the train-vali-set). Recommended for classification.
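
The interplay of SPLIT.SEED, nExp and stratified can be illustrated with a minimal base-R sketch. It is not the TDMR implementation: the helper stratifiedTestFlag and its arguments are hypothetical, and the rule of capping the per-class test count at (class size - 1) is an assumption chosen so that every class level keeps at least one train/vali record.

```r
# Hypothetical helper, NOT part of TDMR: returns a 0/1 vector marking test records.
stratifiedTestFlag <- function(dset, respCol, testFrac, splitSeed = NULL, nExp = 0) {
  # A fixed seed plus the experiment counter gives a reproducible split
  # that still differs from experiment to experiment.
  if (!is.null(splitSeed)) set.seed(splitSeed + nExp)
  flag <- integer(nrow(dset))              # 0 = train/vali, 1 = test
  # Draw the test records separately within each class level.
  for (idx in split(seq_len(nrow(dset)), dset[[respCol]])) {
    # Assumption: never move all records of a class into the test set.
    nTest <- min(round(testFrac * length(idx)), length(idx) - 1)
    if (nTest > 0) flag[idx[sample.int(length(idx), nTest)]] <- 1L
  }
  flag
}

# Same seed and same experiment counter: identical split.
flagA <- stratifiedTestFlag(iris, "Species", 0.1, splitSeed = 42, nExp = 1)
flagB <- stratifiedTestFlag(iris, "Species", 0.1, splitSeed = 42, nExp = 1)
```

Calling the helper with a different nExp changes the seed and therefore, in general, the selected test records.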

nExp

[0] experiment counter, used to select a different but reproducible seed for each experiment if tdm$SPLIT.SEED is not NULL

dset

[NULL] if non-NULL, reading of dset is skipped and the given data frame dset is used.

Value

dataObj, either NULL (if opts$READ.INI==FALSE) or an object of class TDMdata containing

dset

a data frame with the complete data set

TST.COL

string, the name of the column in dset which has a 1 for records belonging to the test set and a 0 for train/vali records. If tdm$umode=="SP_T", then TST.COL="tdmSplit", else TST.COL=opts$TST.COL.

filename

opts$filename, from where the data were read

Use the accessor functions dsetTrnVa.TDMdata and dsetTest.TDMdata to extract the train/vali and the test data, resp., from dataObj. Known caller: tdmBigLoop
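
How the TST.COL element relates to the two data parts can be sketched in base R. The list below is only a stand-in for a TDMdata object, and the helpers trnVaOf and testOf are hypothetical; they mimic what the accessor functions are described to return and are not the package's code.

```r
# Illustrative stand-in for dataObj; NOT built by the package's constructor.
dataObj <- list(
  dset     = data.frame(x = 1:6, tdmSplit = c(0, 0, 1, 0, 1, 0)),
  TST.COL  = "tdmSplit",
  filename = "example.csv"    # hypothetical file name
)

# Hypothetical helpers: select records by the 0/1 value in the TST.COL column.
trnVaOf <- function(obj) obj$dset[obj$dset[[obj$TST.COL]] == 0, , drop = FALSE]
testOf  <- function(obj) obj$dset[obj$dset[[obj$TST.COL]] == 1, , drop = FALSE]

trnVa <- trnVaOf(dataObj)     # records flagged 0 (train/vali)
tst   <- testOf(dataObj)      # records flagged 1 (test)
```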

Details

If dset is NULL, the files specified in opts are read into dset, see tdmReadDataset for details. Then, depending on the value of tdm$umode:

"SP_T": split the data randomly into training and test data with test set fraction according to opts$TST.testFrac. Make use of tdm$SPLIT.SEED and tdm$stratified, if given. Set TST.COL to "tdmSplit".
"RSUB", "CV": use all data for training/validation. That is, the training-validation split is done later in tdmClassifyLoop or tdmRegressLoop.
"TST": split the data into training and test data according to column opts$TST.COL (usually "TST.COL"), which carries a 1 for each test record and a 0 otherwise. If opts$filetest is specified, then all records from this file will carry a 1 in opts$TST.COL. All records from opts$filename carry a 0.

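The "TST" case with a separate test file can be pictured with the following base-R sketch. It only illustrates the 0/1 flagging described above; the two in-memory data frames stand in for the contents of opts$filename and opts$filetest, and all variable names are assumptions rather than TDMR internals.

```r
# Stand-ins for the records read from opts$filename and opts$filetest.
trainPart <- data.frame(x = c(1.2, 3.4, 5.6), y = c("a", "b", "a"))
testPart  <- data.frame(x = c(7.8, 9.0),      y = c("b", "a"))

tstCol <- "TST.COL"           # plays the role of opts$TST.COL

# Records from the main file get a 0, records from the test file get a 1.
trainPart[[tstCol]] <- 0
testPart[[tstCol]]  <- 1

# One combined data frame, as held in the dset element of the result.
dset <- rbind(trainPart, testPart)
```

Keeping both parts in one data frame with a flag column means a single object can be passed around, and the two parts can be recovered at any time by filtering on that column.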