Read the task data using tdmReadDataset, split them into a test part and
a training/validation part, and return a TDMdata object.
tdmReadAndSplit(opts, tdm, nExp = 0, dset = NULL)
opts
: a list from which the following elements are used here:
READ.INI
: [TRUE] if TRUE, read and split the data; if FALSE, return NULL
READ.*
: other settings for tdmReadDataset
filename
: needed for tdmReadDataset
filetest
: needed for tdmReadDataset
TST.testFrac
: [0.1] set this fraction of the data aside for testing
TST.COL
: string with name for the partitioning column, if tdm$umode is not "SP_T".
(If tdm$umode=="SP_T", then TST.COL="tdmSplit" is used.)
tdm
: a list from which the following elements are used here:
mainFile
: if not NULL, set the working directory to the directory containing mainFile
before executing tdmReadDataset
umode
: [ "RSUB" | "CV" | "TST" | "SP_T" ], how to divide the data into training/validation data for tuning
and test data for the unbiased runs
SPLIT.SEED
: if NULL, set the random number generator (RNG) to tdmRandomSeed
when constructing dataObj. If not NULL, set the RNG to SPLIT.SEED + nExp,
which gives a deterministic test set split
stratified
: [NULL] string specifying the column with the response variable for classification.
If not NULL, do the split by stratified sampling (at least one record of each class level
found in dset[,tdm$stratified] shall appear in the train-vali-set). Recommended for classification.
nExp
: [0] experiment counter, used to select a reproducible different seed, if tdm$SPLIT.SEED!=NULL
dset
: [NULL] if non-NULL, reading of dset is skipped and the given data frame dset is used.
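The interplay of opts$TST.testFrac, tdm$SPLIT.SEED, nExp and tdm$stratified can be illustrated with a small base-R sketch. The helper markRandomSplit below is hypothetical and not part of TDMR. It only mimics the documented behaviour (RNG seeded with SPLIT.SEED + nExp, optional stratified sampling that keeps at least one record per class level for training/validation); the package's actual sampling procedure may differ. The option values and the use of the built-in iris data are illustrative assumptions.

```r
# Hypothetical helper (not a TDMR function): mark a random test split with
# 1 = test record, 0 = train/vali record in a new column "tdmSplit".
markRandomSplit <- function(dset, opts, tdm, nExp = 0) {
  # A non-NULL SPLIT.SEED makes the split reproducible per experiment
  if (!is.null(tdm$SPLIT.SEED)) set.seed(tdm$SPLIT.SEED + nExp)
  isTest <- rep(0, nrow(dset))
  if (is.null(tdm$stratified)) {
    nTest <- round(opts$TST.testFrac * nrow(dset))
    isTest[sample.int(nrow(dset), nTest)] <- 1
  } else {
    # Draw the test fraction separately within each class level
    for (idx in split(seq_len(nrow(dset)), dset[[tdm$stratified]])) {
      # Leave at least one record of this level for training/validation
      nTest <- min(round(opts$TST.testFrac * length(idx)), length(idx) - 1)
      if (nTest > 0) isTest[idx[sample.int(length(idx), nTest)]] <- 1
    }
  }
  dset$tdmSplit <- isTest
  dset
}

opts <- list(TST.testFrac = 0.2)                       # illustrative value
tdm  <- list(SPLIT.SEED = 42, stratified = "Species")  # illustrative values

a <- markRandomSplit(iris, opts, tdm, nExp = 0)
b <- markRandomSplit(iris, opts, tdm, nExp = 0)   # same seed as a
identical(a$tdmSplit, b$tdmSplit)                 # TRUE: same seed, same split
table(a$Species, a$tdmSplit)                      # test records per class level
```

Passing a different nExp changes the seed, so repeated experiments can use different but reproducible test sets.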
dataObj, either NULL (if opts$READ.INI==FALSE) or an object of class TDMdata containing
dset
: a data frame with the complete data set
TST.COL
: string, the name of the column in dset which has a 1 for
records belonging to the test set and a 0 for train/vali records. If tdm$umode=="SP_T", then
TST.COL="tdmSplit", else TST.COL=opts$TST.COL.
filename
: opts$filename, from where the data were read
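How the indicator column partitions the data can be sketched with a mock-up of the returned object. The plain list below only mimics the documented components and is not a real TDMdata object; the data values and the file name are made up for illustration. For real objects, the dsetTrnVa.TDMdata and dsetTest.TDMdata functions named on this page are presumably the intended accessors.

```r
# Mock-up only: a plain list with the three documented components
dset <- data.frame(x = 1:6,
                   y = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2),
                   tdmSplit = c(0, 0, 1, 0, 1, 0))
dataObj <- list(dset = dset,
                TST.COL = "tdmSplit",
                filename = "mydata.csv")   # hypothetical file name

# Partition by the indicator column: 1 = test, 0 = train/vali
isTest    <- dataObj$dset[[dataObj$TST.COL]] == 1
dsetTest  <- dataObj$dset[isTest, ]
dsetTrnVa <- dataObj$dset[!isTest, ]
```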
If dset is NULL, the files specified in opts are read into dset, see
tdmReadDataset for details. Then, depending on the value of tdm$umode:
"SP_T"
: split the data randomly into training and test data with test
set fraction according to opts$TST.testFrac. Make use of tdm$SPLIT.SEED
and tdm$stratified, if given. Set TST.COL to "tdmSplit".
"RSUB", "CV"
: use all data for training/validation. That is, the
training-validation split is done later in tdmClassifyLoop or tdmRegressLoop.
"TST"
: split the data into training and test data according to column
opts$TST.COL (usually "TST.COL"), which carries a 1 for each test record and a 0 otherwise.
If opts$filetest is specified, then all records from this file will
carry a 1 in opts$TST.COL. All records from opts$filename carry a 0.
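For the "TST" case, the effect of opts$filetest on the indicator column can be pictured with two small stand-in data frames. This is only an illustration in base R, assuming both files share the same columns; the data values are made up and this is not the code path of tdmReadDataset.

```r
# Stand-ins for the records read from opts$filename and opts$filetest
trainPart <- data.frame(x = 1:4, y = c(1.0, 2.1, 2.9, 4.2))
testPart  <- data.frame(x = 5:6, y = c(5.1, 5.8))

tstCol <- "TST.COL"          # the usual value of opts$TST.COL
trainPart[[tstCol]] <- 0     # records from opts$filename carry a 0
testPart[[tstCol]]  <- 1     # records from opts$filetest carry a 1

# One combined data set; the indicator column encodes the split
dset <- rbind(trainPart, testPart)
```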
See also: dsetTrnVa.TDMdata, dsetTest.TDMdata, tdmReadDataset, tdmBigLoop