TDMR (version 2.2)

tdmModCreateCVindex: Create and return a training-validation-set index vector.

Description

Depending on the value of the member TST.kind in the list opts, the returned index cvi is one of the following:

  1. TST.kind="cv": a random cross validation index P([111...222...333...]) - or -

  2. TST.kind="rand": a random index P([00...11...-1-1...]) marking training (0), validation (1) and disregard (-1) cases - or -

  3. TST.kind="col": the column dset[,opts$TST.COL] defines the division into training (0), validation (1) and disregard (-1) sets; all records with a value <0 in column TST.COL are disregarded.

Here P(.) denotes a random permutation of the sequence. The disregard set is optional, i.e. cvi may contain only 0 and 1 if desired.

Special case TST.kind="cv" and TST.NFOLD=1: *every* record becomes a training record, i.e. the index is [000...].

If TST.kind="rand" and stratified=TRUE, a stratified sample is drawn: the strata in the training set reflect the relative frequency of each level of the **1st** response variable, and each stratum is guaranteed to contain at least one record.

In summary, TST.kind="cv" means cross validation (TST.NFOLD models are built on TST.NFOLD different training-validation data sets), while TST.kind="rand" or "col" means that a single model is built with a random ("rand") or user-defined ("col") training-validation split.
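To make the index layouts concrete, the following base-R sketch builds vectors with the same shape as the "cv" and "rand" indices described above. It is only an illustration and not the TDMR implementation: the variable names (n, nfold, vali_frac, trn_frac) are hypothetical stand-ins for the corresponding opts entries, and the rounding of the set sizes is an assumption.

```r
# Illustrative sketch in base R; NOT the TDMR implementation.
set.seed(42)                      # make the permutation reproducible
n     <- 10                       # hypothetical number of records in dset
nfold <- 3                        # stands in for opts$TST.NFOLD

# TST.kind="cv": fold labels 1..nfold in roughly equal numbers, randomly permuted
cvi_cv <- sample(rep(seq_len(nfold), length.out = n))

# TST.kind="rand": 0 = training, 1 = validation, -1 = disregard
vali_frac <- 0.2                  # stands in for opts$TST.valiFrac
trn_frac  <- 0.6                  # stands in for opts$TST.trnFrac
n_vali <- round(vali_frac * n)    # assumption: set sizes are obtained by rounding
n_trn  <- round(trn_frac * n)
cvi_rand <- sample(c(rep(0, n_trn), rep(1, n_vali), rep(-1, n - n_trn - n_vali)))

print(table(cvi_cv))              # fold sizes
print(table(cvi_rand))            # counts of disregard / training / validation
```

With trn_frac + vali_frac < 1 the remaining records end up in the disregard set; with trn_frac = 1 - vali_frac the index contains only 0 and 1.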

Usage

tdmModCreateCVindex(dset, response.variables, opts, stratified = FALSE)

Arguments

dset

the data frame for which cvi is needed

response.variables

the response variable(s). A warning is issued if length(response.variables)>1; only the first response variable is used to determine the strata sizes.

opts

a list from which the following entries are used:

  • TST.kind: ["cv"|"rand"|"col"]

  • TST.NFOLD: number of CV folds (only relevant in case TST.kind=="cv")

  • TST.COL: column of dset containing the (0/1/<0) index (only relevant in case TST.kind=="col") or NULL if no such column exists

  • TST.valiFrac: fraction of records to set aside for validation (only relevant in case TST.kind=="rand")

  • TST.trnFrac: [1-opts$TST.valiFrac] fraction of records to use for training; the bracketed value is the default (only relevant in case TST.kind=="rand")

stratified

[FALSE] if TRUE, do stratified sampling for TST.kind="rand", with at least one training record for each level of the response variable (classification)
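The stratified case can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the code TDMR runs: the response vector y and the fraction trn_frac are made up, every non-training record is simply put into the validation set, and the per-level training count is assumed to be a rounded fraction with a floor of one record.

```r
# Illustrative sketch in base R; NOT the TDMR implementation.
set.seed(1)
y <- factor(c(rep("a", 12), rep("b", 7), "c"))  # hypothetical response; level "c" is rare
trn_frac <- 0.5                                 # stands in for opts$TST.trnFrac

cvi <- rep(1, length(y))                        # start with every record in validation
for (lev in levels(y)) {
  idx   <- which(y == lev)                      # records belonging to this level
  # training count per level, but never fewer than one record
  n_trn <- max(1, round(trn_frac * length(idx)))
  # sample positions within idx (sample.int avoids sample()'s length-1 pitfall)
  cvi[idx[sample.int(length(idx), n_trn)]] <- 0
}
print(table(y, cvi))                            # each level has >= 1 training record
```

Note that a level with a single record, such as "c" here, contributes that record to training and therefore has no validation record.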

Value

cvi: the training-validation-set index vector, where 0 marks training records and values >0 mark validation records. All records with cvi<0 (e.g. from column TST.COL) are disregarded.
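As a rough illustration of how such an index vector can be consumed, the base-R sketch below splits a made-up data frame by hand-written index vectors. The data and index values are hypothetical. The fold loop assumes the usual k-fold convention (fold k validates on the records labelled k and trains on the rest), which this page does not spell out.

```r
# Illustrative sketch in base R; the data frame and index vectors are made up.
dset <- data.frame(x = 1:6, y = c("a", "b", "a", "b", "a", "b"))

# Single-split case ("rand" or "col"): 0 = training, 1 = validation, -1 = disregard
cvi <- c(0, 0, 1, -1, 0, 1)
d_train <- dset[cvi == 0, ]
d_vali  <- dset[cvi == 1, ]       # the record with cvi == -1 appears in neither set

# Cross-validation case (assumed convention): fold k validates on cvi_cv == k
cvi_cv <- c(1, 2, 3, 1, 2, 3)
for (k in sort(unique(cvi_cv))) {
  d_vali_k  <- dset[cvi_cv == k, ]
  d_train_k <- dset[cvi_cv != k, ]   # assumes cvi_cv has no negative entries
  cat("fold", k, ": train", nrow(d_train_k), "validate", nrow(d_vali_k), "\n")
}
```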