opts
Set up and return a list opts with default settings. The list opts contains all DM-related settings which are needed by main_<TASK>.
For better readability, most elements of opts are arranged in groups:
dir.* | path-related settings
READ.* | data-reading-related settings
TST.* | resampling-related settings (training, validation and test set, CV)
PRE.* | preprocessing parameters
SRF.* | several parameters for tdmModSortedRFimport
MOD.* | general settings for models and model building
RF.* | several parameters for model RF (Random Forest)
SVM.* | several parameters for model SVM (Support Vector Machines)
ADA.* | several parameters for model ADA (AdaBoost)
CLS.* | classification-related settings
GD.* | settings for the graphic devices
tdmOptsDefaultsSet(opts = NULL, path = ".")
opts: (optional) the options already set
path: ["."] where to find everything for the DM task.
a list opts, with defaults set for all options relevant for a DM task, containing the following elements
["."] where to find everything for the DM task
[data] where to find .txt/.csv files
[data] where to find other data files, including .Rdata
[Output] where to put output files
["default.txt"] the task data
[NULL] the test data, only relevant for READ.TstFn!=NULL
["Default Data"] title for plots
[T] =T: read data from .csv and save as .Rdata, =F: read from .Rdata
[-1] read this amount of rows or -1 for 'read all rows'
function to be passed into tdmReadDataset. Signature: function(opts), returning a data frame. It reads the train-validation data.
[NULL] function to be passed into tdmReadDataset. Signature: function(opts), returning a data frame. It reads a separate test data file. If NULL, this reading step is skipped.
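A reader function with the signature function(opts) could look roughly like the following sketch. The function name readTrnSketch and the list elements my.file and my.nrow are hypothetical placeholders chosen for this illustration, not TDMR option names; only opts$path is taken from this page.

```r
# Hypothetical reader with the signature function(opts) -> data frame.
# 'my.file' and 'my.nrow' are illustrative element names, not TDMR options.
readTrnSketch <- function(opts) {
  fname <- file.path(opts$path, opts$my.file)
  # a negative nrows makes read.csv read all rows
  utils::read.csv(fname, nrows = opts$my.nrow, stringsAsFactors = FALSE)
}

# Usage with a temporary CSV file
tmp <- tempfile(fileext = ".csv")
utils::write.csv(data.frame(x = 1:5, y = letters[1:5]), tmp, row.names = FALSE)
opts <- list(path = dirname(tmp), my.file = basename(tmp), my.nrow = -1)
dset <- readTrnSketch(opts)
```

A function of this shape receives everything it needs through opts, so the same reader can be reused for different tasks by changing list elements only.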
[TRUE] read the task data initially, i.e. prior to tuning, using tdmReadDataset.
If =FALSE, the data are read anew in each pass through main_TASK, i.e. in each tuning step (deprecated).
["rand"] one of the choices from {"cv","rand","col"}, see tdmModCreateCVindex for details
["TST.COL"] name of column with train/test/disregard-flag
[3] number of CV-folds (only for TST.kind=="cv")
[0.1] set this fraction of the train-validation data aside for validation (only for TST.kind=="rand")
[0.1] set prior to tuning this fraction of data aside for testing (if tdm$umode=="SP_T" and opts$READ.INI==TRUE) or set this fraction of data aside for testing after tuning (if tdm$umode=="RSUB" or =="CV")
[NULL] train set fraction, if NULL then tdmModCreateCVindex will set it to 1 - opts$TST.valiFrac.
[NULL] a seed for the random test set selection (tdmRandomSeed) and random validation set selection (tdmClassifyLoop). If NULL, use tdmRandomSeed.
["none" (default)|"linear"] PCA preprocessing: [don't | do normal PCA (prcomp) ]
[T] =T: replace the original numerical columns with the PCA columns, =F: add the PCA columns
[0] if >0: add monomials of degree 2 from the first PRE.PCA.npc columns (PCs) (only active, if opts$PRE.PCA!="none")
["none" (default)|"2nd"] SFA preprocessing (see package rSFA-package): [don't | do normal SFA with 2nd degree expansion ]
[F] =T: replace the original numerical columns with the SFA columns; =F: add the SFA columns
[0] if >0: add monomials of degree 2 from the first PRE.SFA.npc columns (only active, if opts$PRE.SFA!="none")
[11] number of inputs after SFA preprocessing, only those inputs enter into SFA expansion
[5] number of SFA output dimensions (slowest signals) to return
[T] =F|T: don't | do parametric bootstrap for SFA in case of marginal training data
[sfaPBootstrap] the function to call in case of parametric bootstrap, see sfaPBootstrap in package rSFA-package for its interface description
[F] if =T, then use all non-validation data in the training-validation set for PCA or SFA preprocessing. If =F, use only the training set for PCA or SFA processing (only relevant if opts$PRE.PCA!="none" or opts$PRE.SFA!="none").
[0.99] bind the fraction 1-PRE.Xpgroup in column OTHER (see tdmPreGroupLevels)
[32] bind the N-32+1 least frequent cases in column OTHER (see tdmPreGroupLevels)
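The level-count rule above (bind the N-32+1 least frequent cases into OTHER, so that at most 32 levels remain) can be illustrated with a small base-R sketch. The helper groupLevelsSketch and its argument max.levels are hypothetical; the actual tdmPreGroupLevels may handle ties and edge cases differently.

```r
# Hypothetical helper: keep the (max.levels - 1) most frequent levels and
# bind all remaining levels into a single level "OTHER".
groupLevelsSketch <- function(x, max.levels) {
  freq <- sort(table(x), decreasing = TRUE)
  if (length(freq) <= max.levels) return(factor(x))
  keep <- names(freq)[seq_len(max.levels - 1)]
  x <- as.character(x)
  x[!(x %in% keep)] <- "OTHER"
  factor(x)
}

# N = 5 levels, max.levels = 3: the 5-3+1 = 3 least frequent levels are bound
x <- c(rep("a", 5), rep("b", 4), rep("c", 2), "d", "e")
grouped <- groupLevelsSketch(x, max.levels = 3)
```

In this example the levels a and b survive, while c, d and e end up in OTHER.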
["xperc" (default) |"ndrop" |"nkeep" |"none" ] the method used for feature selection, see tdmModSortedRFimport
[0] how many variables to drop (only relevant if SRF.kind=="ndrop")
[NULL] how many variables to keep, NULL="keep all" (only relevant if SRF.kind=="nkeep")
[0.95] if >=0, keep that importance percentage, starting with the most important variables (if SRF.kind=="xperc")
[T] =T: calculate importance & save on SRF.file, =F: load from srfFile (srfFile = Output/<confFile>.SRF.Rdata)
[50] number of RF trees
sampsize for RF in importance estimation. See RF.samp for further info on sampsize.
[2]
[40] how many variables to show in plot
[1] a lower bound for the length of SRF$input.variables
["RFimp"]
[TRUE] option 'scale' for call importance() in tdmModSortedRFimport
[NULL] a seed for the random model initialization (if model is non-deterministic). If NULL, use tdmRandomSeed.
["RF" (default) |"MC.RF" |"SVM" |"NB" ]: use [RF | MetaCost-RF | SVM | Naive Bayes ] in tdmClassify
["RF" (default) |"SVM" |"LM" ]: use [RF | SVM | linear model ] in tdmRegress
[500]
[1000] sampsize for RF in model training. If RF.samp is a scalar, then it specifies the total size of the sample. For classification, it can also be a vector of length n.class (= # of levels in response variable), then it specifies the size of each stratum. The sum of the vector is the total sample size. If NULL, RF.samp will be replaced by 3000 later in tdmModAdjustSampsize*.
[NULL]
[1]
[TRUE] if =T, return OOB-training set error as tuning measure; if =F, return validation set error
[FALSE]
[3] =1: linear, =2: polynomial, =3: RBF, =4: sigmoid
[0.005] needed only for regression
[0.005]
[0.0] (needed only for opts$SVM.kernel=="polynomial" or =="sigmoid")
[3] (needed only for opts$SVM.kernel=="polynomial")
[0.008]
[1] =1: "Breiman", =2: "Freund", =3: "Zhu" as value for boosting(...,coeflearn,...) (AdaBoost)
[10] number of trees in AdaBoost = mfinal boosting(...,mfinal,...)
[20] minimum number of observations in a node in order for a split to be attempted
[NULL] vote fractions for the classes (vector of length n.class = # of levels in response variable). The class i with maximum ratio (% votes)/CLS.cutoff[i] wins. If NULL, then each class gets the cutoff 1/n.class (i.e. majority vote wins). The smaller CLS.cutoff[i], the more likely class i will win.
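The cutoff rule described above can be sketched in a few lines of base R. The function pickClassSketch is a hypothetical helper written for this illustration only; it mirrors the stated rule (the class with the maximum ratio of vote fraction to cutoff wins), not the internal code of TDMR or randomForest.

```r
# Hypothetical helper: pick the class with the largest ratio votes/cutoff.
# With cutoff = NULL every class gets 1/n.class, i.e. the majority vote wins.
pickClassSketch <- function(votes, cutoff = NULL) {
  if (is.null(cutoff)) cutoff <- rep(1 / length(votes), length(votes))
  names(votes)[which.max(votes / cutoff)]
}

votes <- c(A = 0.6, B = 0.4)           # vote fractions for one observation
winDefault <- pickClassSketch(votes)   # ratios 1.2 vs 0.8
winShifted <- pickClassSketch(votes, cutoff = c(A = 0.7, B = 0.3))
# ratios 0.6/0.7 = 0.857 vs 0.4/0.3 = 1.333
```

With the default cutoff class A wins; lowering the cutoff of class B to 0.3 makes B win although it has fewer votes.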
[NULL] class weights for the n.class classes, e.g. c(A=10,B=20) for a 2-class problem with classes A and B (the higher, the more costly is a misclassification of that real class). It should be a named vector with the same length and names as the levels of the response variable. If no names are given, the levels of the response variables in lexicographical order will be attached in tdmClassify. CLS.CLASSWT=NULL for no weights.
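The naming convention for class weights can be illustrated as follows. This is a minimal sketch of the stated rule (unnamed weights receive the response levels in lexicographical order); the variable names response and classwt are illustrative, and the code is not taken from tdmClassify.

```r
# Illustrative response variable with the two classes A and B
response <- factor(c("B", "A", "B", "A"))

# Unnamed weights: attach the levels in lexicographical order
classwt <- c(10, 20)
if (is.null(names(classwt))) names(classwt) <- sort(levels(response))

# Sanity checks matching the requirements stated above
stopifnot(length(classwt) == nlevels(response))
stopifnot(setequal(names(classwt), levels(response)))
```

Passing a named vector such as c(A=10,B=20) from the start avoids any ambiguity about which weight belongs to which class.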
[NULL] (n.class x n.class) gain matrix. If NULL, CLS.gainmat will be set to unit matrix in tdmClassify
["rgain" (default) |"meanCA" |"minCA" ] in case of tdmClassify: For classification, the measure Rgain returned from tdmClassifyLoop in result$R_* is [relative gain (i.e. gain/gainmax) | mean class accuracy | minimum class accuracy | minus Y ]. The goal is to maximize Rgain.
For binary classification there are the additional measures [ "arROC" | "arLIFT" | "arPRE" | "bYouden" ], see 'Value' in tdmModConfmat.
For regression, the goal is to minimize result$R_* returned from tdmRegress. In this case, possible values are rgain.type = ["rmae" (default) |"rmse" | "made" ] which stands for [ relative mean absolute error | root mean squared error | mean absolute deviation ].
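The three classification measures can be made concrete with a small confusion matrix. The sketch below assumes that gain is the sum of gain-matrix entries weighted by the confusion-matrix counts and that gainmax is the gain reached when every case receives the most rewarding prediction for its true class; this page only states gain/gainmax, so both definitions are assumptions, and tdmModConfmat remains the authoritative source.

```r
# Illustrative confusion matrix: rows = actual class, columns = predicted class
cm <- matrix(c(40, 10,
                5, 45), nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("A", "B"), predicted = c("A", "B")))
gainmat <- diag(2)   # unit gain matrix, the default when CLS.gainmat is NULL

gain    <- sum(gainmat * cm)                           # assumed definition
gainmax <- sum(apply(gainmat, 1, max) * rowSums(cm))   # assumed definition
rgain   <- gain / gainmax                              # relative gain

classAcc <- diag(cm) / rowSums(cm)   # accuracy per actual class
meanCA   <- mean(classAcc)           # mean class accuracy
minCA    <- min(classAcc)            # minimum class accuracy
```

With a unit gain matrix the relative gain under these assumptions coincides with the overall accuracy, which is why "rgain" is a natural default.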
[0] if >0, activate tdmParaBootstrap in tdmClassify
[NULL] name of a function with signature (pred, dframe, opts) where pred is the prediction of the model on the data frame dframe and opts is this list. This function may do some postprocessing on pred and it returns a (potentially modified) pred. This function will be called in tdmClassify if it is not NULL.
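A postprocessing function with the signature (pred, dframe, opts) might look like the sketch below. The function name postFallbackSketch, the column x and the fallback rule are hypothetical and serve only to show the shape of such a function; the sketch also assumes that pred is a factor, which this page does not state.

```r
# Hypothetical postprocessing: wherever column 'x' of dframe is missing,
# overwrite the prediction with the first class level as a fallback.
postFallbackSketch <- function(pred, dframe, opts) {
  missingX <- is.na(dframe$x)
  pred[missingX] <- levels(pred)[1]
  pred   # return the (potentially modified) prediction
}

# Usage on toy data
dframe <- data.frame(x = c(1, NA, 3))
pred   <- factor(c("B", "B", "A"), levels = c("A", "B"))
post   <- postFallbackSketch(pred, dframe, opts = list())
```

The function must return a prediction of the same length as its input, since it replaces pred in the calling code.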
["win"] ="win": all graphics to (several) windows (windows or X11 in package grDevices)
="rstudio": same as "win", but all graphics go to the RStudio device
="pdf": all graphics to one multi-page PDF
="png": all graphics in separate PNG files in opts$GD.PNGDIR
="non": no graphics at all
This concerns the TDMR graphics, not the SPOT (or other tuner) graphics.
If running R from RStudio (if there is a device with name "RStudioGD")
then the default "win" is changed to "rstudio" automatically.
[T] =T: restart the graphics device (i.e. close all 'old' windows or re-open multi-page pdf) in each call to tdmClassify or tdmRegress, resp.
=F: leave all windows open (suitable for calls from SPOT) or write more pages in same pdf.
[T] =T: close graphics device "png", "pdf" at the end of main_*.r (suitable for main_*.r solo) or =F: do not close (suitable for call from tdmStartSpot2, where all windows should remain open)
[2] how many runs with different train & test samples - or - how many CV-runs, if opts$TST.kind=="cv"
[FALSE]
[FALSE]
["default cutoff"]
[2] =2: print much output, =1: less, =0: none
The path-related settings are relative to opts$path, if it is defined, else relative to the current directory.
Finally, the function tdmOptsDefaultsFill(opts) is called to fill in further details, depending on the current settings of opts.
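A typical call pattern is sketched below: obtain the defaults, then override individual elements. The sketch assumes that the TDMR package exports tdmOptsDefaultsSet with the usage shown on this page; if the package is not installed, an empty list serves as a stand-in so that the remaining lines still run. The element names TST.kind and PRE.PCA and their values are taken from the descriptions above.

```r
# Get the default options if TDMR is available, else use an empty stand-in list
if (requireNamespace("TDMR", quietly = TRUE)) {
  opts <- TDMR::tdmOptsDefaultsSet(path = ".")
} else {
  opts <- list()   # stand-in only, contains none of the documented defaults
}

# Override selected defaults for the task at hand
opts$TST.kind <- "cv"       # cross validation instead of the default "rand"
opts$PRE.PCA  <- "linear"   # switch on normal PCA preprocessing
```

Starting from the defaults and changing only a few elements keeps task scripts short and makes deviations from the default configuration easy to spot.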