tdmClassify: Core classification function of TDMR.

Description

tdmClassify is called by tdmClassifyLoop and returns an object of class tdmClass. It trains a model on training set d_train and evaluates it on test set d_test. If this function is used for tuning, the test set d_test plays the role of a validation set.

Usage

tdmClassify(
  d_train,
  d_test,
  d_dis,
  d_preproc,
  response.variables,
  input.variables,
  opts,
  tsetStr = c("Validation", "validation")
)

Arguments

d_train

training set

d_test

validation set, same columns as training set

d_dis

'disregard set', i.e. everything that is neither train nor test. The model is applied to all records in d_dis (needed for active learning, see ssl_methods.r)

d_preproc

data used for preprocessing. May be NULL, if no preprocessing is done (opts$PRE.SFA=="none" and opts$PRE.PCA=="none"). If preprocessing is done, then d_preproc is usually all non-validation data.

response.variables

name of column which carries the target variable - or - vector of names specifying multiple target columns (these columns are not used during prediction, only for evaluation)

input.variables

vector with names of input columns

opts

additional parameters [defaults in brackets]

SRF.*: several parameters for tdmModSortedRFimport
RF.*: several parameters for RF (Random Forest, defaults are set, if omitted)
SVM.*: several parameters for SVM (Support Vector Machines, defaults are set, if omitted)
filename
data.title
MOD.method: ["RF"] the main training method ["RF"|"MC.RF"|"SVM"|"NB"]: use [Random forest| MetaCost-RF| SVM| Naive Bayes] for the main model
MOD.SEED: =NULL: get a new random number seed with tdmRandomSeed (different RF trainings). =any value: set the random number seed to this value (+i) to get reproducible random numbers. In this way, the model training part (RF, NNET, ...) always gets a fixed seed (see also TST.SEED in tdmClassifyLoop)
CLASSWT: class weights (NULL, if all classes should have the same weight) (currently used only by methods RF, MC.RF and by tdmModSortedRFimport)
fct.postproc: [NULL] name of user-defined function for postprocessing of predicted output
GD.DEVICE: if !="non", then make a pairs-plot of the 5 most important variables and make a true-false bar plot
VERBOSE: [2] =2: most printed output, =1: less, =0: no output
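
The MOD.SEED rule described above can be illustrated in base R. This is only a sketch of the documented behaviour, not TDMR's actual code: the helper set_model_seed is hypothetical, and drawing a seed with sample.int is an assumed stand-in for tdmRandomSeed.

```r
# Hypothetical helper (not part of TDMR) that mimics the documented MOD.SEED rule:
#   mod_seed = NULL  -> draw a fresh seed on every call (non-reproducible)
#   mod_seed = value -> use value + i, so iteration i is reproducible
set_model_seed <- function(mod_seed, i) {
  if (is.null(mod_seed)) {
    seed <- sample.int(.Machine$integer.max, 1)  # assumed stand-in for tdmRandomSeed
  } else {
    seed <- mod_seed + i
  }
  set.seed(seed)
  invisible(seed)
}

# Same MOD.SEED and same iteration index give identical random numbers
set_model_seed(42, i = 1)
a <- runif(3)
set_model_seed(42, i = 1)
b <- runif(3)
identical(a, b)
```

With a fixed MOD.SEED, repeating a training run with the same index i reproduces the same random draws, while different indices still get different seeds.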

tsetStr

[c("Validation", "validation")]

Value

res, an object of class tdmClass. This is a list containing

d_train

training set + predicted class column(s)

d_test

test set + predicted class column(s)

d_dis

disregard set + predicted class column(s)

avgEVAL

list with evaluation measures, averaged over all response variables

allEVAL

data frame with evaluation measures, one row for each response variable

lastCmTrain

a list with evaluation info for training set (confusion matrix, gain, class errors, ...)

lastCmVali

a list with evaluation info for validation set (confusion matrix, gain, class errors, ...)

lastModel

the last model built (i.e. for the last response variable)

lastProbs

a list with three probability matrices (row: records, col: classes) v_train, v_test, v_dis, if the model provides probabilities; NULL else.

lastPred

name of the column where the prediction of the last model is appended to the datasets d_train, d_test and d_dis

predProb

a list with two data frames Trn and Val. They contain at least a column IND.dset (index of each train / validation record into data frame dset). If the model has probabilities, then they contain in addition a column for each response variable with the prediction probabilities.

opts

parameter list from input, some default values might have been added

The 9 evaluation measures in avgEVAL and allEVAL are cerr.* (misclassification error), gain.* (total gain) and rgain.* (relative gain, i.e. total gain divided by max. achievable gain in *) where * = [trn | tst | tst2 ] stands for [ training set | test set | test set with special treatment ] and the special treatment is either opts$test2.string = "no postproc" or = "default cutoff". The five items lastCmTrain, lastCmVali, lastModel, lastProbs, lastPred are specific to the *last* model (the one built for the last response variable in the last run and last fold).
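
To make the three kinds of measures concrete, the following base-R sketch computes them for a toy confusion matrix. It is not TDMR's implementation: the labels are made up, and the gain matrix is assumed to be the identity (gain 1 for each correctly classified record, 0 otherwise), so the maximum achievable gain is what a perfect classifier would obtain.

```r
# Toy true and predicted labels (illustrative only)
truth <- factor(c("a", "a", "b", "b", "c", "c"), levels = c("a", "b", "c"))
pred  <- factor(c("a", "b", "b", "b", "c", "a"), levels = c("a", "b", "c"))

cm <- table(truth, pred)           # confusion matrix: rows = true, cols = predicted
gain_mat <- diag(nlevels(truth))   # assumption: identity gain matrix

cerr     <- 1 - sum(diag(cm)) / sum(cm)          # misclassification error
gain     <- sum(cm * gain_mat)                   # total gain
max_gain <- sum(rowSums(cm) * diag(gain_mat))    # gain of a perfect classifier
rgain    <- gain / max_gain                      # relative gain

c(cerr = cerr, gain = gain, rgain = rgain)
```

Here 4 of 6 records are classified correctly, so cerr is 1/3, gain is 4 and rgain is 2/3. With a non-identity gain matrix, the definition of the maximum achievable gain would need to follow whatever TDMR actually uses.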

Details

Currently d_dis is allowed to be a 0-row data frame, but d_train and d_test must have at least one record.
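
A 0-row data frame that keeps the column layout of the data can be built by indexing with an empty numeric vector. The sketch below uses only base R; the helper check_sets is hypothetical and merely restates the constraint from the sentence above.

```r
data(iris)

# Empty disregard set: zero rows, same columns as the source data frame
d_dis <- iris[numeric(0), ]
nrow(d_dis)
identical(names(d_dis), names(iris))

# Hypothetical guard (not part of TDMR) restating the documented constraint
check_sets <- function(d_train, d_test, d_dis) {
  stopifnot(nrow(d_train) >= 1, nrow(d_test) >= 1, nrow(d_dis) >= 0)
  invisible(TRUE)
}

check_sets(iris[1:100, ], iris[101:150, ], d_dis)
```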

Examples


# NOT RUN {
#*# This demo shows a simple data mining process (phase 1 of TDMR) for classification on
#*# dataset iris.
#*# The data mining process in tdmClassify calls randomForest as the prediction model.
#*# It is called once (opts$NRUN=1) with one random train-validation split.
#*# Therefore data frame res$allEVAL has one row.
#*#
opts=tdmOptsDefaultsSet()                       # set all defaults for data mining process
gdObj <- tdmGraAndLogInitialize(opts);          # init graphics and log file

data(iris)
response.variables="Species"                    # names, not data (!)
input.variables=setdiff(names(iris),"Species")
opts$NRUN=1

idx_train = sample(nrow(iris))[1:110]
d_train=iris[idx_train,]
d_vali=iris[-idx_train,]
d_dis=iris[numeric(0),]
res <- tdmClassify(d_train,d_vali,d_dis,NULL,response.variables,input.variables,opts)

cat("\n")
print(res$allEVAL)

# }
