Build a Random Forest with importance=TRUE. Usually the RF is smaller (50 trees) to speed up computation. Use na.roughfix for missing-value replacement. Decide which input variables to keep and return them in SRF$input.variables.
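The core mechanism can be sketched as follows (a minimal illustration based on the randomForest package; the selection step shows the "xperc" rule described under opts$SRF.kind below, and all details beyond this help page are assumptions, not TDMR's actual code):

  library(randomForest)

  d_train <- na.roughfix(d_train)                  # replace missing values
  rf  <- randomForest(x = d_train[, input.variables],
                      y = d_train[, response.variable],
                      ntree = 50,                  # small forest, for speed
                      importance = TRUE)
  imp <- importance(rf, type = 1)                  # mean decrease in accuracy
  imp <- sort(imp[, 1], decreasing = TRUE)
  # "xperc" rule: keep variables covering 95% of the cumulative importance
  # (the real function additionally enforces the lower bound SRF.minlsi)
  keep <- names(imp)[cumsum(imp) / sum(imp) <= 0.95]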
Usage:

  tdmModSortedRFimport(d_train, response.variable, input.variables, opts)

Arguments:

  d_train            training set
  response.variable  the target column from d_train to use for the RF model
  input.variables    the input columns from d_train to use for the RF model
  opts               options; here we use the elements [defaults in brackets]:
    SRF.kind: one of
      ="xperc": keep a certain importance percentage, starting from the most important variable
      ="ndrop": drop a certain number of least important variables
      ="nkeep": keep a certain number of most important variables
      ="none":  do not call tdmModSortedRFimport at all (see tdmRegress.r and tdmClassify.r)
    SRF.ndrop:   [0] how many variables to drop (if SRF.kind=="ndrop")
    SRF.XPerc:   [0.95] if >=0, keep that importance percentage, starting with the most important variables (if SRF.kind=="xperc")
    SRF.calc:    [TRUE] =TRUE: calculate the importance and save it to SRF.file; =FALSE: load it from SRF.file (SRF.file = Output/<filename>.SRF.<response.variable>.Rdata)
    SRF.ntree:   [50] number of RF trees
    SRF.verbose: [2] verbosity level
    SRF.maxS:    [40] how many variables to show in the plot
    SRF.minlsi:  [1] a lower bound for the length of SRF$input.variables
    RF.sampsize: sampsize for the RF, set prior to calling this function via tdmModAdjustSampsize(opts$SRF.samp,...)
    GD.DEVICE:   if !="non", then make a bar plot on the current graphics device
    CLS.CLASSWT: class weight vector to use in Random Forest training
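A hypothetical usage example on the iris data (it assumes that tdmOptsDefaultsSet() from TDMR supplies the remaining default options; dataset and settings are purely illustrative):

  library(TDMR)

  opts <- tdmOptsDefaultsSet()        # assumed: fills in all TDMR defaults
  opts$SRF.kind  <- "xperc"
  opts$SRF.XPerc <- 0.95
  opts$SRF.ntree <- 50

  SRF <- tdmModSortedRFimport(iris, "Species",
                              setdiff(names(iris), "Species"), opts)
  SRF$input.variables                 # variables kept, most important first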
Value:

  SRF, a list with the following elements:
    input.variables: the vector of input variables which remain after importance processing, sorted by decreasing importance
    s_input: all input.variables sorted by decreasing importance
    the importance values for s_input
    s_dropped: a vector with the names of the dropped variables
    the length of s_dropped
    the percentage of total importance which is in the dropped variables
    opts: the options, where some defaults might have been added
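The reduced variable set is typically fed into the subsequent model-building step, e.g. by constructing a formula from it (illustrative, not TDMR code):

  fml <- as.formula(paste(response.variable, "~",
                          paste(SRF$input.variables, collapse = " + ")))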