rfTrain: Rapid Decision Tree Training

Description

Accelerated training using the Random Forest (trademarked name) algorithm. Tuned for multicore and GPU hardware. Bindable with most numerical front-end languages in addition to R.

Usage

# S3 method for default
rfTrain(preFormat,
        sampler,
        y,
        autoCompress = 0.25,
        ctgCensus = "votes",
        classWeight = NULL,
        maxLeaf = 0,
        minInfo = 0.01,
        minNode = if (is.factor(y)) 2 else 3,
        nLevel = 0,
        nThread = 0,
        predFixed = 0,
        predProb = 0.0,
        predWeight = NULL,
        regMono = NULL,
        splitQuant = NULL,
        thinLeaves = FALSE,
        treeBlock = 1,
        verbose = FALSE,
        ...)

Value

an object of class arbTrain, containing:

version: the version of the Rborist package used to train.
samplerHash: hash value of the Sampler object used to train. Recorded for consistency of subsequent commands.
predInfo: a vector of forest-wide Gini (classification) or weighted variance (regression), by predictor.
predMap: a vector of integers mapping internal to front-end predictor indices.
forest: an object of class Forest containing:
- nTree: the number of trees trained.
- node: an object of class Node consisting of:
  - treeNode: forest-wide vector of packed node representations.
  - extent: per-tree node counts.
  - scores: numeric vector of scores, for all terminals and nonterminals.
  - factor: an object of class Factor consisting of:
    - facSplit: forest-wide vector of packed factor bits.
    - extent: per-tree extent of factor bits.
    - observed: forest-wide vector of observed factor bits.
- Leaf: an object of class Leaf containing:
  - extent: forest-wide vector of leaf populations, i.e., counts of unique samples.
  - index: forest-wide vector of sample indices.
diag: diagnostics accumulated over the training task.
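
The extent fields listed above suggest that forest-wide vectors hold per-tree segments laid end to end. The base-R sketch below uses made-up values, not a real arbTrain object, to show how per-tree counts could be used to recover each tree's slice of a flat vector. The names extent and scores mirror the fields above; treeId, perTree and treeStart are illustrative only, and the concatenated layout itself is an assumption.

```r
# Mock data only: three trees with 3, 5 and 2 nodes respectively.
extent <- c(3L, 5L, 2L)
scores <- c(0.1, 0.4, 0.2,
            1.0, 1.5, 0.9, 1.1, 1.3,
            2.0, 2.2)

# Tag each forest-wide position with the index of the tree assumed to own it.
treeId <- rep(seq_along(extent), times = extent)

# Split the flat vector into one chunk per tree.
perTree <- split(scores, treeId)

# Zero-based starting offset of each tree within the flat vector.
treeStart <- cumsum(c(0L, head(extent, -1L)))
```

Here perTree[[2]] holds the five mock scores of the second tree, and treeStart gives the position just before each tree's first entry.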

Arguments

y: the response (outcome) vector, either numerical or categorical.
preFormat: Compressed, presorted representation of the predictor values. Row count must conform with y.
sampler: Compressed representation of the sampled response.
autoCompress: plurality above which to compress predictor values.
ctgCensus: report categorical validation by vote or by probability.
classWeight: proportional weighting of classification categories.
maxLeaf: maximum number of leaves in a tree. Zero denotes no limit.
minInfo: information ratio with parent below which node does not split.
minNode: minimum number of distinct row references to split a node.
nLevel: maximum number of tree levels to train, including terminals (leaves). Zero denotes no limit.
nThread: suggests an OpenMP-style thread count. Zero denotes the default processor setting.
predFixed: number of trial predictors for a split (mtry).
predProb: probability of selecting individual predictor as trial splitter.
predWeight: relative weighting of individual predictors as trial splitters.
regMono: signed probability constraint for monotonic regression.
splitQuant: (sub)quantile at which to place cut point for numerical splits.
thinLeaves: bypasses creation of leaf state in order to reduce memory footprint.
treeBlock: maximum number of trees to train during a single level (e.g., coprocessor computing).
verbose: indicates whether to output progress of training.
...: Not currently used.
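
The difference between predFixed and predProb can be pictured with a small base-R sketch. It only illustrates the two selection schemes as the argument descriptions word them: a fixed number of distinct candidates versus an independent Bernoulli trial per predictor. It is not the package's internal sampler. The helpers pickFixed and pickProb are hypothetical, and applying predWeight only to the fixed-count scheme is a simplifying assumption.

```r
set.seed(1)                      # reproducible draws
nPred <- 5                       # hypothetical predictor count
predWeight <- c(2, 2, 2, 1, 1)   # relative weights, as in the predWeight argument

# predFixed-style selection: a fixed number of distinct candidates per split,
# drawn without replacement and in proportion to the weights.
pickFixed <- function(nPred, predFixed, predWeight = rep(1, nPred)) {
  sample.int(nPred, size = min(predFixed, nPred), prob = predWeight)
}

# predProb-style selection: each predictor is kept independently with
# probability predProb, so the number of candidates varies from split to split.
pickProb <- function(nPred, predProb) {
  which(runif(nPred) < predProb)
}

fixedCand <- pickFixed(nPred, predFixed = 2, predWeight = predWeight)
probCand <- pickProb(nPred, predProb = 0.3)
```

Under this reading, predFixed always yields the same number of candidates, while predProb may yield anywhere from none to all of them at a given split.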

Author

Mark Seligman at Suiji.

Examples

if (FALSE) {
  # Regression example:
  nRow <- 5000
  x <- data.frame(replicate(6, rnorm(nRow)))
  y <- with(x, X1^2 + sin(X2) + X3 * X4) # courtesy of S. Welling.

  # Classification example:
  data(iris)

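  # Generic invocation (preFormat and sampler are assumed to have been
  # built beforehand from x and y):
  rt <- rfTrain(preFormat, sampler, y)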


  # Causes 300 trees to be trained:
  rt <- rfTrain(preFormat, sampler, y, nTree = 300)


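  # Causes validation census to report class probabilities
  # (irisFormat and irisSampler are hypothetical names for a preFormat
  # and sampler built beforehand from the iris predictors and Species):
  rt <- rfTrain(irisFormat, irisSampler, iris$Species, ctgCensus = "prob")


  # Applies table-weighting to classification categories:
  rt <- rfTrain(irisFormat, irisSampler, iris$Species, classWeight = "balance")


  # Weights first category twice as heavily as remaining two:
  rt <- rfTrain(irisFormat, irisSampler, iris$Species,
                classWeight = c(2.0, 1.0, 1.0))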


  # Does not split nodes when doing so yields less than a 2% gain in
  # information over the parent node:
  rt <- rfTrain(preFormat, sampler, y, minInfo = 0.02)


  # Does not split nodes representing fewer than 10 unique samples:
  rt <- rfTrain(preFormat, sampler, y, minNode = 10)


  # Trains a maximum of 20 levels:
  rt <- rfTrain(preFormat, sampler, y, nLevel = 20)


  # Trains, but does not perform subsequent validation:
  rt <- rfTrain(preFormat, sampler, y, noValidate = TRUE)


  # Chooses 500 rows (with replacement) to root each tree.
  rt <- rfTrain(preFormat, sampler, y, nSamp = 500)


  # Chooses 2 predictors as splitting candidates at each node (or
  # fewer, when choices exhausted):
  rt <- rfTrain(preFormat, sampler, y, predFixed = 2)


  # Causes each predictor to be selected as a splitting candidate with
  # distribution Bernoulli(0.3):
  rt <- rfTrain(preFormat, sampler, y, predProb = 0.3)


  # Causes first three predictors to be selected as splitting candidates
  # twice as often as the other three (one weight per column of x):
  rt <- rfTrain(preFormat, sampler, y,
                predWeight = c(2.0, 2.0, 2.0, 1.0, 1.0, 1.0))


  # Constrains modelled response to be increasing with respect to X1
  # and decreasing with respect to X5.
  rt <- rfTrain(preFormat, sampler, y, regMono = c(1.0, 0, 0, 0, -1.0, 0))


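  # Suppresses creation of detailed leaf information needed for
  # quantile prediction and external tools.
  rt <- rfTrain(preFormat, sampler, y, thinLeaves = TRUE)


  # Places the cut point at the lower end of the splitting interval for
  # the first predictor, at the upper end for the second, and at the
  # midpoint for the rest (R vectors are indexed from 1):
  spq <- rep(0.5, ncol(x))
  spq[1] <- 0.0
  spq[2] <- 1.0
  rt <- rfTrain(preFormat, sampler, y, splitQuant = spq)
}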
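
One way to read the splitQuant argument is as an interpolation weight between the two adjacent predictor values that bracket a numerical split. The sketch below illustrates that reading in base R. The helper cutPoint is hypothetical and not part of the package, and linear interpolation is an assumption rather than a documented guarantee.

```r
# Hypothetical helper: places a cut point between two adjacent observed
# values, at fraction q of the way from the lower to the upper value.
cutPoint <- function(lower, upper, q) {
  stopifnot(q >= 0, q <= 1, lower <= upper)
  lower + q * (upper - lower)
}

mid <- cutPoint(2, 4, 0.5)   # halfway between the two values
lo  <- cutPoint(2, 4, 0.0)   # coincides with the lower value
hi  <- cutPoint(2, 4, 1.0)   # coincides with the upper value
```

Under this interpretation, a splitQuant entry of 0.5 corresponds to the midpoint, while entries of 0.0 and 1.0 push the cut to either end of the interval.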