This implementation of the random forest (and bagging) algorithm differs
from the reference implementation in randomForest
with respect to the base learners used and the aggregation scheme applied.
Conditional inference trees, see ctree, are fitted to each of the ntree
(defined via cforest_control) bootstrap samples of the learning sample.
Most of the hyperparameters in cforest_control regulate the construction
of the conditional inference trees.
Therefore you MUST NOT change anything you don't understand completely.
Hyperparameters you might want to change in cforest_control are:
1. The number of randomly preselected variables mtry, which is fixed
to the value 5 by default here for technical reasons, while in randomForest
the default values for classification and regression
vary with the number of input variables.
2. The number of trees ntree. Use more trees if you have more variables.
3. The depth of the trees, regulated by mincriterion. Usually unstopped and
unpruned trees are used in random forests. To grow large trees, set
mincriterion to a small value.
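To make the difference in mtry defaults concrete, here is a minimal sketch in Python (not R, and not code from either package). It assumes the randomForest defaults are floor(sqrt(p)) for classification and max(floor(p/3), 1) for regression, where p is the number of input variables; the function name and the constant are hypothetical and exist only for this illustration.

```python
import math

# Fixed default stated in the text above for cforest_control.
CFOREST_DEFAULT_MTRY = 5


def randomforest_default_mtry(p, classification):
    """Assumed randomForest default for mtry given p input variables.

    Assumption: floor(sqrt(p)) for classification,
    max(floor(p / 3), 1) for regression.
    """
    if classification:
        return math.isqrt(p)
    return max(p // 3, 1)


if __name__ == "__main__":
    for p in (4, 25, 100):
        print(
            p,
            randomforest_default_mtry(p, classification=True),
            randomforest_default_mtry(p, classification=False),
            CFOREST_DEFAULT_MTRY,
        )
```

With many input variables the fixed value 5 can be far smaller than what randomForest would pick, so mtry is one of the settings worth checking first.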
The aggregation scheme works by averaging observation weights extracted
from each of the ntree trees and NOT by averaging predictions directly
as in randomForest.
See Hothorn et al. (2004) for a description.
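The idea of averaging weights rather than predictions can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the package's implementation: it assumes that, for a new observation, each tree contributes the bootstrap weights of those learning observations that share its terminal node, and that the prediction is then computed from the response values under the averaged weights. All function and argument names are hypothetical.

```python
from collections import defaultdict


def aggregate_weights(node_of_train, bag_weights, node_of_new):
    """Average per-tree observation weights for one new observation.

    node_of_train[t][i]: terminal node of learning observation i in tree t
    bag_weights[t][i]:   bootstrap weight (count) of observation i in tree t
    node_of_new[t]:      terminal node the new observation reaches in tree t
    """
    n_trees = len(node_of_train)
    n_obs = len(node_of_train[0])
    total = [0.0] * n_obs
    for t in range(n_trees):
        for i in range(n_obs):
            # Only observations in the same terminal node contribute.
            if node_of_train[t][i] == node_of_new[t]:
                total[i] += bag_weights[t][i]
    return [w / n_trees for w in total]


def class_probabilities(weights, labels):
    """Weighted class frequencies of the learning responses."""
    mass = defaultdict(float)
    for w, y in zip(weights, labels):
        mass[y] += w
    norm = sum(mass.values())
    if norm == 0:
        raise ValueError("all aggregated weights are zero")
    return {y: m / norm for y, m in mass.items()}


if __name__ == "__main__":
    labels = ["a", "a", "b", "b"]
    node_of_train = [[1, 1, 2, 2], [1, 2, 2, 2]]
    bag_weights = [[2, 0, 1, 1], [1, 1, 0, 2]]
    node_of_new = [1, 2]
    w = aggregate_weights(node_of_train, bag_weights, node_of_new)
    print(w)
    print(class_probabilities(w, labels))
```

A forest that averages predictions would instead compute one class distribution per tree and average those, which in general gives different results because trees with heavily populated terminal nodes then count no more than trees with sparse ones.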
Predictions can be computed using predict. For observations
with zero weights, predictions are computed from the fitted tree
when newdata = NULL. While predict returns predictions
of the same type as the response in the data set by default (i.e., predicted class labels for factors),
treeresponse returns the statistics of the conditional distribution of the response
(i.e., predicted class probabilities for factors). The same is done by predict(..., type = "prob").
Note that for multivariate responses predict does not convert predictions to the type
of the response, i.e., type = "prob" is used.
Ensembles of conditional inference trees have not yet been extensively
tested, so this routine is meant for the expert user only and its current
state is rather experimental. However, there are some things available
in cforest that can't be done with randomForest,
for example fitting forests to censored response variables (see Hothorn et al., 2006a) or to
multivariate and ordered responses.
Moreover, when predictors vary in their scale of measurement or number
of categories, variable selection and computation of variable importance are biased
in favor of variables with many potential cutpoints in randomForest,
while in cforest unbiased trees and an adequate resampling scheme
are used by default. See Hothorn et al. (2006b) and Strobl et al. (2007)
as well as Strobl et al. (2009).
The proximity matrix is an \(n \times n\) matrix \(P\) with \(P_{ij}\)
equal to the fraction of trees where observations \(i\) and \(j\)
are elements of the same terminal node (when both \(i\) and \(j\)
had non-zero weights in the same bootstrap sample).
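The definition of \(P\) can be sketched in a few lines of Python. This is an illustration only, not the package's code. It assumes the fraction is taken over all trees in the forest (dividing by the number of trees in which both observations were in the bootstrap sample is another possible reading of the definition), and the function and argument names are hypothetical.

```python
def proximity(node_ids, bag_weights):
    """Proximity matrix from terminal node memberships.

    node_ids[t][i]:    terminal node of observation i in tree t
    bag_weights[t][i]: bootstrap weight of observation i in tree t
    Assumption: counts are divided by the total number of trees.
    """
    n_trees = len(node_ids)
    n = len(node_ids[0])
    counts = [[0] * n for _ in range(n)]
    for t in range(n_trees):
        for i in range(n):
            if bag_weights[t][i] == 0:
                continue  # i was not in this bootstrap sample
            for j in range(n):
                same_node = node_ids[t][i] == node_ids[t][j]
                if bag_weights[t][j] > 0 and same_node:
                    counts[i][j] += 1
    return [[c / n_trees for c in row] for row in counts]


if __name__ == "__main__":
    node_ids = [[1, 1, 2], [1, 2, 2]]
    bag_weights = [[1, 2, 0], [1, 0, 2]]
    for row in proximity(node_ids, bag_weights):
        print(row)
```

Under this reading the matrix is symmetric, and a diagonal entry \(P_{ii}\) equals the fraction of trees whose bootstrap sample contained observation \(i\).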