cinbag

Description

cinbag implements a modified random forest algorithm (based on the source code from the randomForest package by Andy Liaw and Matthew Wiener and on the original Fortran code by Leo Breiman and Adele Cutler) to return the number of times a row appears in a tree's bag. cinbag returns a randomForest object, e.g., rfobj, with an additional output: a matrix with inbag counts (rows) for each tree (columns). For instance, rfobj$inbagCount is similar to rfobj$inbag, but with inbag counts instead of inbag indicators.

Usage

cinbag(x, y=NULL, xtest=NULL, ytest=NULL, ntree=500,
       mtry=if (!is.null(y) && !is.factor(y))
            max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
       replace=TRUE, classwt=NULL, cutoff, strata,
       sampsize = if (replace) nrow(x) else ceiling(.632*nrow(x)),
       nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
       maxnodes = NULL, importance=FALSE, localImp=FALSE, nPerm=1,
       proximity, oob.prox=proximity, norm.votes=TRUE, do.trace=FALSE,
       keep.forest=!is.null(y) && is.null(xtest), corr.bias=FALSE,
       keep.inbag=FALSE, ...)
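A minimal sketch of how the two matrices relate, assuming an object rfobj fitted with keep.inbag=TRUE (as in the Examples below); the comparison is illustrative and not part of the package:

# Assumed: rfobj <- cinbag(x, y, keep.inbag=TRUE)
all(rfobj$inbag == (rfobj$inbagCount > 0))  # the indicators are just "count > 0"
colSums(rfobj$inbagCount)                   # each tree's counts sum to the bag size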
Arguments

x: a data frame or a matrix of predictors, or a formula describing the model to be fitted (for the print method, a randomForest object).

y: a response vector. If a factor, classification is assumed, otherwise regression is assumed. If omitted, randomForest will run in unsupervised mode.

xtest: a data frame or matrix (like x) containing predictors for the test set.

mtry: number of variables randomly sampled as candidates at each split. Note that the default values are different for classification (sqrt(p), where p is the number of variables in x) and regression (p/3).

maxnodes: maximum number of terminal nodes trees in the forest can have. If not given, trees are grown to the maximum possible (subject to limits by nodesize). If set larger than maximum possible, a warning is issued.

localImp: should casewise importance measure be computed? (Setting this to TRUE will override importance.)

norm.votes: if TRUE (default), the final result of votes is expressed as fractions. If FALSE, raw vote counts are returned (useful for combining results from different runs). Ignored for regression.

do.trace: if set to TRUE, give a more verbose output as randomForest is run. If set to some integer, then running output is printed for every do.trace trees.

keep.forest: if set to FALSE, the forest will not be retained in the output object. If xtest is given, defaults to FALSE.

keep.inbag: should an n by ntree matrix be returned that keeps track of which samples are "in-bag" in which trees (but not how many times, if sampling with replacement)?

...: optional parameters to be passed to the low-level function cinbag.default.

For the remaining arguments (ytest, ntree, replace, classwt, cutoff, strata, sampsize, nodesize, importance, nPerm, proximity, oob.prox, corr.bias), see the randomForest documentation; they have the same meaning here.
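To make the defaults above concrete, the following sketch (illustrative values only, not package code) evaluates the mtry and sampsize expressions from the Usage section for a hypothetical data set with 10 predictors and 400 rows:

p <- 10               # hypothetical number of predictors, ncol(x)
n <- 400              # hypothetical number of rows, nrow(x)
max(floor(p/3), 1)    # default mtry for regression: 3
floor(sqrt(p))        # default mtry for classification: 3
n                     # default sampsize when replace=TRUE
ceiling(0.632 * n)    # default sampsize when replace=FALSE: 253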
Value

An object of class randomForest, which is a list with the following components:

call: the original call to randomForest.

type: one of regression, classification, or unsupervised.

importance: a matrix with nclass + 2 (for classification) or two (for regression) columns. For classification, the first nclass columns are the class-specific measures computed as mean decrease in accuracy. The nclass + 1st column is the mean decrease in accuracy over all classes. The last column is the mean decrease in Gini index. For regression, the first column is the mean decrease in accuracy and the second the mean decrease in MSE. If importance=FALSE, the last measure is still returned as a vector.

importanceSD: the "standard errors" of the permutation-based importance measure. For classification, a p by nclass + 1 matrix corresponding to the first nclass + 1 columns of the importance matrix. For regression, a length p vector.

localImp: a p by n matrix containing the casewise importance measures, the [i, j] element of which is the importance of the i-th variable on the j-th case. NULL if localImp=FALSE.

forest: a list that contains the entire forest; NULL if randomForest is run in unsupervised mode or if keep.forest=FALSE.

proximity: if proximity=TRUE when randomForest is called, a matrix of proximity measures among the input (based on the frequency that pairs of data points are in the same terminal nodes).

mse: (regression only) vector of mean square errors: sum of squared residuals divided by n.

rsq: (regression only) "pseudo R-squared": 1 - mse / Var(y).

test: if a test set is given (through the xtest or additionally ytest arguments), this component is a list which contains the corresponding predicted, err.rate, confusion, votes (for classification) or predicted, mse and rsq (for regression) for the test set. If proximity=TRUE, there is also a component, proximity, which contains the proximity among the test set as well as proximity between test and training data.

The remaining components (e.g., predicted, err.rate, confusion, votes, oob.times, ntree, mtry) are as documented for randomForest.

inbagCount: a matrix with inbag counts (rows) for each tree (columns). This count is not returned in the randomForest function's output, although it is implemented.

Details

cinbag is built on the randomForest
package. The purpose of the cinbag
function is to augment the randomForest
function so that it returns inbag counts. These counts are necessary for computing and ensembling the trees' empirical cumulative distribution functions.

References

Breiman L (2002). Manual on setting up, using, and understanding random forests V3.1. http://oz.berkeley.edu/users/breiman/Using_random_forests_V3.1.pdf.
See Also

trimTrees, hitRate
Examples

library(trimTrees)  # provides cinbag
library(mlbench)    # provides mlbench.friedman1

# Load the data
set.seed(201)  # Can be removed; useful for replication
data <- as.data.frame(mlbench.friedman1(500, sd=1))
summary(data)

# Prepare data for trimming
train <- data[1:400, ]
test <- data[401:500, ]
xtrain <- train[, -11]
ytrain <- train[, 11]
xtest <- test[, -11]
ytest <- test[, 11]

# Run cinbag
set.seed(201)  # Can be removed; useful for replication
rf <- cinbag(xtrain, ytrain, ntree=500, nodesize=5, mtry=3, keep.inbag=TRUE)
rf$inbag[, 1]       # First tree's inbag indicators
rf$inbagCount[, 1]  # First tree's inbag counts
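As noted in Details, the inbag counts feed the trees' empirical cumulative distribution functions. The lines below are a toy illustration (not the trimTrees algorithm itself) of how one tree's counts could weight an empirical CDF of the training response, continuing from the example above:

# Toy illustration only: a count-weighted empirical CDF for the first tree's bag
w1 <- rf$inbagCount[, 1]                 # times each training row appears in tree 1's bag
ecdf1 <- ecdf(rep(ytrain, times = w1))   # responses replicated by their inbag counts
ecdf1(median(ytrain))                    # e.g., estimated P(y <= median(ytrain)) in tree 1's bag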