bagging: Bagging Classification, Regression and Survival Trees

Description

Bagging for classification, regression and survival trees.

Usage

ipredbagg.factor(y, X=NULL, nbagg=25, control=
                 rpart.control(minsplit=2, cp=0, xval=0), 
                 comb=NULL, coob=FALSE, ns=length(y), keepX = TRUE, ...)
ipredbagg.numeric(y, X=NULL, nbagg=25, control=rpart.control(xval=0), 
                  comb=NULL, coob=FALSE, ns=length(y), keepX = TRUE, ...)
ipredbagg.Surv(y, X=NULL, nbagg=25, control=rpart.control(xval=0), 
               comb=NULL, coob=FALSE, ns=dim(y)[1], keepX = TRUE, ...)
## S3 method for class 'data.frame':
bagging(formula, data, subset, na.action=na.rpart, ...)

Arguments

y

the response variable: either a factor vector of class labels (bagging classification trees), a vector of numerical values (bagging regression trees) or an object of class Surv (bagging survival trees).

X

a data frame of predictor variables.

nbagg

an integer giving the number of bootstrap replications.

coob

a logical indicating whether an out-of-bag estimate of the error rate (misclassification error, root mean squared error or Brier score) should be computed. See pred

control

options that control details of the rpart algorithm, see rpart.control. It is wise to set xval = 0 in order to save computing time.

comb

a list of additional models for model combination, see below for some examples. Note that argument method for double-bagging is no longer there, comb is much more flexible.

ns

number of samples to draw from the learning sample. By default, the usual bootstrap n out of n with replacement is performed. If ns is smaller than length(y), subagging (Buehlmann and Yu, 2002) is performed.

keepX

a logical indicating whether the data frame of predictors should be returned. Note that the computation of the out-of-bag estimator requires keepX=TRUE.

formula

a formula of the form lhs ~ rhs where lhs is the response variable and rhs a set of predictors.

data

optional data frame containing the variables in the model formula.

subset

optional vector specifying a subset of observations to be used.

na.action

function which indicates what should happen when the data contain NAs. Defaults to na.rpart.

...

additional parameters passed to ipredbagg or rpart, respectively.

Value

The class of the object returned depends on class(y): classbagg, regbagg and survbagg. Each is a list with elements
y

the vector of responses.

X

the data frame of predictors.

mtrees

multiple trees: a list of length nbagg containing the trees (and possibly additional objects) for each bootstrap sample.

OOB

logical whether the out-of-bag estimate should be computed.

err

if OOB=TRUE, the out-of-bag estimate of misclassification or root mean squared error or the Brier score for censored data.

comb

logical whether a combination of models was requested.

For each class, methods for the generics prune, print, summary and predict are available for inspecting the results and for prediction, for example print.classbagg, summary.classbagg, predict.classbagg and prune.classbagg for classification problems.

Details

Bagging for classification and regression trees was suggested by Breiman (1996a, 1998) in order to stabilise trees.

The trees in this function are computed using the implementation in the rpart package. The generic function ipredbagg implements methods for different responses. If y is a factor, classification trees are constructed. For numerical vectors y, regression trees are aggregated and if y is a survival object, bagging survival trees (Hothorn et al, 2002) is performed. The function bagging offers a formula based interface to ipredbagg.

nbagg bootstrap samples are drawn and a tree is constructed for each of them. There is no general rule for when to stop growing the trees. The size of the trees can be controlled by the control argument or by prune.classbagg. By default, classification trees are as large as possible, whereas regression trees and survival trees are built with the standard options of rpart.control. If nbagg=1, one single tree is computed for the whole learning sample without bootstrapping.
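The two resampling schemes controlled by ns can be illustrated in base R. This is only a sketch of the sampling step, not the package's internal code: the variable names are hypothetical, and drawing the subsample without replacement follows the usual definition of subagging, which is assumed here rather than taken from the package source.

```r
set.seed(1)

n  <- 20   # size of the learning sample (hypothetical)
ns <- 10   # subsample size, smaller than n

# Usual bootstrap: n out of n, with replacement
boot <- sample(n, n, replace = TRUE)

# Observations never drawn form the out-of-bag sample for this replication
oob <- setdiff(seq_len(n), boot)

# Subagging: ns out of n; drawn without replacement here (an assumption)
sub <- sample(n, ns, replace = FALSE)

length(unique(boot))  # number of distinct observations in the bootstrap sample
length(oob)           # number of out-of-bag observations
```

Each bootstrap sample leaves out roughly a third of the observations on average, which is what makes the out-of-bag sample usable for error estimation or model combination.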

If coob is TRUE, the out-of-bag sample (Breiman, 1996b) is used to estimate the prediction error corresponding to class(y). Alternatively, the out-of-bag sample can be used for model combination; an out-of-bag error rate estimator is not available in this case.

Double-bagging (Hothorn and Lausen, 2003) computes an LDA on the out-of-bag sample and uses the discriminant variables as additional predictors for the classification trees. comb is an optional list of lists, each with two elements, model and predict. model is a function with arguments formula and data. predict is a function with arguments object and newdata only. If the estimation of the covariance matrix in lda fails due to a limited out-of-bag sample size, one can use slda instead. See the example section for an example of double-bagging.

The methodology is not limited to a combination with LDA: bundling (Hothorn and Lausen, 2002b) can be used with arbitrary classifiers.
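The idea behind an out-of-bag misclassification estimate can be sketched with rpart alone. The following is a simplified illustration under stated assumptions, not the code used by ipredbagg: it uses the iris data as a stand-in learning sample, aggregates out-of-bag predictions by majority vote, and all object names are hypothetical.

```r
library(rpart)

set.seed(290875)
n     <- nrow(iris)
nbagg <- 25
# Large classification trees, no internal cross-validation
ctrl  <- rpart.control(minsplit = 2, cp = 0, xval = 0)

# votes[i, k]: number of trees for which observation i was out-of-bag
# and was predicted to belong to class k
votes <- matrix(0, nrow = n, ncol = nlevels(iris$Species),
                dimnames = list(NULL, levels(iris$Species)))

for (b in seq_len(nbagg)) {
  boot <- sample(n, n, replace = TRUE)   # bootstrap sample
  oob  <- setdiff(seq_len(n), boot)      # out-of-bag observations
  tree <- rpart(Species ~ ., data = iris[boot, ], control = ctrl)
  pred <- as.character(predict(tree, newdata = iris[oob, ], type = "class"))
  idx  <- cbind(oob, match(pred, colnames(votes)))
  votes[idx] <- votes[idx] + 1
}

# Majority vote over the trees that did not see the observation
seen      <- rowSums(votes) > 0
oob_class <- colnames(votes)[max.col(votes[seen, , drop = FALSE],
                                     ties.method = "first")]
oob_err   <- mean(oob_class != as.character(iris$Species[seen]))
oob_err
```

Because every prediction comes from trees that were fitted without the observation in question, the resulting error rate does not require a separate test set.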

References

Leo Breiman (1996a), Bagging Predictors. Machine Learning 24(2), 123--140.

Leo Breiman (1996b), Out-Of-Bag Estimation. Technical Report ftp://ftp.stat.berkeley.edu/pub/users/breiman/OOBestimation.ps.Z.

Leo Breiman (1998), Arcing Classifiers. The Annals of Statistics 26(3), 801--824.

Peter Buehlmann and Bin Yu (2002), Analyzing Bagging. The Annals of Statistics 30(4), 927--961.

Torsten Hothorn and Berthold Lausen (2003), Double-Bagging: Combining classifiers by bootstrap aggregation. Pattern Recognition, 36(6), 1303--1309.

Torsten Hothorn and Berthold Lausen (2002b), Bundling Classifiers by Bagging Trees. submitted. Preprint available from http://www.mathpreprints.com/math/Preprint/blausen/20021016/1.

Torsten Hothorn, Berthold Lausen, Axel Benner and Martin Radespiel-Troeger (2002), Bagging Survival Trees. submitted. Preprint available from http://www.mathpreprints.com/math/Preprint/blausen/20020518/2.

Examples

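# Packages used below: ipred (bagging), mlbench (benchmark data),
# MASS (lda) and survival (Surv)

library(ipred)
library(mlbench)
library(MASS)
library(survival)

# Classification: Breast Cancer data

data(BreastCancer)

# Test set error bagging (nbagg = 50): 3.7% (Breiman, 1998, Table 5)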

mod <- bagging(Class ~ Cl.thickness + Cell.size
                + Cell.shape + Marg.adhesion   
                + Epith.c.size + Bare.nuclei   
                + Bl.cromatin + Normal.nucleoli
                + Mitoses, data=BreastCancer, coob=TRUE)
print(mod)

# Test set error bagging (nbagg=50): 7.9% (Breiman, 1996a, Table 2)

data(Ionosphere)
Ionosphere$V2 <- NULL # constant within groups

bagging(Class ~ ., data=Ionosphere, coob=TRUE)

# Double-Bagging: combine LDA and classification trees

# predict returns the linear discriminant values, i.e. linear combinations
# of the original predictors

comb.lda <- list(list(model=lda, predict=function(obj, newdata)
                                 predict(obj, newdata)$x))

# Note: out-of-bag estimator is not available in this situation, use
# errorest

mod <- bagging(Class ~ ., data=Ionosphere, comb=comb.lda) 

predict(mod, Ionosphere[1:10,])

# Regression:


data(BostonHousing)

# Test set error (nbagg=25, trees pruned): 3.41 (Breiman, 1996a, Table 8)

mod <- bagging(medv ~ ., data=BostonHousing, coob=TRUE)
print(mod)

learn <- as.data.frame(mlbench.friedman1(200))

# Test set error (nbagg=25, trees pruned): 2.47 (Breiman, 1996a, Table 8)

mod <- bagging(y ~ ., data=learn, coob=TRUE)
print(mod)

# Survival data

# Brier score for censored data estimated by 
# 10 times 10-fold cross-validation: 0.2 (Hothorn et al,
# 2002)

data(DLBCL)
mod <- bagging(Surv(time,cens) ~ MGEc.1 + MGEc.2 + MGEc.3 + MGEc.4 + MGEc.5 +
                                 MGEc.6 + MGEc.7 + MGEc.8 + MGEc.9 +
                                 MGEc.10 + IPI, data=DLBCL, coob=TRUE)

print(mod)
