Learn R Programming

rbart (version 1.0)

rbart: Fitting Bayesian Regression Tree models supported by rbart.

Description

The function rbart() is the main function for fitting Bayesian Regression Tree models, including single-tree models, Bayesian Additive Regression Tree (BART) models and Heteroscedastic BART models. rbart() maintains some degree of backwards compatibility with BayesTree::bart(), while offering many new options.

Usage

rbart(x.train, y.train, x.test=matrix(0.0,0,0), ntree=200, ntreeh=40, ndpost=1000, 
     nskip=100, k=2, power=2.0, base=.95, tc=1, sigmav=rep(1,length(y.train)), 
     fmean=mean(y.train), overallsd = sd(y.train), overallnu=10, 
     chv = cor(x.train,method="spearman"), pbd=.7, pb=.5, stepwpert=.1, probchv=.1,
     minnumbot=5, printevery=100, numcut=100, xicuts=NULL, nadapt=1000, adaptevery=100,
     summarystats=FALSE)

Arguments

x.train

nxp matrix of predictor variables for the training data.

y.train

nx1 vector of the observed response for the training data.

x.test

mxp matrix of predictor variables for the test set. Deprecated.

ntree

Number of trees used for the mean model.

ntreeh

Number of trees used for the variance model.

ndpost

Number of iterations to run the MCMC algorithm after burn-in.

nskip

Number of MCMC iterations treated as burn-in and discarded.

k

Prior hyperparameter for the mean model.

power

Power parameter in the tree depth penalizing prior.

base

Base parameter in the tree depth penalizing prior.

tc

Number of OpenMP threads to use.

sigmav

Initialization of square-root of variance parameter.

fmean

Overall mean of the data for pre-centering the data before running the model.

overallsd

A rudimentary estimate of the process standard deviation. Used in calibrating the variance prior.

overallnu

Shape parameter for the variance prior.

chv

Predictor correlation matrix used as a pre-conditioner for MCMC change-of-variable proposals.

pbd

Probability of performing a birth/death proposal, otherwise perform a rotate proopsal.

pb

Probability of performing a birth proposal given that we choose to perform a birth/death proposal.

stepwpert

Initial width of proposal distribution for peturbing cutpoints.

probchv

Probability of performing a change-of-variable proposal. Otherwise, only do a perturb proposal

minnumbot

Minimum number of observations required in bottom (terminal) nodes.

printevery

Outputs MCMC algorithm status every printevery iterations.

numcut

Number of cutpoints to use for each predictor variable.

xicuts

More detailed construction of cutpoints can be specified using makecuts() and passed as an argument here.

nadapt

Number of MCMC iterations allowed for adaptive MCMC. These are also discarded.

adaptevery

Adapt MCMC proposal distributions every adaptevery iterations until the algorithm has run for nadapt iterations.

summarystats

Return detailed summary statistics about the fitting procedure.

Value

res

Fitted model object of S3 class rbart.

Details

rbart() is the main model fitting function for continuous response data. The most general form of the model allowed is \(Y({\bf x})=f({\bf x})+s({\bf x})Z\) where \(Z\) is \(N(0,1)\) and \(f({\bf x})=\sum_{j=1}^m g({\bf x};T_j,M_j)\) and \(s({\bf x})=\prod_{j=1}^{m^\prime} h({\bf x};T^\prime_j,M^\prime_j)\), where the \(g(\cdot;T_j,M_j)\) represent additive tree components used for modeling the mean and \(h(\cdot;T^\prime_j,M^\prime_j)\) represent multiplicative tree components used for modeling the variance.

The most common models to fit are a homoscedastic single-tree model, a homoscedastic BART model and a heteroscedastic BART model.

For a BART model, set pbd=c(0.7,0.0) and ntreeh=1. This forces a scalar (homoscedastic) variance term.

For a single-tree model, set pbd=(0.7,0.0), ntreeh=1 and ntree=1. This forces the mean component to be modeled using only one tree.

The heteroscedastic BART model is the default.

References

Chipman, Hugh A., George, Edward I., and McCulloch, Robert E. (1998) Bayesian CART model search. Journal of the American Statistical Association, 93, 935--948.

Chipman, Hugh A., George, Edward I., and McCulloch, Robert E. (2010) BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4, 266--298.

Pratola, Matthew T. (2016) Efficient Metropolis Hastings proposal mechanisms for Bayesian regression tree models. Bayesian analysis, 11, 885--911.

Pratola, Matthew T., Chipman, Hugh A., George, Edward I. and McCulloch, Robert E. (2017) Heteroscedastic BART Using Multiplicative Regression Trees. arXiv preprint, arXiv:1709.07542, 1--20.

See Also

predict.rbart

Examples

Run this code
# NOT RUN {
##################################################
## This is just a stub (runs fast) example for testing.
##  For more realistic examples, please see:
##   (i) the vignette at www.rob-mcculloch.org
##   (ii) the example simulated data (see ?simdat)
##        and the longer run in ?rbartonsimd, 
##        where a saved run of rbart is run on simdat is plotted. 
##################################################

##simulate data
set.seed(99)

# train data
n=500 #train data sample size
p=1 #just one x
x = matrix(sort(runif(n*p)),ncol=p) #iid uniform x values
fx = 4*(x[,1]^2) #quadratric function f
sx = .2*exp(2*x[,1]) # exponential function s
y = fx + sx*rnorm(n) # y = f(x) + s(x) Z

#test data (the p added to the variable names is for predict)
np=500 #test data sample size
xp = matrix(sort(runif(np*p)),ncol=p)
fxp = 4*(xp[,1]^2)
sxp = .2*exp(2*xp[,1])
yp = fxp + sxp*rnorm(np)

##run rbart MCMC
# The number of interations is kept small to make example run,
##!!!!  REAL APPLICATIONS MAY NEED LONGER RUNS !!!!
#   nskip: burn in draws,
#   ndpost:kept draws,
#   nadapt: initial draws to tune MCMC,
#   numcut: number of cutpoints used for each x
#   k: bigger k gives smoother f (default is 2)
set.seed(19)
res = rbart(x,y,nskip=10,ndpost=20,nadapt=0,numcut=1000,k=5) #again, this is way too short a run!!!
## now predict to get inference
resp = predict(res,x.test=xp)

##check out of sample fit
cat("out of sample cor(f,fhat) is ",cor(fxp,resp$mmean),"\n")
cat("out of sample cor(s,shat) is ",cor(sxp,resp$smean),"\n")

##plot estimated vs. true
##plot the data
plot(xp,yp,cex.axis=1.5,cex.lab=1.5)
lines(xp,fxp,col="blue")
lines(xp,fx+2*sxp,col="blue",lty=2)
lines(xp,fxp-2*sxp,col="blue",lty=2)

## add the fit
lines(xp,resp$mmean) #estimate of f
lines(xp,resp$mmean+2*resp$smean) #estimate of sd
lines(xp,resp$mmean-2*resp$smean) #estimate of sd
# }

Run the code above in your browser using DataLab