Build a random causal forest by fitting a user-selected number of causalTree models to get an ensemble of rpart objects.
init.causalForest(
formula,
data,
treatment,
weights = FALSE,
cost = FALSE,
num.trees,
ncov_sample
)

# S3 method for causalForest
predict(object, newdata, predict.all = FALSE, type = "vector", ...)

causalForest(
formula,
data,
treatment,
na.action = na.causalTree,
split.Rule = "CT",
double.Sample = TRUE,
split.Honest = TRUE,
split.Bucket = FALSE,
bucketNum = 5,
bucketMax = 100,
cv.option = "CT",
cv.Honest = TRUE,
minsize = 2L,
propensity,
control,
split.alpha = 0.5,
cv.alpha = 0.5,
sample.size.total = floor(nrow(data)/10),
sample.size.train.frac = 0.5,
mtry = ceiling(ncol(data)/3),
nodesize = 1,
num.trees = nrow(data),
cost = FALSE,
weights = FALSE,
ncolx,
ncov_sample
)
An object of class rpart. See rpart.object.
a formula, with a response and features but no interaction terms. If this is a data frame, it is taken as the model frame (see model.frame).
an optional data frame that includes the variables named in the formula.
a vector that indicates the treatment status of each observation: 1 represents treated and 0 represents control. Only binary treatment is supported in this version.
optional case weights.
a vector of non-negative costs, one for each variable in the model. Defaults to one for all variables. These are scalings to be applied when considering splits, so the improvement on splitting on a variable is divided by its cost in deciding which split to choose.
Number of trees to be built in the causal forest
Number of covariates randomly sampled to build each tree in the forest
a causalTree object
new data to predict
If TRUE, return the prediction of each individual tree for each observation. Otherwise, return the average over all trees.
the type of returned object
arguments to rpart.control may also be specified in the call to causalForest. They are checked against the list of valid arguments. The parameter minsize is implemented differently in causalTree than in rpart; we require a minimum of minsize treated observations and a minimum of minsize control observations in each leaf.
the default action deletes all observations for which y is missing, but keeps those in which one or more predictors are missing.
causalTree splitting options, one of "TOT", "CT", "fit", "tstats": the four splitting rules in causalTree. Note that the "tstats" alternative does not have an associated cross-validation method (cv.option); see Athey and Imbens (2016) for a discussion. Note further that split.Rule and cv.option can be mixed and matched.
boolean option, TRUE or FALSE. If set to TRUE, causalForest will build honest trees.
boolean option, TRUE or FALSE, used to decide the splitting rule of the trees.
boolean option, TRUE or FALSE, used to specify whether to apply the discrete method in splitting the tree. If set to TRUE, when splitting a node the observations in a leaf will be partitioned into buckets, with each bucket containing bucketNum treated and bucketNum control units, and where observations are ordered prior to partitioning. Splitting will take place by bucket.
number of observations in each bucket when split.Bucket = TRUE. However, the code will override this choice in order to guarantee that there are at least minsize and at most bucketMax buckets.
option to choose the maximum number of buckets to use in splitting when split.Bucket = TRUE. bucketNum can change depending on the choice of bucketMax.
cross-validation options, one of "TOT", "matching", "CT", "fit": the four cross-validation methods in causalTree. There is no cv.option for the split.Rule "tstats"; see Athey and Imbens (2016) for discussion.
boolean option, TRUE or FALSE, only used when cv.option is "CT" or "fit", to specify whether to apply the honest risk evaluation function in cross-validation. If set to TRUE, the honest risk function is used; otherwise the adaptive risk function is used. If set to FALSE, the user's choice of cv.alpha will be set to 1. If set to TRUE, cv.alpha will default to 0.5, but the user's choice of cv.alpha will be respected. Note that honest cv estimates within-leaf variances and may perform better with larger leaf sizes and/or a small number of cross-validation sets.
in order to split, each leaf must have at least minsize treated cases and minsize control cases. The default value is 2.
propensity score used in "TOT" splitting and in the "TOT" and honest "CT" cross-validation methods. The default value is the proportion of treated cases among all observations. In this implementation, the propensity score is a constant for the whole dataset. Unit-specific propensity scores are not supported; however, the user may use inverse propensity scores as case weights if desired.
a list of options that control details of the rpart algorithm. See rpart.control.
scale parameter between 0 and 1, used in the splitting risk evaluation function for "CT". When split.Honest = FALSE, split.alpha will be set to 1. For split.Rule = "tstats", if split.Honest = TRUE, split.alpha is used in calculating the risk function, which determines the order of pruning in cross-validation.
scale parameter between 0 and 1, used in the cross-validation risk evaluation function for "CT" and "fit". When cv.Honest = FALSE, cv.alpha will be set to 1.
Sample size used to build each tree in the forest (sampled randomly with replacement).
Fraction of the sample size used for building each tree (training). For example, if sample.size.total is 1000 and sample.size.train.frac = 0.5, then 500 samples will be used to build the tree and the other 500 samples will be used to evaluate the tree.
Number of data features used to build a tree (This variable is not used presently).
Minimum number of observations for treated and control cases in one leaf node
Total number of covariates
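The interplay of sample.size.total, sample.size.train.frac, ncolx and ncov_sample can be sketched in base R as follows. This is a minimal illustration of the per-tree resampling described in the argument list, not the package's internal code; the object names n, full.idx, train.idx, est.idx and cov.idx are hypothetical, and drawing rows with replacement simply follows the description of sample.size.total.

```r
# Hypothetical illustration in base R; not the package's internal code.
set.seed(1)

n <- 5000                      # rows in a made-up data set
ncolx <- 10                    # total number of covariates
ncov_sample <- 3               # covariates drawn for each tree
sample.size.total <- 1000      # rows drawn for each tree
sample.size.train.frac <- 0.5  # share of those rows used to build the tree

# Draw the rows for one tree (with replacement, as the description states).
full.idx <- sample.int(n, sample.size.total, replace = TRUE)

# Split the draw into a tree-building part and an evaluation part.
train.size <- round(sample.size.train.frac * sample.size.total)
train.idx <- full.idx[seq_len(train.size)]
est.idx <- full.idx[-seq_len(train.size)]

# Draw the covariate columns this tree may split on.
cov.idx <- sort(sample.int(ncolx, ncov_sample))

length(train.idx)  # 500 rows to build the tree
length(est.idx)    # 500 rows to evaluate it
```

Repeating these draws num.trees times gives each tree its own rows and its own covariate subset.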
CausalForest builds an ensemble of CausalTrees (See Athey and Imbens,
Recursive Partitioning for Heterogeneous Causal
Effects (2016)), by repeated random sampling of the data with replacement.
Further, each tree is built using a randomly sampled subset of all available
covariates. A causal forest object is a list of trees. To predict, call R's
predict function with new test data and the causalForest object (estimated
on the training data) obtained after calling the causalForest function.
During the prediction phase, the average value over all tree predictions
is returned as the final prediction by default.
To return the predictions of each tree in the forest for each test
observation, set the flag predict.all = TRUE.
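The fit-then-predict workflow described above might look like the following sketch. The simulated data and the names dat, w, newdat, forest, tau.hat and tau.all are made up for illustration, and the argument values are arbitrary choices rather than recommendations. The call is guarded so that the script still runs when the causalTree package is not installed; the exact structure of the value returned with predict.all = TRUE is not shown here because this page does not specify it.

```r
# Simulated data; all object names here are hypothetical.
set.seed(42)
n <- 2000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
w <- rbinom(n, 1, 0.5)  # binary treatment: 1 = treated, 0 = control
# The outcome has a treatment effect of 1 when x2 <= 0 and 2 when x2 > 0.
dat$y <- dat$x1 + w * (1 + (dat$x2 > 0)) + rnorm(n)

if (requireNamespace("causalTree", quietly = TRUE)) {
  library(causalTree)

  # Fit a small forest; argument names follow the usage section of this page.
  forest <- causalForest(
    y ~ x1 + x2, data = dat, treatment = w,
    split.Rule = "CT", split.Honest = TRUE, double.Sample = TRUE,
    split.Bucket = FALSE, cv.option = "CT", cv.Honest = TRUE,
    minsize = 2L, split.alpha = 0.5, cv.alpha = 0.5,
    sample.size.total = floor(n / 2), sample.size.train.frac = 0.5,
    num.trees = 5, ncolx = 2, ncov_sample = 2
  )

  newdat <- data.frame(x1 = c(-1, 1), x2 = c(-1, 1))

  # Default: the average prediction over all trees.
  tau.hat <- predict(forest, newdata = newdat)

  # Per-tree predictions for each test observation.
  tau.all <- predict(forest, newdata = newdat, predict.all = TRUE)
}
```

A small num.trees keeps the sketch quick to run; a real analysis would normally use many more trees.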
causalTree differs from the rpart function in the rpart package in its
splitting rules and cross-validation methods. See Athey and Imbens,
Recursive Partitioning for Heterogeneous Causal Effects (2016), and
Wager and Athey, Estimation and Inference of Heterogeneous Treatment
Effects using Random Forests (2015), for more details.
Breiman L., Friedman J. H., Olshen R. A., and Stone, C. J. (1984) Classification and Regression Trees. Wadsworth.
Athey, S. and Imbens, G. (2016) Recursive Partitioning for Heterogeneous Causal Effects. http://arxiv.org/abs/1504.01132
Wager, S. and Athey, S. (2015) Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. http://arxiv.org/abs/1510.04342
causalTree, honest.causalTree, rpart.control, rpart.object, summary.rpart, rpart.plot