Fit a causalTree model to get an honest causal tree, with the tree structure built on the training sample (including cross-validation) and the leaf estimates taken from the estimation sample. Returns an rpart object.
honest.causalTree(
formula,
data,
weights,
treatment,
subset,
est_data,
est_weights,
est_treatment,
est_subset,
na.action = na.causalTree,
split.Rule,
split.Honest,
HonestSampleSize,
split.Bucket,
bucketNum = 10,
bucketMax = 40,
cv.option,
cv.Honest,
minsize = 2L,
model = FALSE,
x = FALSE,
y = TRUE,
propensity,
control,
split.alpha = 0.5,
cv.alpha = 0.5,
cv.gamma = 0.5,
split.gamma = 0.5,
cost,
...
)
An object of class rpart. See rpart.object.
a formula, with a response and features but no interaction terms. If this is a data frame, it is taken as the model frame (see model.frame).
an optional data frame that includes the variables named in the formula.
optional case weights.
a vector that indicates the treatment status of each observation: 1 represents treated and 0 represents control. Only binary treatment is supported in this version.
optional expression saying that only a subset of the rows of the data should be used in the fit.
data frame to be used for leaf estimates; the estimation sample. Must contain the variables used in training the tree.
optional case weights for estimation sample
treatment vector for the estimation sample. Must be the same length as the estimation data. It indicates the treatment status of each observation: 1 represents treated and 0 represents control. Only binary treatment is supported in this version.
optional expression saying that only a subset of the rows of the estimation data should be used in the fit of the re-estimated tree.
the default action deletes all observations for which y is missing, but keeps those in which one or more predictors are missing.
causalTree splitting rule, one of "TOT", "CT", "fit", or "tstats", the four splitting rules in causalTree. Note that the "tstats" alternative does not have an associated cross-validation method cv.option; see Athey and Imbens (2016) for a discussion. Note further that split.Rule and cv.option can be mixed and matched.
boolean option, TRUE or FALSE, used when split.Rule is "CT" or "fit". If set to TRUE, do honest splitting, with default split.alpha = 0.5; if set to FALSE, do adaptive splitting with split.alpha = 1. The user's choice of split.alpha is ignored if split.Honest is FALSE, but is respected if it is TRUE. For split.Rule = "TOT", there is no honest splitting option and the parameter split.alpha does not matter. For split.Rule = "tstats", a value of TRUE enables use of split.alpha in calculating the risk function, which determines the order of pruning in cross-validation. Note also that the causalTree function returns the estimates from the training data, no matter what the value of split.Honest is; the tree must be re-estimated with estimate.causalTree to get the honest estimates. The wrapper function honest.causalTree does honest estimation in one step and returns a tree.
number of observations anticipated to be used in honest re-estimation after building the tree. This enters the risk function used in both splitting and cross-validation.
boolean option, TRUE or FALSE, used to specify whether to apply the discrete method in splitting the tree. If set to TRUE, in splitting a node, the observations in a leaf will be partitioned into buckets, with each bucket containing bucketNum treated and bucketNum control units, and where observations are ordered prior to partitioning. Splitting will take place by bucket.
number of observations in each bucket when split.Bucket = TRUE. However, the code will override this choice in order to guarantee that there are at least minsize and at most bucketMax buckets.
maximum number of buckets to use in splitting when split.Bucket = TRUE. The effective bucketNum can change depending on the choice of bucketMax.
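The interaction between bucketNum, bucketMax, and minsize can be pictured as a clamp on the number of buckets. The sketch below is only an illustration: effective_buckets is a hypothetical helper, not part of the causalTree API, and the formula for the candidate bucket count (the smaller arm's size divided by bucketNum) is an assumption that this page does not spell out.

```r
# Hypothetical helper (not part of causalTree): shows how a requested bucket
# size could be overridden so that the number of buckets stays within
# [minsize, bucketMax]. The candidate-count formula is an assumption.
effective_buckets <- function(n_treated, n_control,
                              bucketNum = 10, bucketMax = 40, minsize = 2L) {
  # Buckets implied by the requested bucket size, limited by the smaller arm.
  candidate <- min(round(n_treated / bucketNum), round(n_control / bucketNum))
  # Clamp: at least minsize buckets, at most bucketMax buckets.
  max(minsize, min(candidate, bucketMax))
}

effective_buckets(1000, 1000)  # many units: capped at bucketMax
effective_buckets(14, 12)      # few units: raised to minsize
effective_buckets(200, 150)    # in between: uses the implied count
```

Under these assumptions, the number of observations per bucket follows from the clamped bucket count rather than from bucketNum alone.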
cross-validation options, one of "TOT", "matching", "CT", or "fit", the four cross-validation methods in causalTree. There is no cv.option for the split.Rule "tstats"; see Athey and Imbens (2016) for discussion.
boolean option, TRUE or FALSE, only used when cv.option is "CT" or "fit", to specify whether to apply the honest risk evaluation function in cross-validation. If set to TRUE, use the honest risk function; otherwise use the adaptive risk function in cross-validation. If set to FALSE, the user's choice of cv.alpha will be overridden and set to 1. If set to TRUE, cv.alpha will default to 0.5, but the user's choice of cv.alpha will be respected. Note that honest cross-validation estimates within-leaf variances and may perform better with larger leaf sizes and/or a small number of cross-validation sets.
in order to split, each leaf must have at least minsize treated cases and minsize control cases. The default value is 2.
model frame of causalTree, same as in rpart.
keep a copy of the x matrix in the result.
keep a copy of the dependent variable in the result. If missing and model is supplied, this defaults to FALSE.
propensity score used in "TOT" splitting and in the "TOT" and honest "CT" cross-validation methods. The default value is the proportion of treated cases in all observations. In this implementation, the propensity score is a constant for the whole dataset. Unit-specific propensity scores are not supported; however, the user may use inverse propensity scores as case weights if desired.
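As an illustration of using inverse propensity scores as case weights, the sketch below estimates unit-level propensities with a logistic regression in base R and turns them into weights. The simulated data, the variable names, and the choice of a logistic model are all assumptions made for the example; this page does not prescribe how the propensity scores should be estimated.

```r
# Simulated data for illustration only; all variable names are hypothetical.
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
# Treatment probability depends on x1, so the propensity varies by unit.
w  <- rbinom(n, 1, plogis(0.5 * x1))
y  <- x1 + w * (1 + x2) + rnorm(n)
df <- data.frame(y = y, x1 = x1, x2 = x2, w = w)

# Estimate unit-level propensity scores with a logistic regression.
ps_fit <- glm(w ~ x1 + x2, family = binomial(), data = df)
e_hat  <- predict(ps_fit, type = "response")

# Inverse propensity weights: 1 / e for treated units, 1 / (1 - e) for controls.
ipw <- ifelse(df$w == 1, 1 / e_hat, 1 / (1 - e_hat))
summary(ipw)
```

A vector like ipw could then be supplied as the weights argument, with an analogous vector computed on the estimation sample supplied as est_weights.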
a list of options that control details of the rpart algorithm. See rpart.control.
scale parameter between 0 and 1, used in the splitting risk evaluation function for "CT". When split.Honest = FALSE, split.alpha will be set to 1. For split.Rule = "tstats", if split.Honest = TRUE, split.alpha is used in calculating the risk function, which determines the order of pruning in cross-validation.
scale parameter between 0 and 1, used in the cross-validation risk evaluation function for "CT" and "fit". When cv.Honest = FALSE, cv.alpha will be set to 1.
optional parameters used in evaluating policies.
a vector of non-negative costs, one for each variable in the model. Defaults to one for all variables. These are scalings to be applied when considering splits, so the improvement on splitting on a variable is divided by its cost in deciding which split to choose.
arguments to rpart.control may also be specified in the call to causalTree. They are checked against the list of valid arguments. An example of a commonly set parameter would be xval, which sets the number of cross-validation samples.
The parameter minsize is implemented differently in causalTree than in rpart; we require a minimum of minsize treated observations and a minimum of minsize control observations in each leaf.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984) Classification and Regression Trees. Wadsworth.
Athey, S. and Imbens, G. (2016) Recursive Partitioning for Heterogeneous Causal Effects. http://arxiv.org/abs/1504.01132
causalTree, estimate.causalTree, rpart.object, summary.rpart, rpart.plot
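A minimal usage sketch follows, assuming simulated data with hypothetical column names (x1, x2, treatment, y). It splits the data into a training sample and an estimation sample, stratified by treatment, and then calls honest.causalTree with the arguments listed in the usage above. The specific option values (minsize = 25, xval = 5, cp = 0) are illustrative choices, not recommendations, and the fit is attempted only if the causalTree package is installed.

```r
# Simulated data for illustration; variable names are hypothetical.
set.seed(42)
n   <- 1000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$treatment <- rbinom(n, 1, 0.5)
# True treatment effect is 2 when x1 > 0 and 0 otherwise.
dat$y <- dat$x2 + dat$treatment * 2 * (dat$x1 > 0) + rnorm(n)

# Split into training and estimation samples, stratified by treatment.
tr_idx    <- which(dat$treatment == 1)
con_idx   <- which(dat$treatment == 0)
train_idx <- c(sample(tr_idx, length(tr_idx) %/% 2),
               sample(con_idx, length(con_idx) %/% 2))
train_data <- dat[train_idx, ]
est_data   <- dat[-train_idx, ]

# Fit only if the causalTree package is available.
if (requireNamespace("causalTree", quietly = TRUE)) {
  library(causalTree)
  honest_tree <- honest.causalTree(
    y ~ x1 + x2,
    data = train_data,
    treatment = train_data$treatment,
    est_data = est_data,
    est_treatment = est_data$treatment,
    split.Rule = "CT", split.Honest = TRUE,
    HonestSampleSize = nrow(est_data),
    split.Bucket = FALSE,
    cv.option = "CT", cv.Honest = TRUE,
    minsize = 25, xval = 5, cp = 0
  )
  print(honest_tree)
}
```

Because the result is an rpart object, it can be inspected with the functions listed under the related topics, such as summary.rpart and rpart.plot.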