There are four ways to specify the distance
argument: 1) as a string containing the name of a method for
estimating propensity scores, 2) as a string containing the name of a method
for computing pairwise distances from the covariates, 3) as a vector of
values whose pairwise differences define the distance between units, or 4)
as a distance matrix containing all pairwise distances. The options are
detailed below.
Propensity score estimation methods
When distance
is specified as the name of a method for estimating propensity scores
(described below), a propensity score is estimated using the variables in
formula
and the method corresponding to the given argument. This
propensity score can be used to compute the distance between units as the
absolute difference between the propensity scores of pairs of units.
Propensity scores can also be used to create calipers and common support
restrictions, whether or not they are used in the actual distance measure
used in the matching, if any.
In addition to the distance
argument, two other arguments can be
specified that relate to the estimation and manipulation of the propensity
scores. The link
argument allows for different links to be used in
models that require them, such as generalized linear models, for which the
logit and probit links are allowed, among others. In addition to specifying
the link, the link
argument can be used to specify whether the
propensity score or the linearized version of the propensity score should be
used; by specifying link = "linear.{link}"
, the linearized version
will be used.
The distance.options
argument can also be specified, which should be
a list of values passed to the propensity score-estimating function, for
example, to choose specific options or tuning parameters for the estimation
method. If formula
, data
, or verbose
are not supplied
to distance.options
, the corresponding arguments from
matchit()
will be automatically supplied. See the Examples for
demonstrations of the uses of link
and distance.options
. When
s.weights
is supplied in the call to matchit()
, it will
automatically be passed to the propensity score-estimating function as the
weights
argument unless otherwise described below.
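As a minimal sketch (covariate choices are arbitrary), the lalonde dataset
shipped with MatchIt can be used to request the linearized probit propensity
score, with additional control passed to glm() through distance.options:

library("MatchIt")
data("lalonde", package = "MatchIt")

# Linearized propensity score from a probit model; extra arguments reach
# glm() through distance.options (the maxit value is illustrative)
m.lin <- matchit(treat ~ age + educ + race + married + re74 + re75,
                 data = lalonde, distance = "glm", link = "linear.probit",
                 distance.options = list(control = glm.control(maxit = 50)))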
The following methods for estimating propensity scores are allowed:
"glm"
The propensity scores are estimated using
a generalized linear model (e.g., logistic regression). The formula
supplied to matchit()
is passed directly to glm()
, and
predict.glm()
is used to compute the propensity scores. The link
argument can be specified as a link function supplied to binomial()
, e.g.,
"logit"
, which is the default. When link
is prepended by
"linear."
, the linear predictor is used instead of the predicted
probabilities. distance = "glm"
with link = "logit"
(logistic
regression) is the default in matchit()
. (Previously, this could be requested with distance = "ps", which still works.)
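For example, continuing the setup above (MatchIt loaded and lalonde in the
workspace), the default specification can be requested explicitly:

# Default: logistic regression propensity score with 1:1 nearest neighbor matching
m.glm <- matchit(treat ~ age + educ + race + married + re74 + re75,
                 data = lalonde, method = "nearest", distance = "glm")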
"gam"
The propensity scores are estimated using a generalized additive model. The
formula
supplied to matchit()
is passed directly to
mgcv::gam()
, and mgcv::predict.gam()
is used to compute the propensity
scores. The link
argument can be specified as a link function
supplied to binomial()
, e.g., "logit"
, which is the default. When
link
is prepended by "linear."
, the linear predictor is used
instead of the predicted probabilities. Note that unless the smoothing
functions mgcv::s()
, mgcv::te()
, mgcv::ti()
, or mgcv::t2()
are
used in formula
, a generalized additive model is identical to a
generalized linear model and will estimate the same propensity scores as
glm()
. See the documentation for mgcv::gam()
,
mgcv::formula.gam()
, and mgcv::gam.models()
for more information on
how to specify these models. Also note that the formula returned in the
matchit()
output object will be a simplified version of the supplied
formula with smoothing terms removed (but all named variables present).
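A possible sketch, assuming mgcv is installed and continuing the setup above;
which terms to smooth is purely illustrative:

# GAM propensity score with smooth terms for two continuous covariates
m.gam <- matchit(treat ~ s(age) + s(re74) + educ + race + married,
                 data = lalonde, distance = "gam")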
"gbm"
The propensity scores are estimated using a
generalized boosted model. The formula
supplied to matchit()
is passed directly to gbm::gbm()
, and gbm::predict.gbm()
is used to
compute the propensity scores. The optimal tree is chosen using 5-fold
cross-validation by default, and this can be changed by supplying an
argument to method in distance.options
; see gbm::gbm.perf()
for details. The link
argument can be specified as "linear"
to
use the linear predictor instead of the predicted probabilities. No other
links are allowed. The tuning parameter defaults differ from
gbm::gbm()
; they are as follows: n.trees = 1e4
,
interaction.depth = 3
, shrinkage = .01
, bag.fraction = 1
, cv.folds = 5
, keep.data = FALSE
. These are the same
defaults as used in WeightIt and twang, except for
cv.folds
and keep.data
. Note this is not the same use of
generalized boosted modeling as in twang; here, the number of trees is
chosen based on cross-validation or out-of-bag error, rather than based on
optimizing balance. twang should not be cited when using this method
to estimate propensity scores.
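A sketch of overriding the tuning defaults through distance.options, assuming
gbm is installed (the values shown are illustrative, not recommendations):

# GBM propensity score with modified tuning parameters
m.gbm <- matchit(treat ~ age + educ + race + married + re74 + re75,
                 data = lalonde, distance = "gbm",
                 distance.options = list(n.trees = 5000,
                                         interaction.depth = 2))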
"lasso"
, "ridge"
, "elasticnet"
The propensity
scores are estimated using a lasso, ridge, or elastic net model,
respectively. The formula
supplied to matchit()
is processed
with model.matrix()
and passed to glmnet::cv.glmnet()
, and
glmnet::predict.cv.glmnet()
is used to compute the propensity scores. The
link
argument can be specified as a link function supplied to
binomial()
, e.g., "logit"
, which is the default. When link
is prepended by "linear."
, the linear predictor is used instead of
the predicted probabilities. When link = "log"
, a Poisson model is
used. For distance = "elasticnet"
, the alpha
argument, which
controls how to prioritize the lasso and ridge penalties in the elastic net,
is set to .5 by default and can be changed by supplying an argument to
alpha
in distance.options
. For "lasso"
and
"ridge"
, alpha
is set to 1 and 0, respectively, and cannot be
changed. The cv.glmnet()
defaults are used to select the tuning
parameters and generate predictions and can be modified using
distance.options
. If the s
argument is passed to
distance.options
, it will be passed to predict.cv.glmnet()
.
Note that because there is a random component to choosing the tuning
parameter, results will vary across runs unless a seed is
set.
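A sketch, assuming glmnet is installed; alpha = .25 is illustrative, and a seed
is set because the cross-validation folds are chosen randomly:

# Elastic net propensity score; alpha is changed through distance.options
set.seed(100)
m.en <- matchit(treat ~ age + educ + race + married + re74 + re75,
                data = lalonde, distance = "elasticnet",
                distance.options = list(alpha = .25))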
"rpart"
The propensity scores are estimated using a
classification tree. The formula
supplied to matchit()
is
passed directly to rpart::rpart()
, and rpart::predict.rpart()
is used
to compute the propensity scores. The link
argument is ignored, and
predicted probabilities are always returned as the distance measure.
"randomforest"
The propensity scores are estimated using a
random forest. The formula
supplied to matchit()
is passed
directly to randomForest::randomForest()
, and
randomForest::predict.randomForest()
is used to compute the propensity
scores. The link
argument is ignored, and predicted probabilities are
always returned as the distance measure.
"nnet"
The
propensity scores are estimated using a single-hidden-layer neural network.
The formula
supplied to matchit()
is passed directly to
nnet::nnet()
, and fitted()
is used to compute the propensity scores.
The link
argument is ignored, and predicted probabilities are always
returned as the distance measure. An argument to size
must be
supplied to distance.options
when using method = "nnet"
.
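For example (a sketch assuming nnet is installed; size = 5 is illustrative):

# Neural network propensity score; size is required in distance.options
m.nn <- matchit(treat ~ age + educ + race + married + re74 + re75,
                data = lalonde, distance = "nnet",
                distance.options = list(size = 5))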
"cbps"
The propensity scores are estimated using the
covariate balancing propensity score (CBPS) algorithm, which is a form of
logistic regression in which balance constraints are incorporated into a
generalized method of moments estimation of the model coefficients. The
formula
supplied to matchit()
is passed directly to
CBPS::CBPS()
, and fitted()
is used to compute the propensity
scores. The link
argument can be specified as "linear"
to use
the linear predictor instead of the predicted probabilities. No other links
are allowed. The estimand
argument supplied to matchit()
will
be used to select the appropriate estimand for use in defining the balance
constraints, so no argument needs to be supplied to the ATT argument of CBPS::CBPS().
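A sketch, assuming the CBPS package is installed:

# CBPS propensity score; the estimand supplied to matchit() sets the balance target
m.cbps <- matchit(treat ~ age + educ + race + married + re74 + re75,
                  data = lalonde, distance = "cbps", estimand = "ATT")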
"bart"
The propensity scores are estimated
using Bayesian additive regression trees (BART). The formula
supplied
to matchit()
is passed directly to dbarts::bart2()
,
and dbarts::fitted.bart()
is used to compute the propensity
scores. The link
argument can be specified as "linear"
to use
the linear predictor instead of the predicted probabilities. When
s.weights
is supplied to matchit()
, it will not be passed to
bart2
because the weights
argument in bart2
does not
correspond to sampling weights.
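A sketch, assuming dbarts is installed:

# BART propensity score on the linear predictor scale
m.bart <- matchit(treat ~ age + educ + race + married + re74 + re75,
                  data = lalonde, distance = "bart", link = "linear")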
Methods for computing distances from covariates
The following methods involve computing a distance matrix from the covariates themselves
without estimating a propensity score. Calipers on the distance measure and
common support restrictions cannot be used, and the distance
component of the output object will be empty because no propensity scores
are estimated. The link
and distance.options
arguments are
ignored with these methods. See the individual matching methods pages for
whether these distances are allowed and how they are used. Each of these
distance measures can also be calculated outside matchit()
using its
corresponding function.
"euclidean"
The Euclidean distance is the raw
distance between units, computed as $$d_{ij} = \sqrt{(x_i - x_j)(x_i -
x_j)'}$$ It is sensitive to the scale of the covariates, so covariates with
larger scales will take higher priority.
"scaled_euclidean"
The scaled Euclidean distance is the
Euclidean distance computed on the scaled (i.e., standardized) covariates.
This ensures the covariates are on the same scale. The covariates are
standardized using the pooled within-group standard deviations, computed by
treatment group-mean centering each covariate before computing the standard
deviation in the full sample.
"mahalanobis"
The
Mahalanobis distance is computed as $$d_{ij} = \sqrt{(x_i -
x_j)\Sigma^{-1}(x_i - x_j)'}$$ where \(\Sigma\) is the pooled within-group
covariance matrix of the covariates, computed by treatment group-mean
centering each covariate before computing the covariance in the full sample.
This ensures the variables are on the same scale and accounts for the
correlation between covariates.
"robust_mahalanobis"
The
robust rank-based Mahalanobis distance is the Mahalanobis distance computed
on the ranks of the covariates with an adjustment for ties. It is described
in Rosenbaum (2010, ch. 8) as an alternative to the Mahalanobis distance
that handles outliers and rare categories better than the standard
Mahalanobis distance but is not affinely invariant.
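As a sketch (covariate choices are illustrative), Mahalanobis distance matching
can be requested directly, or the same distances can be computed outside
matchit() with the corresponding function:

# Mahalanobis distance matching; no propensity score is estimated
m.mahal <- matchit(treat ~ age + educ + re74 + re75,
                   data = lalonde, distance = "mahalanobis")

# The same distances computed outside matchit()
d.mahal <- mahalanobis_dist(treat ~ age + educ + re74 + re75, data = lalonde)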
To perform Mahalanobis distance matching and estimate propensity
scores to be used for a purpose other than matching, the mahvars
argument should be used along with a different specification of
distance
. See the individual matching method pages for details on how
to use mahvars
.
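A possible sketch of this pattern (covariate and caliper choices are
illustrative): a propensity score supplies a caliper while matching proceeds on
the Mahalanobis distance of the variables in mahvars:

# Mahalanobis matching on selected covariates with a propensity score caliper
m.mv <- matchit(treat ~ age + educ + race + married + re74 + re75,
                data = lalonde, distance = "glm",
                mahvars = ~ age + re74, caliper = .1)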
Distances supplied as a numeric vector or matrix
distance
can also be supplied as a numeric vector whose values will be taken to
function like propensity scores; their pairwise differences will define the
distance between units. This might be useful for supplying propensity scores
computed outside matchit()
or resupplying matchit()
with
propensity scores estimated previously without having to recompute them.
distance
can also be supplied as a matrix whose values represent the
pairwise distances between units. The matrix should either be square, with
a row and column for each unit (e.g., as the output of a call to
as.matrix(dist(.))), or have as many rows as there are treated
units and as many columns as there are control units (e.g., as the output of
a call to mahalanobis_dist()
or optmatch::match_on()
). Distance values
of Inf
will disallow the corresponding units to be matched. When
distance
is supplied as a numeric vector or matrix, link
and
distance.options
are ignored.
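Two sketches, continuing the earlier setup: supplying externally estimated
propensity scores as a numeric vector, and supplying a precomputed distance
matrix:

# Propensity scores estimated outside matchit(), supplied as a vector
fit <- glm(treat ~ age + educ + race + married + re74 + re75,
           data = lalonde, family = binomial)
m.ps <- matchit(treat ~ age + educ + race + married + re74 + re75,
                data = lalonde, distance = fit$fitted.values)

# A treated-by-control distance matrix supplied directly
d <- mahalanobis_dist(treat ~ age + educ + re74 + re75, data = lalonde)
m.mat <- matchit(treat ~ age + educ + race + married + re74 + re75,
                 data = lalonde, distance = d)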