The Euclidean distance (computed using euclidean_dist()) is
the raw distance between units, computed as $$d_{ij} = \sqrt{(x_i -
x_j)(x_i - x_j)'}$$ where \(x_i\) and \(x_j\) are vectors of covariates
for units \(i\) and \(j\), respectively. The Euclidean distance is
sensitive to the scales of the variables and their redundancy (i.e.,
correlation). It should probably not be used for matching unless all of the
variables have been previously scaled appropriately or are already on the
same scale. It forms the basis of the other distance measures.
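For illustration, a minimal base-R sketch of this computation on two hypothetical units:

  # Two hypothetical units with covariates on very different scales
  x1 <- c(age = 35, income = 52000)
  x2 <- c(age = 40, income = 48000)

  # Raw Euclidean distance: sqrt((x_i - x_j)(x_i - x_j)')
  sqrt(sum((x1 - x2)^2))

  # The income difference (4000) dominates the age difference (5),
  # illustrating the scale sensitivity noted above.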
The scaled Euclidean distance (computed using scaled_euclidean_dist()) is the Euclidean distance computed on the
scaled covariates. Typically the covariates are scaled by dividing by their
standard deviations, but any scaling factor can be supplied using the var argument. This leads to a distance measure computed as
$$d_{ij} = \sqrt{(x_i - x_j)S_d^{-1}(x_i - x_j)'}$$ where \(S_d\) is a
diagonal matrix with the squared scaling factors on the diagonal. Although
this measure is not sensitive to the scales of the variables (because they
are all placed on the same scale), it is still sensitive to redundancy among
the variables. For example, if 5 variables measure approximately the same
construct (i.e., are highly correlated) and 1 variable measures another
construct, the first construct will have 5 times as much influence on the
distance between units as the second construct. The Mahalanobis distance
attempts to address this issue.
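A sketch of the default scaling on hypothetical data, dividing each covariate by its standard deviation before computing Euclidean distances:

  # Hypothetical covariate matrix with units in rows
  X <- cbind(age = c(35, 40, 29), income = c(52000, 48000, 61000))

  # Scale each column by its standard deviation, then compute
  # pairwise Euclidean distances on the scaled covariates
  X_scaled <- sweep(X, 2, apply(X, 2, sd), "/")
  dist(X_scaled)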
The Mahalanobis distance (computed using mahalanobis_dist())
is computed as $$d_{ij} = \sqrt{(x_i - x_j)S^{-1}(x_i - x_j)'}$$ where
\(S\) is a scaling matrix, typically the covariance matrix of the
covariates. It is essentially equivalent to the Euclidean distance computed
on the scaled principal components of the covariates. This is the most
popular distance matrix for matching because it is not sensitive to the
scale of the covariates and accounts for redundancy between them. The
scaling matrix can also be supplied using the var argument.
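A base-R sketch on hypothetical data, using the sample covariance as the scaling matrix (stats::mahalanobis() returns the squared distance, so the square root is taken):

  X <- cbind(age = c(35, 40, 29), income = c(52000, 48000, 61000))
  S <- cov(X)   # scaling matrix: covariance of the covariates

  # Mahalanobis distance between units 1 and 2
  sqrt(mahalanobis(X[1, ], X[2, ], S))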
The Mahalanobis distance can be sensitive to outliers and long-tailed or
otherwise non-normally distributed covariates and may not perform well with
categorical variables due to prioritizing rare categories over common ones.
One solution is the rank-based robust Mahalanobis distance (computed using robust_mahalanobis_dist()), which first replaces the covariates with their ranks (using average ranks for ties) and rescales each ranked covariate by a constant scaling factor before computing the usual Mahalanobis distance on the rescaled ranks.
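A sketch of the rank-based computation on hypothetical data; the particular rescaling constant used below (the standard deviation of untied ranks, following Rosenbaum's formulation) is an assumption, as the constant is not specified above:

  X <- cbind(age = c(35, 40, 29, 40),
             income = c(52000, 48000, 61000, 50000))

  # Replace each covariate with its ranks, using average ranks for ties
  R <- apply(X, 2, rank, ties.method = "average")

  # Rescale each ranked covariate by a constant scaling factor
  # (here, the SD of untied ranks -- an assumed choice)
  sd_untied <- sd(seq_len(nrow(R)))
  R <- sweep(R, 2, apply(R, 2, sd), "/") * sd_untied

  # Usual Mahalanobis distance on the rescaled ranks
  sqrt(mahalanobis(R[1, ], R[2, ], cov(R)))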
The Mahalanobis distance and its robust variant are computed internally by
transforming the covariates in such a way that the Euclidean distance
computed on the scaled covariates is equal to the requested distance. For
the Mahalanobis distance, this involves replacing the covariate vector \(x_i\) with \(x_i S^{-1/2}\), where \(S^{-1/2}\) is the Cholesky decomposition of the (generalized) inverse of the covariance matrix \(S\).
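A sketch verifying this equivalence on hypothetical data; note that R's chol() returns an upper-triangular factor, so a transpose is needed to apply it on the right as described:

  X <- cbind(age = c(35, 40, 29), income = c(52000, 48000, 61000))
  S <- cov(X)

  # Replace each row x_i with x_i S^{-1/2}, built from the Cholesky
  # factor of the inverse covariance matrix
  Z <- X %*% t(chol(solve(S)))

  # Euclidean distance on the transformed covariates equals the
  # Mahalanobis distance on the originals
  all.equal(as.numeric(dist(Z))[1],
            sqrt(mahalanobis(X[1, ], X[2, ], S)))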
When a left-hand-side splitting variable is present in formula and var = NULL (i.e., so that the scaling matrix is computed internally), the covariance matrix used is the "pooled" covariance matrix, which is essentially a weighted average of the covariance matrices computed separately within each level of the splitting variable; this captures within-group variation and reduces sensitivity to covariate imbalance.
is also true of the scaling factors used in the scaled Euclidean distance.
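As an illustration, a pooled covariance matrix can be sketched as a weighted average of the group-specific covariance matrices; the degrees-of-freedom weighting below is an assumption, as the text above specifies only that a weighted average is used:

  X <- data.frame(age = c(35, 40, 29, 44, 31, 38),
                  income = c(52000, 48000, 61000, 45000, 58000, 50000))
  g <- c(0, 0, 0, 1, 1, 1)   # hypothetical splitting variable

  # Covariance computed separately within each level of g,
  # then averaged with degrees-of-freedom weights
  covs <- lapply(split(X, g), cov)
  dfs  <- table(g) - 1
  S_pooled <- Reduce(`+`, Map(`*`, covs, as.numeric(dfs))) / sum(dfs)
  S_pooled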