method_cem: Coarsened Exact Matching

Description

In matchit, setting method = "cem" performs coarsened exact matching. With coarsened exact matching, covariates are coarsened into bins, and a complete cross of the coarsened covariates is used to form subclasses defined by each combination of the coarsened covariate levels. Any subclass that doesn't contain both treated and control units is discarded, leaving only subclasses containing treated and control units that are exactly equal on the coarsened covariates. The coarsening process can be controlled by an algorithm or by manually specifying cutpoints and groupings. The benefit of coarsened exact matching is that the tradeoff between exact matching and approximate balancing can be managed to avoid discarding too many units, which can otherwise occur with exact matching.
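The procedure described above can be sketched in base R, without MatchIt. This is only an illustration under simple assumptions: the data frame d, its columns treat, age, and sex, and the cutpoints for age are all hypothetical, and the code is not MatchIt's internal implementation.

```r
# Toy data (hypothetical): binary treatment, one numeric and one categorical covariate
d <- data.frame(
  treat = c(1, 0, 1, 0, 1, 0, 0, 1),
  age   = c(23, 27, 41, 44, 58, 35, 62, 36),
  sex   = c("f", "f", "m", "m", "f", "m", "f", "f")
)

# 1. Coarsen the numeric covariate into bins (cutpoints chosen arbitrarily here)
d$age_bin <- cut(d$age, breaks = c(-Inf, 30, 50, Inf), right = FALSE)

# 2. Form strata from the full cross of the coarsened covariates
d$stratum <- interaction(d$age_bin, d$sex, drop = TRUE)

# 3. Keep only strata that contain both treated and control units
has_both <- tapply(d$treat, d$stratum, function(t) any(t == 1) && any(t == 0))
matched  <- d[has_both[as.character(d$stratum)], ]

matched
```

In this toy example the last unit is the only one in its stratum, so it has no control counterpart and is discarded; the remaining units are retained.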

This page details the allowable arguments with method = "cem". See matchit for an explanation of what each argument means in a general context and how it can be specified.

Below is how matchit is used for coarsened exact matching:

matchit(formula, data = NULL, method = "cem",
        estimand = "ATT", s.weights = NULL,
        verbose = FALSE, ...)

Arguments

formula

a two-sided formula object containing the treatment and covariates to be used in creating the subclasses defined by a full cross of the coarsened covariate levels.

data

a data frame containing the variables named in formula. If not found in data, the variables will be sought in the environment.

method

set here to "cem".

estimand

a string containing the desired estimand. Allowable options include "ATT", "ATC", and "ATE". The estimand controls how the weights are computed; see the Computing Weights section at matchit for details. When k2k = TRUE (see below), estimand also controls how the matching is done.

s.weights

the variable containing sampling weights to be incorporated into balance statistics. These weights do not affect the matching process.

verbose

logical; whether information about the matching process should be printed to the console.

...

additional arguments to control the matching process.

grouping: a named list with an (optional) entry for each categorical variable to be matched on. Each element should itself be a list, and each entry of the sublist should be a vector containing levels of the variable that should be combined to form a single level. Any categorical variables not included in grouping will remain as they are in the data, which means exact matching, with no coarsening, will take place on these variables. See Details.
cutpoints: a named list with an (optional) entry for each numeric variable to be matched on. Each element describes a way of coarsening the corresponding variable. An element can be a vector of cutpoints that demarcate bins, a single number giving the number of bins, or a string corresponding to a method of computing the number of bins. Allowable strings include "sturges", "scott", and "fd", which use the functions nclass.Sturges, nclass.scott, and nclass.FD, respectively. The default is "sturges" for variables that are not listed or if no argument is supplied. Can also be a single value to be applied to all numeric variables. See Details.
k2k: logical; whether 1:1 matching should occur within the matched strata. If TRUE, nearest neighbor matching without replacement will take place within each stratum, and any unmatched units will be dropped (e.g., if there are more treated than control units in the stratum, the treated units without a match will be dropped). The k2k.method argument controls how the distance between units is calculated.
k2k.method: character; how the distance between units should be calculated if k2k = TRUE. Allowable arguments include NULL (for random matching), "mahalanobis" (for Mahalanobis distance matching), or any allowable argument to method in dist. Matching will take place on scaled versions of the original (non-coarsened) variables. The default is "mahalanobis".
mpower: if k2k.method = "minkowski", the power used in creating the distance. This is passed to the p argument of dist.
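The distance options named for k2k.method can be illustrated with base R alone. The sketch below uses a small hypothetical covariate matrix X and shows distances computed on scaled variables with dist(), including the p argument used for Minkowski distance, plus a pairwise Mahalanobis distance. How MatchIt scales the variables internally is not shown here; scale() is an assumption made for illustration.

```r
# Hypothetical covariates for four units
X <- cbind(age = c(23, 27, 41, 44), educ = c(10, 12, 16, 11))

# Scale each column before computing distances (assumed scaling, for illustration)
Xs <- scale(X)

# Euclidean and Minkowski distances via stats::dist();
# p plays the role that mpower is described as being passed to
d_euc  <- dist(Xs, method = "euclidean")
d_mink <- dist(Xs, method = "minkowski", p = 3)

# Pairwise Mahalanobis distances: distance of every unit to unit i, for each i
S <- cov(X)
d_mahal <- sapply(seq_len(nrow(X)), function(i) {
  sqrt(mahalanobis(X, center = X[i, ], cov = S))
})

d_euc
d_mink
round(d_mahal, 3)
```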

Outputs

All outputs described in matchit are returned with method = "cem" except for match.matrix. When k2k = TRUE, a match.matrix component with the matched pairs is also included.

Details

If the coarsening is such that there are no exact matches with the coarsened variables, the grouping and cutpoints arguments can be used to modify the matching specification. Reducing the number of cutpoints or grouping some variable values together can make it easier to find matches. See Examples below. Removing variables can also help (but they will likely not be balanced unless highly correlated with the included variables). To take advantage of coarsened exact matching without failing to find any matches, the covariates can be manually coarsened outside of matchit() and then supplied to the exact argument in a call to matchit() with another matching method.

Setting k2k = TRUE is equivalent to matching with k2k = FALSE and then supplying stratum membership as an exact matching variable (i.e., in exact) to another call to matchit() with method = "nearest", distance = "mahalanobis", and an argument to discard denoting unmatched units. It is also equivalent to performing nearest neighbor matching supplying coarsened versions of the variables to exact, except that method = "cem" automatically coarsens the continuous variables. The estimand argument supplied with method = "cem" functions the same way it would in these alternate matching calls, i.e., by determining the "focal" group that controls the order of the matching.

Grouping and Cutpoints

The grouping and cutpoints arguments allow one to fine-tune the coarsening of the covariates. grouping is used for combining categories of categorical covariates and cutpoints is used for binning numeric covariates. The values supplied to these arguments should be changed iteratively until a matching solution with an acceptable tradeoff between covariate balance and remaining sample size is obtained. The arguments are described below.

The argument to grouping must be a list, where each component has the name of a categorical variable, the levels of which are to be combined. Each component must itself be a list; this list contains one or more vectors of levels, where each vector corresponds to the levels that should be combined into a single category. For example, if a variable amount had levels "none", "some", and "a lot", one could enter grouping = list(amount = list(c("none"), c("some", "a lot"))), which would group "some" and "a lot" into a single category and leave "none" in its own category. Any levels left out of the list for each variable will be left alone (so c("none") could have been omitted from the previous code). Note that if a categorical variable does not appear in grouping, it will not be coarsened, so exact matching will take place on it. grouping should not be used for numeric variables; use cutpoints, described below, instead.
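The effect of combining levels can be illustrated in base R. The sketch below uses a hypothetical factor amount with the three levels from the example and merges "some" and "a lot" by assigning a named list to levels(). It only mimics the result of the coarsening and is not how MatchIt processes grouping internally; the new level name some_or_a_lot is made up for illustration.

```r
# Hypothetical categorical covariate with three levels
amount <- factor(c("none", "some", "a lot", "some", "none"),
                 levels = c("none", "some", "a lot"))

# Combine "some" and "a lot" into one category; "none" stays on its own.
# Assigning a named list to levels() merges the listed levels.
amount_coarse <- amount
levels(amount_coarse) <- list(none = "none",
                              some_or_a_lot = c("some", "a lot"))

# Cross-tabulate original against coarsened levels
table(amount, amount_coarse)
```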

The argument to cutpoints must also be a list, where each component has the name of a numeric variable that is to be binned. (As a shortcut, it can also be a single value that will be applied to all numeric variables.) Each component can take one of three forms: a vector of cutpoints that separate the bins, a single number giving the number of bins, or a string corresponding to an algorithm used to compute the number of bins. Any values at a boundary will be placed into the higher bin; e.g., if the cutpoints were c(0, 5, 10), values of 5 would be placed into the same bin as values of 6, 7, 8, or 9, and values of 10 would be placed into a different bin. Internally, values of -Inf and Inf are appended to the beginning and end of the range. When given as a single number defining the number of bins, the outermost bin boundaries are the minimum and maximum values of the variable, with the remaining boundaries evenly spaced between them, i.e., not quantiles. A value of 0 will not perform any binning (equivalent to exact matching on the variable), and a value of 1 will remove the variable from the exact matching variables, but it will still be used for pair matching when k2k = TRUE. The allowable strings include "sturges", "scott", and "fd", which use the corresponding binning method, and "q#" where # is a number, which splits the variable into # equally-sized bins (i.e., quantiles).
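The binning rules above can be mimicked in base R with findInterval(), which uses left-closed intervals so that boundary values fall into the higher bin. This is a sketch of one reading of the description, using a hypothetical vector x; it is not MatchIt's internal code, and the exact boundaries MatchIt computes may differ.

```r
x <- c(-2, 0, 3, 5, 6, 9, 10, 12)  # hypothetical numeric covariate

# Explicit cutpoints c(0, 5, 10) with -Inf and Inf appended at the ends.
# Bins are left-closed, so a value of 5 shares a bin with 6 and 9,
# while 10 starts a new bin.
breaks  <- c(-Inf, 0, 5, 10, Inf)
bin_cut <- findInterval(x, breaks)
bin_cut

# A single number of bins: equal-width bins between the minimum and maximum
n_bins       <- 4
width_breaks <- seq(min(x), max(x), length.out = n_bins + 1)
bin_width    <- findInterval(x, width_breaks, rightmost.closed = TRUE)
bin_width

# Quantile-based bins (the idea behind "q#"): similar numbers of units per bin
q_breaks  <- quantile(x, probs = seq(0, 1, length.out = n_bins + 1))
bin_quant <- findInterval(x, q_breaks, rightmost.closed = TRUE)
table(bin_quant)
```

Equal-width bins can leave some bins sparse when the variable is skewed, whereas quantile bins keep bin sizes similar at the cost of unequal bin widths.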

An example of a way to supply an argument to cutpoints would be the following:

cutpoints = list(X1 = 4,
                 X2 = c(1.7, 5.5, 10.2),
                 X3 = "scott",
                 X4 = "q5")

This would split X1 into 4 bins, X2 into bins based on the provided boundaries, X3 into a number of bins determined by nclass.scott, and X4 into quintiles. All other numeric variables would be split into a number of bins determined by nclass.Sturges, the default.
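To see how many bins each string option would suggest for a given variable, the corresponding functions from grDevices can be called directly. The vector x below is simulated and purely hypothetical; how MatchIt turns the resulting number into bin boundaries is not shown here.

```r
set.seed(1)
x <- rnorm(200)  # hypothetical numeric covariate

# Number of bins suggested by each rule (functions from grDevices)
n_bins <- c(
  sturges = grDevices::nclass.Sturges(x),
  scott   = grDevices::nclass.scott(x),
  fd      = grDevices::nclass.FD(x)
)
n_bins
```

Sturges' rule depends only on the sample size, while the Scott and Freedman-Diaconis rules also depend on the spread of the variable, so the three can disagree for the same data.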

References

In a manuscript, you don't need to cite another package when using method = "cem" because the matching is performed completely within MatchIt. For example, a sentence might read:

Coarsened exact matching was performed using the MatchIt package (Ho, Imai, King, & Stuart, 2011) in R.

It would be a good idea to cite the following article, which develops the theory behind coarsened exact matching:

Iacus, S. M., King, G., & Porro, G. (2012). Causal Inference without Balance Checking: Coarsened Exact Matching. Political Analysis, 20(1), 1–24. doi:10.1093/pan/mpr013

Examples


# NOT RUN {
data("lalonde")

# Coarsened exact matching on age, race, married, and educ with educ
# coarsened into 5 bins and race coarsened into 2 categories,
# grouping "white" and "hispan" together
m.out1 <- matchit(treat ~ age + race + married + educ, data = lalonde,
                  method = "cem", cutpoints = list(educ = 5),
                  grouping = list(race = list(c("white", "hispan"),
                                              c("black"))))
m.out1
summary(m.out1)

# The same but requesting 1:1 Mahalanobis distance matching with
# the k2k and k2k.method argument. Note the remaining number of units
# is smaller than when retaining the full matched sample.
m.out2 <- matchit(treat ~ age + race + married + educ, data = lalonde,
                  method = "cem", cutpoints = list(educ = 5),
                  grouping = list(race = list(c("white", "hispan"),
                                              "black")),
                  k2k = TRUE, k2k.method = "mahalanobis")
m.out2
summary(m.out2)
# }
