GenMatch: Genetic Matching

Description

This function finds optimal balance using multivariate matching, where a genetic search algorithm determines the weight each covariate is given. Balance is determined by examining cumulative probability distribution functions of a variety of standardized statistics. By default, these statistics include t-tests and Kolmogorov-Smirnov tests. Descriptive statistics based on empirical-QQ (eQQ) plots can also be used, as can any user-provided measure of balance. The statistics are not used to conduct formal hypothesis tests, because no measure of balance is a monotonic function of bias and because balance should be maximized without limit. The object returned by GenMatch can be supplied to the Match function (via the Weight.matrix option) to obtain causal estimates. GenMatch uses genoud to perform the genetic search. Using the cluster option, one may use multiple computers, CPUs, or cores to perform parallel computations.

Usage

GenMatch(Tr, X, BalanceMatrix=X, estimand="ATT", M=1, weights=NULL,
         pop.size = 100, max.generations=100,
         wait.generations=4, hard.generation.limit=FALSE,
         starting.values=rep(1,ncol(X)),
         fit.func="pvals",
         MemoryMatrix=TRUE,
         exact=NULL, caliper=NULL, replace=TRUE, ties=TRUE,
         CommonSupport=FALSE, nboots=0, ks=TRUE, verbose=FALSE,
         distance.tolerance=1e-05,
         tolerance=sqrt(.Machine$double.eps),
         min.weight=0, max.weight=1000,
         Domains=NULL, print.level=2,
         project.path=NULL,
         paired=TRUE, loss=1,
         data.type.integer=FALSE,
         restrict=NULL,
         cluster=FALSE, balance=TRUE, ...)

Arguments

Tr

A vector indicating the observations which are in the treatment regime and those which are not. This can either be a logical vector or a real vector where 0 denotes control and 1 denotes treatment.

X

A matrix containing the variables we wish to match on. This matrix may contain the actual observed covariates or the propensity score or a combination of both.

BalanceMatrix

A matrix containing the variables we wish to achieve balance on. This is by default equal to X, but it can in principle be a matrix which contains more or less variables than X or variables which are transformed in v

estimand

A character string for the estimand. The default estimand is "ATT", the sample average treatment effect for the treated. "ATE" is the sample average treatment effect, and "ATC" is the sample average treatment effect for the controls.

M

A scalar for the number of matches which should be found. The default is one-to-one matching. Also see the ties option.

weights

A vector of the same length as Tr which provides observation-specific weights.

pop.size

Population Size. This is the number of individuals genoud uses to solve the optimization problem. The theorems proving that genetic algorithms find good solutions are asymptotic in popula

max.generations

Maximum Generations. This is the maximum number of generations that genoud will run when optimizing. This is a soft limit. The maximum generation limit will be binding only if

wait.generations

If there is no improvement in the objective function in this number of generations, optimization will stop. The other options controlling termination are max.generations and hard.generation.limit.

hard.generation.limit

This logical variable determines if the max.generations variable is a binding constraint. If hard.generation.limit is FALSE, then the algorithm may exceed the max.generations count if the ob

starting.values

This vector's length is equal to the number of variables in X. This vector contains the starting weights each of the variables is given. The starting.values vector is a way for the user to insert one individ

fit.func

The balance metric GenMatch should optimize. The user may choose from the following or provide a function: pvals: maximize the p.values from (paired) t-tests and Kolmogorov-Smirnov tests conducted for each column in <

MemoryMatrix

This variable controls if genoud sets up a memory matrix. Such a matrix ensures that genoud will request the fitness evaluation of a giv

exact

A logical scalar or vector for whether exact matching should be done. If a logical scalar is provided, that logical value is applied to all covariates in X. If a logical vector is provided, a logical value should be provided

caliper

A scalar or vector denoting the caliper(s) which should be used when matching. A caliper is the distance which is acceptable for any match. Observations which are outside of the caliper are dropped. If a scalar caliper is provided, this cali

replace

A logical flag for whether matching should be done with replacement. Note that if FALSE, the order of matches generally matters. Matches will be found in the same order as the data are sorted. Thus, the match(es) for the first

ties

A logical flag for whether ties should be handled deterministically. By default ties==TRUE. If, for example, one treated observation matches more than one control observation, the matched dataset will include the multiple matched

CommonSupport

This logical flag implements the usual procedure by which observations outside of the common support of a variable (usually the propensity score) across treatment and control groups are discarded. The caliper option is to be

nboots

The number of bootstrap samples to be run for the KS test. By default this option is set to zero, so no bootstraps are done. See ks.boot for additional details.

ks

A logical flag for whether the univariate bootstrap Kolmogorov-Smirnov (KS) test should be calculated. If the ks option is set to TRUE, the univariate KS test is calculated for all non-dichotomous variables. The bootstrap KS test is consistent ev
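As a rough illustration of which columns a KS test would apply to, the sketch below flags the non-dichotomous columns of a small made-up matrix and runs an ordinary two-sample KS test on one of them. The data, the object name Xex, and the helper is_dichotomous are hypothetical, and base R's ks.test stands in for the bootstrap version documented under ks.boot; this is not GenMatch's internal code.

```r
# Hypothetical data: one continuous and one binary covariate
set.seed(1)
Tr  <- rep(c(1, 0), each = 50)
Xex <- cbind(age     = rnorm(100, mean = 30, sd = 5),
             married = rbinom(100, 1, 0.4))

# Hypothetical helper: a column is dichotomous if it has at most two distinct values
is_dichotomous <- function(v) length(unique(v)) <= 2

# Names of the columns that would receive a KS test
ks_cols <- colnames(Xex)[!apply(Xex, 2, is_dichotomous)]

# Ordinary (non-bootstrap) two-sample KS test on the continuous column
ks_age <- ks.test(Xex[Tr == 1, "age"], Xex[Tr == 0, "age"])
ks_age$p.value
```

As elsewhere on this page, the resulting p-value is read as a balance measure, not as a formal hypothesis test.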

verbose

A logical flag for whether details of each fitness evaluation should be printed. verbose is set to FALSE if the cluster option is used.

distance.tolerance

This is a scalar which is used to determine if distances between two observations are different from zero. Values less than distance.tolerance are deemed to be equal to zero. This option can be used to perform a type of optimal m
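The thresholding rule described above can be pictured with a few made-up distances. This is only a sketch of the stated rule (values below distance.tolerance are treated as zero); the vector d is hypothetical, and how GenMatch applies the rule internally is not shown here.

```r
distance.tolerance <- 1e-05

# Hypothetical distances from one treated unit to four candidate controls
d <- c(0, 3e-06, 2e-04, 1.5)

# Distances below the tolerance are deemed equal to zero
d_eff <- ifelse(d < distance.tolerance, 0, d)

# Number of candidates now tied at distance zero
sum(d_eff == 0)
```

Raising distance.tolerance therefore makes more candidates count as equally close.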

tolerance

This is a scalar which is used to determine numerical tolerances. This option is used by numerical routines such as those used to determine if a matrix is singular.

min.weight

This is the minimum weight any variable may be given.

max.weight

This is the maximum weight any variable may be given.

Domains

This is an ncol(X) $\times$ 2 matrix. The first column is the lower bound, and the second column is the upper bound for each variable over which genoud will search for weights.
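A minimal sketch of building such a matrix, assuming a made-up three-column covariate matrix Xex. The bounds 0 and 1000 simply mirror the min.weight and max.weight defaults from the Usage section; the tighter range for the second covariate is an arbitrary illustration.

```r
# Hypothetical covariate matrix with three columns
Xex <- cbind(age  = c(25, 31, 40),
             educ = c(12, 16, 10),
             re74 = c(0, 5000, 12000))
k <- ncol(Xex)

# Column 1 holds lower bounds, column 2 holds upper bounds, one row per covariate
Domains <- cbind(rep(0, k), rep(1000, k))

# Narrow the search range for the second covariate only (arbitrary choice)
Domains[2, ] <- c(1, 10)

dim(Domains)
```

The result could then be supplied as GenMatch(..., Domains=Domains) alongside the corresponding X.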

print.level

This option controls the level of printing. There are four possible levels: 0 (minimal printing), 1 (normal), 2 (detailed), and 3 (debug). If level 2 is selected, GenMatch will print details about the population at each generati

project.path

This is the path of the genoud project file. By default no file is produced unless print.level=3. In that case, genoud

paired

A flag for whether the paired t.test should be used when determining balance.
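To see what the flag changes, the following sketch compares a paired and an unpaired t-test on made-up covariate values for matched pairs. The vectors x_treated and x_control are simulated and hypothetical; the calls use base R's t.test and are not taken from GenMatch's source.

```r
set.seed(2)

# Hypothetical covariate values for 40 treated units and their matched controls
x_treated <- rnorm(40, mean = 10)
x_control <- x_treated + rnorm(40, mean = 0.1, sd = 0.5)

# paired = TRUE tests the within-pair differences; paired = FALSE compares group means
p_paired   <- t.test(x_treated, x_control, paired = TRUE)$p.value
p_unpaired <- t.test(x_treated, x_control, paired = FALSE)$p.value

c(paired = p_paired, unpaired = p_unpaired)
```

Because matched pairs are correlated by construction, the two p-values can differ considerably on the same data.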

loss

The loss function to be optimized. The default value, 1, implies "lexical" optimization: all of the balance statistics will be sorted from the most discrepant to the least and weights will be picked which minimize the maximum dis
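One way to picture lexical comparison is the sketch below. It assumes the balance statistics are p-values (as with fit.func="pvals"), so the smallest value is the most discrepant. The helper lexical_better is hypothetical, and breaking ties on the next most discrepant statistic is an interpretation of "lexical", not code from the package.

```r
# Hypothetical helper: TRUE if balance profile `a` beats profile `b` lexically
lexical_better <- function(a, b) {
  stopifnot(length(a) == length(b))
  a <- sort(a)                      # smallest p-value (most discrepant) first
  b <- sort(b)
  differs <- which(a != b)
  if (length(differs) == 0) return(FALSE)   # identical profiles: neither wins
  a[differs[1]] > b[differs[1]]     # larger p-value at the first difference wins
}

# Two made-up sets of p-values from candidate weight vectors
p1 <- c(0.30, 0.08, 0.55)
p2 <- c(0.90, 0.05, 0.95)

lexical_better(p1, p2)
```

Here p1 is preferred even though p2 has larger p-values on two of the three statistics, because the worst statistic in p1 (0.08) is better than the worst in p2 (0.05).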

data.type.integer

By default, floating-point weights are considered. If this option is set to TRUE, search will be done over integer weights. Note that before version 4.1, the default was to use integer weights.

restrict

A matrix which restricts the possible matches. This matrix has one row for each restriction and three columns. The first two columns contain the two observation numbers which are to be restricted (for example 4 and 20), and the third colu

cluster

This can either be an object of the 'cluster' class returned by one of the makeCluster commands in the snow package or a vector of machine names so that GenMatch can setup t

balance

This logical flag controls if load balancing is done across the cluster. Load balancing can result in better cluster utilization; however, increased communication can reduce performance. This option is best used if each individual call to

...

Other options which are passed on to genoud.

Value

value

The fit values at the solution. By default, this is a vector of p-values sorted from the smallest to the largest. There will generally be twice as many p-values as there are variables in BalanceMatrix, unless there are dichotomous variables in this matrix. There is one p-value for each covariate in BalanceMatrix which is the result of a paired t-test and another p-value for each non-dichotomous variable in BalanceMatrix which is the result of a Kolmogorov-Smirnov test. Recall that these p-values cannot be interpreted as hypothesis tests. They are simply measures of balance.

par

A vector of the weights given to each variable in X.

Weight.matrix

A matrix whose diagonal corresponds to the weight given to each variable in X. This object corresponds to the Weight.matrix in the Match function.

matches

A matrix where the first column contains the row numbers of the treated observations in the matched dataset. The second column contains the row numbers of the control observations. And the third column contains the weight that each matched pair is given. These columns correspond respectively to the index.treated, index.control and weights objects which are returned by Match.

ecaliper

The size of the enforced caliper on the scale of the X variables. This object has the same length as the number of covariates in X.
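The relationship between par, Weight.matrix and the columns of matches can be sketched with made-up values. Every number below is hypothetical and only mimics the shapes described above; nothing here is real GenMatch output.

```r
# Hypothetical weights for three covariates
par <- c(12.5, 3.2, 840)

# A matrix whose diagonal holds those weights, zero elsewhere
Weight.matrix <- diag(par)

# Hypothetical matches: treated row, control row, weight of the matched pair
matches <- cbind(c(1, 2, 2),
                 c(7, 9, 11),
                 c(1, 0.5, 0.5))

# The three columns, named after the corresponding objects returned by Match
index.treated <- matches[, 1]
index.control <- matches[, 2]
pair.weights  <- matches[, 3]

# Total weight attached to each treated observation
tapply(pair.weights, index.treated, sum)
```

On a real result the same pieces would be read from the returned object, for example genout$par and genout$matches in the Examples section.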

References

Sekhon, Jasjeet S. 2007. "Multivariate and Propensity Score Matching Software with Automated Balance Optimization." Journal of Statistical Software. http://sekhon.berkeley.edu/papers/MatchingJSS.pdf

Sekhon, Jasjeet S. 2006. "Alternative Balance Metrics for Bias Reduction in Matching Methods for Causal Inference." Working Paper. http://sekhon.berkeley.edu/papers/SekhonBalanceMetrics.pdf

Diamond, Alexis and Jasjeet S. Sekhon. 2005. "Genetic Matching for Estimating Causal Effects: A General Multivariate Matching Method for Achieving Balance in Observational Studies." Working Paper. http://sekhon.berkeley.edu/papers/GenMatch.pdf

Sekhon, Jasjeet Singh and Walter R. Mebane, Jr. 1998. "Genetic Optimization Using Derivatives: Theory and Application to Nonlinear Models." Political Analysis, 7: 187-210. http://sekhon.berkeley.edu/genoud/genoud.pdf

Examples

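# Load the package that provides GenMatch, Match, MatchBalance and the lalonde data
library(Matching)
data(lalonde)
attach(lalonde)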

#The covariates we want to match on
X <- cbind(age, educ, black, hisp, married, nodegr, u74, u75, re75, re74)

#The covariates we want to obtain balance on
BalanceMat <- cbind(age, educ, black, hisp, married, nodegr, u74, u75, re75, re74,
                    I(re74*re75))

#
#Let's call GenMatch() to find the optimal weight to give each
#covariate in 'X' so that we achieve balance on the covariates in
#'BalanceMat'. This is only an example, so we want GenMatch to be quick;
#the population size has therefore been set to only 16 via the 'pop.size'
#option. This is *WAY* too small for actual problems.
#For details see http://sekhon.berkeley.edu/papers/MatchingJSS.pdf.
#
genout <- GenMatch(Tr=treat, X=X, BalanceMatrix=BalanceMat, estimand="ATE", M=1,
                   pop.size=16, max.generations=10, wait.generations=1)

#The outcome variable
Y <- re78/1000

#
# Now that GenMatch() has found the optimal weights, let's estimate
# our causal effect of interest using those weights
#
mout <- Match(Y=Y, Tr=treat, X=X, estimand="ATE", Weight.matrix=genout)
summary(mout)

#
#Let's determine if balance has actually been obtained on the variables of interest
#
mb <- MatchBalance(treat ~ age + educ + black + hisp + married + nodegr + u74 + u75 +
                   re75 + re74 + I(re74*re75),
                   match.out=mout, nboots=500, ks=TRUE, mv=FALSE)

# For more examples see: http://sekhon.berkeley.edu/matching/R.
