Block units into experimental blocks, with one unit per treatment condition. Blocking begins by creating a measure of multivariate distance between all possible pairs of units. Maximum, minimum, or an allowable range of differences between units on one variable can be set.
block(
data,
vcov.data = NULL,
groups = NULL,
n.tr = 2,
id.vars,
block.vars = NULL,
algorithm = "optGreedy",
distance = "mahalanobis",
weight = NULL,
optfactor = 10^7,
row.sort = NULL,
level.two = FALSE,
valid.var = NULL,
valid.range = NULL,
seed.dist,
namesCol = NULL,
verbose = FALSE,
...
)
A list with elements
blocks: a list of dataframes, each containing a group's blocked units. If there are two treatment conditions, then the last column of each dataframe displays the multivariate distance between the two units. If there are more than two treatment conditions, then the last column of each dataframe displays the largest of the multivariate distances between all possible pairs in the block.
level.two: a logical indicating whether level.two = TRUE.
call: the original call to block.
data: a dataframe or matrix, with units in rows and variables in columns.
vcov.data: an optional matrix of data used to estimate the variance-covariance matrix for calculating multivariate distance.
groups: an optional column name from data, specifying subgroups within which blocking occurs.
n.tr: the number of treatment conditions per block.
id.vars: a required string or vector of two strings specifying which column(s) of data contain identifying information.
block.vars: an optional string or vector of strings specifying which column(s) of data contain the numeric blocking variables.
algorithm: a string specifying the blocking algorithm. "optGreedy", "optimal", "naiveGreedy", "randGreedy", and "sortGreedy" algorithms are currently available. See Details for more information.
distance: either a) a string defining how the multivariate distance used for blocking is calculated (options include "mahalanobis", "mcd", "mve", and "euclidean"), or b) a user-defined $k$-by-$k$ matrix of distances, where $k$ is the number of rows in data.
weight: either a vector of length equal to the number of blocking variables or a square matrix with dimensions equal to the number of blocking variables, used to explicitly weight blocking variables.
optfactor: a number by which distances are multiplied, then divided, when algorithm = "optimal".
row.sort: an optional vector of integers from 1 to nrow(data) used to sort the rows of data when algorithm = "sortGreedy".
level.two: a logical defining the level of blocking.
valid.var: an optional string defining a variable on which units in the same block must fall within the range defined by valid.range.
valid.range: an optional vector defining the range of valid.var within which units in the same block must fall.
seed.dist: an optional integer value for the random seed set in cov.rob, used to calculate measures of the variance-covariance matrix robust to outliers.
namesCol: an optional vector of column names for the output table.
verbose: a logical specifying whether groups names and block numbers are printed as blocks are created.
...: additional arguments passed to cov.rob.
Ryan T. Moore rtm@american.edu and Keith Schnakenberg keith.schnakenberg@gmail.com
If vcov.data = NULL, then block calculates the variance-covariance matrix using the block.vars from data.
If groups is not user-specified, block temporarily creates a variable in data called groups, which takes the value 1 for every unit.
Where possible, one unit is assigned to each condition in each block. If there are fewer available units than treatment conditions, available units are used.
If n.tr > 2, then the optGreedy algorithm finds the best possible pair match, then the best match to either member of the pair, then the best match to any member of the triple, etc. After finding the best pair match to a given unit, the other greedy algorithms proceed by finding the third, fourth, etc. best match to that given unit.
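The two ways of growing a block beyond a pair can be sketched in base R. This is only an illustration of the description above on made-up one-dimensional data, not the package's internal (C) implementation; the names `block_opt`, `block_anchor`, and `anchor` are hypothetical.

```r
# Toy data: five units on one covariate (values are made up).
x <- c(a = 0, b = 10, c = 30, d = 35, e = 51)
d <- as.matrix(dist(x))   # pairwise absolute differences
diag(d) <- Inf            # a unit cannot be matched to itself

# optGreedy-style growth: start from the closest pair overall, then add
# the remaining unit that is closest to ANY current member of the block.
best <- which(d == min(d), arr.ind = TRUE)[1, ]
block_opt <- rownames(d)[best]
rest <- setdiff(rownames(d), block_opt)
to_any <- apply(d[rest, block_opt, drop = FALSE], 1, min)
block_opt <- c(block_opt, names(which.min(to_any)))

# Growth used by the other greedy rules: fix one given unit (here "c")
# and take its best and second-best matches.
anchor <- "c"
ranked <- names(sort(d[anchor, ]))
block_anchor <- c(anchor, ranked[1:2])

sort(block_opt)
sort(block_anchor)
```

In this toy case both strategies begin with the pair (c, d), but they add a different third unit, because e is close to d while b is the second-closest unit to c.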
An example of id.vars is id.vars = c("id", "id2"). If two-level blocking is selected, id.vars should be ordered (unit id, subunit id). See details for level.two below for more information.
If block.vars = NULL, then all variables in data except the id.vars are taken as blocking variables. An example of an explicit specification is block.vars = c("b1", "b2").
The algorithm optGreedy calls an optimal-greedy algorithm, repeatedly finding the best remaining match in the entire dataset; optimal finds the set of blocks that minimizes the sum of the distances in all blocks; naiveGreedy finds the best match proceeding down the dataset from the first unit to the last; randGreedy randomly selects a unit, finds its best match, and repeats; sortGreedy re-sorts the dataset according to row.sort, then implements the naiveGreedy algorithm.
The optGreedy algorithm breaks ties by randomly selecting one of the minimum-distance pairs. The naiveGreedy, sortGreedy, and randGreedy algorithms break ties by randomly selecting one of the minimum-distance matches to the particular unit in question.
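To make the difference between the greedy rules concrete, here is a minimal base R sketch of pair formation under an optGreedy-style rule and a naiveGreedy-style rule. It is an assumption-laden illustration, not the package's code: `greedy_pairs` and `total_distance` are hypothetical helpers, the data are made up, and ties are resolved by taking the first minimum rather than at random.

```r
# Hypothetical helper: form pairs from a distance matrix under two rules.
greedy_pairs <- function(d, rule = c("opt", "naive")) {
  rule <- match.arg(rule)
  diag(d) <- Inf                      # no self-matches
  left <- rownames(d)
  pairs <- list()
  while (length(left) >= 2) {
    sub <- d[left, left, drop = FALSE]
    if (rule == "opt") {
      # Best remaining match anywhere in the data.
      idx <- which(sub == min(sub), arr.ind = TRUE)[1, ]
      pair <- left[idx]
    } else {
      # Best match for the first remaining unit, moving down the data.
      first <- left[1]
      pair <- c(first, names(which.min(sub[first, ])))
    }
    pairs[[length(pairs) + 1]] <- pair
    left <- setdiff(left, pair)
  }
  pairs
}

# Hypothetical helper: sum of within-pair distances.
total_distance <- function(pairs, d) {
  sum(sapply(pairs, function(p) d[p[1], p[2]]))
}

x <- c(u1 = 0, u2 = 4, u3 = 5, u4 = 20)   # made-up covariate values
d <- as.matrix(dist(x))

pairs_opt   <- greedy_pairs(d, "opt")
pairs_naive <- greedy_pairs(d, "naive")
total_opt   <- total_distance(pairs_opt, d)
total_naive <- total_distance(pairs_naive, d)
```

On these four units the two rules produce different pairings with different total distances, and neither greedy rule is guaranteed to minimize the total; minimizing the sum of distances is what the optimal algorithm targets.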
As of version 0.5-1, blocking is done in C for all algorithms except optimal (see following paragraphs for more details on the optimal algorithm implementation).
The optimal algorithm uses two functions from the nbpMatching package: distancematrix prepares a distance matrix for optimal blocking, and nonbimatch performs the optimal blocking by minimizing the sum of distances in blocks. nonbimatch, and thus the block algorithm optimal, requires that n.tr = 2.
Because distancematrix takes the integer floor of the distances, and one may want much finer precision, the multivariate distances calculated within block are multiplied by optfactor prior to optimal blocking. Then distancematrix prepares the resulting distance matrix, and nonbimatch is called on the output. The distances are then untransformed by dividing by optfactor before being returned by block.
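The effect of scaling before taking the integer floor can be seen with a few made-up distances. This sketch only mimics the multiply, floor, divide sequence described above in base R; it does not call nbpMatching.

```r
d <- c(0.12, 0.57, 0.93, 1.41)   # made-up multivariate distances

# Flooring the raw distances collapses most of them to the same integer.
floored_raw <- floor(d)

# Scaling first keeps the distinctions; dividing afterwards restores the scale.
optfactor <- 10^7
scaled    <- floor(d * optfactor)
recovered <- scaled / optfactor

# The recovered values match the originals to within about 1 / optfactor.
max(abs(recovered - d))
```

A smaller optfactor loses more digits in the floor step, which is the precision cost mentioned for the optimal algorithm.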
The choice of optfactor can determine whether the Fortran code can allocate enough memory to solve the optimization problem. For example, blocking the first 14 units of x100 by executing block(x100[1:14, ], id.vars = "id", block.vars = c("b1", "b2"), algorithm = "optimal", optfactor = 10^8) fails for Fortran memory reasons, while the same code with optfactor = 10^5 runs successfully. Smaller values of optfactor imply easier computation, but less precision.
Most of the algorithms in block rule out prohibited blockings by assigning them a distance of Inf. However, the optimal algorithm calls Fortran code from nbpMatching and requires integers. Thus, a distance of 99999 * max(dist.mat) is used to effectively prohibit blockings. This follows the procedure demonstrated in the example of help(nonbimatch).
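The two ways of prohibiting a blocking can be written out on a small made-up distance matrix. The matrix values and the `prohibited` mask are assumptions for illustration; only the Inf and 99999 * max(dist.mat) conventions come from the text above.

```r
# Made-up symmetric distance matrix for three units.
dist.mat <- matrix(c(0, 2, 7,
                     2, 0, 3,
                     7, 3, 0), nrow = 3, byrow = TRUE)

# Suppose units 1 and 2 may not share a block.
prohibited <- matrix(FALSE, 3, 3)
prohibited[1, 2] <- prohibited[2, 1] <- TRUE

# Greedy algorithms: an infinite distance can never be the minimum.
d_inf <- dist.mat
d_inf[prohibited] <- Inf

# Integer-only setting: a very large finite penalty plays the same role.
d_int <- dist.mat
d_int[prohibited] <- 99999 * max(dist.mat)
```

The penalty is computed from the largest distance in the original matrix, so it dwarfs every permitted distance while remaining finite.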
In order to enable comparisons of block-quality across groups, when distance is a string, $\Sigma$ is calculated using units from all groups.
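As a rough illustration of why a single covariance matrix matters, the sketch below estimates $\Sigma$ once from all units and then reads off within-group Mahalanobis distances on that common scale. The data are simulated, and taking the square root of the quadratic form is an assumption here, not a statement about how block reports distances.

```r
set.seed(1)
dat <- data.frame(
  g  = rep(c("a", "b"), each = 5),
  b1 = rnorm(10, mean = 50, sd = 10),
  b2 = rnorm(10, mean = 5, sd = 2)
)
X <- as.matrix(dat[, c("b1", "b2")])

# One covariance matrix, estimated from every unit regardless of group.
Sigma <- cov(X)

# Whitening with the Cholesky factor turns Mahalanobis distance into
# ordinary Euclidean distance, so dist() gives all pairs at once.
Z <- X %*% solve(chol(Sigma))
d_all <- as.matrix(dist(Z))

# Within-group distances are slices of the same matrix, hence comparable.
in_a <- which(dat$g == "a")
in_b <- which(dat$g == "b")
d_a <- d_all[in_a, in_a]
d_b <- d_all[in_b, in_b]
```

Had each group used its own covariance estimate, a distance of 1 in group a and a distance of 1 in group b would not mean the same thing.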
The distance = "mcd" and distance = "mve" options call cov.rob to calculate measures of multivariate spread robust to outliers. The distance = "mcd" option calculates the Minimum Covariance Determinant estimate (Rousseeuw 1985); the distance = "mve" option calculates the Minimum Volume Ellipsoid estimate (Rousseeuw and van Zomeren 1990). When distance = "mcd", the interquartile range on blocking variables should not be zero.
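The following sketch calls cov.rob from the MASS package directly on simulated data with one planted outlier, to show how a robust covariance estimate differs from the classical one. It assumes MASS is installed, and it illustrates the estimators only; how block passes its arguments to cov.rob is not shown here.

```r
library(MASS)   # provides cov.rob()

set.seed(42)
X <- cbind(b1 = rnorm(30), b2 = rnorm(30))
X[1, ] <- c(25, -25)   # a single gross outlier (made up)

classical  <- cov(X)
robust_mcd <- cov.rob(X, method = "mcd")$cov
robust_mve <- cov.rob(X, method = "mve")$cov

# Distance between two ordinary units under each estimate.
d_classical <- sqrt(mahalanobis(X[2, ], X[3, ], classical))
d_mcd       <- sqrt(mahalanobis(X[2, ], X[3, ], robust_mcd))
d_mve       <- sqrt(mahalanobis(X[2, ], X[3, ], robust_mve))
```

The outlier inflates the classical variances, which shrinks the classical distances between ordinary units; the robust estimates are far less affected.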
A user-specified distance matrix must have diagonals equal to 0, indicating zero distance between a unit and itself. Only the lower triangle of the matrix is used.
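A user-defined distance matrix with these properties can be built in base R, for example from Manhattan distances. The data and the name `d_user` are made up, and the final block call is shown only as a comment because it depends on a data set not defined here.

```r
# Made-up blocking variables for four units.
X <- cbind(b1 = c(1, 4, 6, 10), b2 = c(2, 2, 5, 9))

# Full k-by-k matrix of Manhattan distances between units.
d_user <- as.matrix(dist(X, method = "manhattan"))

# Checks matching the requirements stated above.
stopifnot(nrow(d_user) == nrow(X), ncol(d_user) == nrow(X))
stopifnot(all(diag(d_user) == 0))

# Only these entries (the lower triangle) are described as being used.
lower <- d_user[lower.tri(d_user)]

# Hypothetical call, not run here (dat would hold the same four units):
# block(dat, id.vars = "id", distance = d_user)
```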
If weight is a vector, then it is used as the diagonal of a square weighting matrix with non-diagonal elements equal to zero. The weighting is done by using as the Mahalanobis distance scaling matrix $\left(\left(\left(\mathrm{chol}(\Sigma)'\right)^{-1}\right)' W \left(\mathrm{chol}(\Sigma)'\right)^{-1}\right)^{-1}$, where $\mathrm{chol}(\Sigma)$ is the Cholesky decomposition of the usual variance-covariance matrix $\Sigma$ and $W$ is the weighting matrix. Differences should be smaller on covariates given higher weights.
If level.two = TRUE, then the best subunit block-matches in different units are found. E.g., provinces could be matched based on the most similar cities within them. All subunits in the data should have unique names. Thus, if subunits are numbered 1 to (number of subunits in unit) within each unit, then they should be renumbered, e.g., 1 to (total number of subunits in all units). level.two blocking is not currently implemented for algorithm = "optimal". Units with no blocked subunit are put into their own blocks. However, unblocked subunits within a unit that does have a blocked subunit are not put into their own blocks.
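The renumbering step can be done in one line of base R. The data frame below is made up; its column names follow the id.vars = c("id", "id2") convention used elsewhere on this page.

```r
# Subunits numbered within each unit, so the subunit ids repeat across units.
dat <- data.frame(
  id  = c("A", "A", "B", "B", "B"),
  id2 = c(1, 2, 1, 2, 3)
)
had_duplicates <- anyDuplicated(dat$id2) > 0

# Renumber from 1 to the total number of subunits in all units.
dat$id2 <- seq_len(nrow(dat))
```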
An example of a variable restriction is valid.var = "b2", valid.range = c(10,50), which requires that units in the same block be at least 10 units apart, but no more than 50 units apart, on variable "b2". As of version 0.5-3, variable restrictions are implemented in all algorithms except optimal. Note that employing a variable restriction may result in fewer than the maximum possible number of blocks. See https://www.ryantmoore.org/html/software.blockTools.html for details.
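The restriction in the example above amounts to a simple test on the absolute difference between two units. The sketch below applies that test to made-up values of "b2"; `within_valid_range` is a hypothetical helper, and the package's own implementation may differ in detail.

```r
# Hypothetical helper: is the gap on the restricted variable inside the range?
within_valid_range <- function(v1, v2, valid.range) {
  gap <- abs(v1 - v2)
  gap >= valid.range[1] & gap <= valid.range[2]
}

b2 <- c(12, 20, 45, 90)          # made-up values of the restricted variable
valid.range <- c(10, 50)

# Logical matrix: TRUE where two units are allowed to share a block.
allowed <- outer(b2, b2, within_valid_range, valid.range = valid.range)
diag(allowed) <- FALSE           # a unit is never paired with itself

# Number of permissible pairs under the restriction.
n_allowed_pairs <- sum(allowed[lower.tri(allowed)])
```

Here only three of the six possible pairs satisfy the restriction, which shows how a restriction can leave fewer blocks than the maximum possible.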
If namesCol = NULL, then “Unit 1”, “Unit 2”, ... are used. If level.two = FALSE, then namesCol should be of length n.tr; if level.two = TRUE, then namesCol should be of length 2*n.tr, and in the order shown in the example below.
King, Gary, Emmanuela Gakidou, Nirmala Ravishankar, Ryan T. Moore, Jason Lakin, Manett Vargas, Martha María Téllez-Rojo, Juan Eugenio Hernández Ávila, Mauricio Hernández Ávila, and Héctor Hernández Llamas. 2007. "A 'Politically Robust' Experimental Design for Public Policy Evaluation, with Application to the Mexican Universal Health Insurance Program". Journal of Policy Analysis and Management 26(3): 479-509.
Moore, Ryan T. 2012. "Multivariate Continuous Blocking to Improve Political Science Experiments." Political Analysis 20(4):460-479.
Rousseeuw, Peter J. 1985. "Multivariate Estimation with High Breakdown Point". Mathematical Statistics and Applications 8:283-297.
Rousseeuw, Peter J. and Bert C. van Zomeren. 1990. "Unmasking Multivariate Outliers and Leverage Points". Journal of the American Statistical Association 85(411):633-639.
assignment, diagnose
library(blockTools)
data(x100)
out <- block(x100, groups = "g", n.tr = 2, id.vars = c("id"),
block.vars = c("b1", "b2"), algorithm = "optGreedy",
distance = "mahalanobis", level.two = FALSE, valid.var = "b1",
valid.range = c(0, 500), verbose = TRUE)
# out$blocks contains 3 data frames
# To illustrate two-level blocking, with multiple level two units per level one unit:
for(i in (1:nrow(x100))){if((i %% 2) == 0){x100$id[i] <- x100$id[i-1]}}
out2 <- block(x100, groups = "g", n.tr = 2, id.vars = c("id", "id2"),
block.vars = c("b1", "b2"), algorithm = "optGreedy",
distance = "mahalanobis", level.two = TRUE, valid.var = "b1",
valid.range = c(0,500), namesCol = c("State 1", "City 1",
"State 2", "City 2"), verbose = TRUE)