Implements the methods described in King and Zeng (2006, 2007) for evaluating counterfactuals.
whatif(formula = NULL, data, cfact, range = NULL, freq = NULL, nearby = 1,
distance = "gower", miss = "list", choice = "both", return.inputs = FALSE,
return.distance = FALSE, mc.cores = detectCores(), ...)
formula: An optional formula without a dependent variable that is of class
"formula" and that follows standard R conventions for formulas, e.g. ~ x1 + x2.
Allows you to transform or otherwise re-specify combinations of the variables
in both data and cfact. To use this parameter, both data and cfact must be
coercible to data frames; the variables of both data and cfact must be
labeled; and all variables appearing in formula must also appear in both data
and cfact. Otherwise, errors are returned. The intercept is automatically
dropped. Default is NULL.
data: May take one of the following forms:
1. An R model output object, such as the output from calls to lm, glm, and
zelig. If it is not a zelig object, such an output object must be a list. It
must additionally have either a formula or terms component and either a data
or model component; if it does not, an error is returned. Of the latter,
whatif first looks for data, which should contain either the original data set
supplied as part of the model call (as in glm) or the name of this data set
(as in zelig), which is assumed to reside in the global environment. If data
does not exist, whatif then looks for model, which should contain the model
frame (as in lm). The intercept is automatically dropped from the extracted
observed covariate data set if the original model included one.
2. An \(n\)-by-\(k\) non-character (logical or numeric) matrix or data frame
of observed covariate data with \(n\) data points or units and \(k\)
covariates. All desired variable transformations and interaction terms should
be included in this set of \(k\) covariates unless formula is used instead to
produce them. However, an intercept should not be. Such a matrix may be
obtained by passing model output (e.g., output from a call to lm) to
model.matrix and excluding the intercept from the resulting matrix if one was
fit. Note that whatif will attempt to coerce data frames to their internal
numeric values. Hence, data frames should only contain logical, numeric, and
factor columns; character columns will lead to an error being returned.
3. A string. Either the complete path (including file name) of the file
containing the data or the path relative to your working directory. This file
should be a white space delimited text file. If it contains a header, you must
include a column of row names as discussed in the help file for the R function
read.table. The data in the file should be as otherwise described in (2).
Missing data is allowed and will be dealt with via the argument miss. It
should be flagged using R's standard representation for missing data, NA.
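As a minimal sketch of the second form described above, a covariate matrix can be built from a fitted model with model.matrix and the intercept column removed. The mtcars data set and the model formula are illustrative choices, not part of the original documentation.

```r
# Fit a linear model on a built-in data set (illustrative choice only).
fit <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)

# model.matrix() returns the design matrix, including an "(Intercept)" column.
X <- model.matrix(fit)

# Drop the intercept column, since the observed covariate data should not
# contain one. drop = FALSE keeps the result a matrix.
X <- X[, colnames(X) != "(Intercept)", drop = FALSE]

colnames(X)
```

The resulting matrix keeps the interaction term as its own column, which matches the requirement that transformations and interactions be included among the covariates.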
cfact: An R object or a string. If an R object, an \(m\)-by-\(k\)
non-character matrix or data frame of counterfactuals with \(m\)
counterfactuals and the same \(k\) covariates (in the same order) as in data.
However, if formula is used to select a subset of the \(k\) covariates, then
cfact may contain either only these \(j \leq k\) covariates or the complete
set of \(k\) covariates. An intercept should not be included as one of the
covariates. It will be automatically dropped from the counterfactuals
generated by Zelig if the original model contained one. Data frames will again
be coerced to their internal numeric values if possible. If a string, either
the complete path (including file name) of the file containing the
counterfactuals or the path relative to your working directory. This file
should be a white space delimited text file. See the discussion under data for
instructions on dealing with a header. All counterfactuals should be fully
observed: if you supply counterfactuals with missing data, they will be
list-wise deleted and a warning message will be printed to the screen.
range: An optional numeric vector of length \(k\), where \(k\) is the number
of covariates. Each element represents the range of the corresponding
covariate for use in calculating Gower distances. Use this argument when the
covariate data do not represent the population of interest, for example when
the sample was selected by stratification or experimental manipulation. By
default, the range of each covariate is calculated from the data (the
difference of its maximum and minimum values in the sample), which is
appropriate when a simple random sampling design was used. To supply your own
range for the \(j\)th covariate, set the \(j\)th element of the vector equal
to the desired range and all other elements equal to NA. Default is NULL.
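The following sketch shows one way such a range vector might be set up. The names my.range and effective.range, the value 10, and the fallback to sample ranges are assumptions made for illustration; how whatif combines the two internally may differ in detail.

```r
set.seed(1)
my.data <- matrix(rnorm(100 * 5), ncol = 5)

# Sample range of each covariate: maximum minus minimum.
sample.range <- apply(my.data, 2, function(x) max(x) - min(x))

# Supply a range for the 2nd covariate only; all other elements are NA.
my.range <- rep(NA_real_, ncol(my.data))
my.range[2] <- 10  # hypothetical population range

# Illustration only: NA entries fall back to the sample range.
effective.range <- ifelse(is.na(my.range), sample.range, my.range)
```

A vector like my.range is what would be passed as the range argument; effective.range is only shown to make the described default behaviour concrete.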
freq: An optional numeric vector of any positive length, the elements of which
comprise a set of distances. Used in calculating cumulative frequency
distributions for the distances of the data points from each counterfactual.
For each such distance and counterfactual, the cumulative frequency is the
fraction of observed covariate data points with distance to the counterfactual
less than or equal to the supplied distance value. The default varies with the
distance measure used. When the Gower distance measure is employed,
frequencies are calculated for the sequence of Gower distances from 0 to 1 in
increments of 0.05. When the Euclidean distance measure is employed,
frequencies are calculated for the sequence of Euclidean distances from the
minimum to the maximum observed distances in twenty equal increments, all
rounded to two decimal places. Default is NULL.
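The cumulative frequency described above can be sketched in a few lines of base R. The distance values in d are made up for illustration, and the code is a simplified stand-in for one counterfactual, not the package's implementation.

```r
# Hypothetical Gower distances from one counterfactual to five data points.
d <- c(0.12, 0.22, 0.37, 0.41, 0.83)

# Default grid described above for Gower distances: 0 to 1 in steps of 0.05.
grid <- seq(0, 1, by = 0.05)

# For each grid value, the fraction of data points at or below that distance.
cum.freq <- vapply(grid, function(g) mean(d <= g), numeric(1))

length(cum.freq)
```

The grid has 21 values, so one counterfactual yields one row of 21 cumulative frequencies.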
nearby: An optional scalar indicating which observed data points are
considered to be nearby (i.e., within nearby geometric variabilities of) the
counterfactuals. Used to calculate the summary statistic returned by the
function: the fraction of the observed data nearby each counterfactual. By
default, one geometric variability of the covariate data is used. For example,
setting nearby to 2 will identify the proportion of data points within two
geometric variabilities of a counterfactual. Default is 1.
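As a rough sketch of how nearby scales the threshold, the summary statistic for one counterfactual can be written as a simple proportion. The distances in d and the value of geom.var are placeholders rather than quantities computed by the package, and treating the comparison as inclusive is an assumption.

```r
# Hypothetical distances from one counterfactual to five data points.
d <- c(0.12, 0.22, 0.37, 0.41, 0.83)

# Placeholder for the geometric variability of the observed data
# (not computed here; the value is made up for illustration).
geom.var <- 0.2

# With nearby = 2, data points within two geometric variabilities count.
nearby <- 2
sum.stat <- mean(d <= nearby * geom.var)
sum.stat
```

Raising nearby widens the threshold, so the reported fraction can only stay the same or increase.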
distance: An optional string indicating which of two distance measures to
employ. The choices are either "gower", Gower's non-parametric distance
measure (\(G^2\)), which is suitable for both qualitative and quantitative
data; or "euclidian", squared Euclidean distance, which is only suitable for
quantitative data. The default is "gower".
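To make the two measures concrete, here is a minimal base R sketch. It assumes one common form of Gower's measure for numeric covariates (the mean absolute difference, with each covariate scaled by its range); the helper names gower.dist and sq.euclid are hypothetical, and the package's own computation may differ in detail.

```r
# Hypothetical helper: range-scaled mean absolute difference (Gower-style).
# Returns an m-by-n matrix: row i holds distances from counterfactual i.
gower.dist <- function(cfact, data,
                       rng = apply(data, 2, function(x) diff(range(x)))) {
  t(apply(cfact, 1, function(cf) colMeans(abs(t(data) - cf) / rng)))
}

# Hypothetical helper: squared Euclidean distance, same output layout.
sq.euclid <- function(cfact, data) {
  t(apply(cfact, 1, function(cf) colSums((t(data) - cf)^2)))
}

data  <- rbind(c(0, 0), c(1, 2), c(2, 4))  # three observed data points
cfact <- rbind(c(1, 1))                    # one counterfactual

g <- gower.dist(cfact, data)
e <- sq.euclid(cfact, data)
```

Because each covariate is divided by its range, the Gower-style distance is unaffected by the units of measurement, whereas the squared Euclidean distance is not.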
miss: An optional string indicating the strategy for dealing with missing data
in the observed covariate data set. whatif supports two possible missing data
strategies: "list", list-wise deletion of missing cases; and "case", ignoring
missing data case-by-case. Note that if "case" is selected, cases with missing
values are deleted listwise for the convex hull test and for computing
Euclidean distances, but pairwise deletion is used in computing the Gower
distances to maximally use available information. The user is strongly
encouraged to treat missing data using specialized tools such as Amelia prior
to feeding the data to whatif. Default is "list".
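The difference between the two strategies can be sketched in base R. The matrix X and counterfactual cf are made up, and the pairwise computation is only one plausible reading of "pairwise deletion" for a range-scaled Gower-style distance; it is not taken from the package source.

```r
X <- rbind(c(1, 2), c(NA, 3), c(4, 5))  # second row has a missing value

# List-wise deletion: drop every row with at least one missing value.
X.list <- X[complete.cases(X), , drop = FALSE]

# Pairwise-style deletion (assumed form): for each data point, average the
# range-scaled absolute differences over the covariates that are observed.
cf  <- c(2, 4)
rng <- apply(X, 2, function(x) diff(range(x, na.rm = TRUE)))
d.pair <- apply(X, 1, function(x) mean(abs(x - cf) / rng, na.rm = TRUE))
```

Under list-wise deletion the second row contributes nothing, while the pairwise-style calculation still yields a distance for it from its one observed covariate.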
choice: An optional string indicating which analyses to undertake. The options
are either "hull", only perform the convex hull membership test; "distance",
do not perform the convex hull test but do everything else, such as
calculating the distance between each counterfactual and data point; or
"both", undertake both the convex hull test and the distance calculations
(i.e., do everything). Default is "both".
return.inputs: A Boolean; should the processed observed covariate and
counterfactual data matrices on which all whatif computations are performed be
returned? Processing refers to internal whatif operations such as the
subsetting of covariates via formula, the deletion of cases with missing
values, and the coercion of data frames to numeric matrices. Primarily
intended for diagnostic purposes. If TRUE, these matrices are returned as a
list. Default is FALSE.
return.distance: A Boolean; should the matrix of distances between each
counterfactual and data point be returned? If TRUE, this matrix is returned as
part of the output; if FALSE, it is not. Default is FALSE due to the large
size that this matrix may attain.
mc.cores: The number of cores to use for the convex hull test, i.e. at most
how many child processes will be run simultaneously. Must be at least one, and
parallelization requires at least two cores. The default is set by
detectCores.
...: Further arguments passed to and from other methods.
An object of class "whatif", a list consisting of the following six or seven elements:
The original call to whatif.
A list with two elements, data and cfact. Only present if return.inputs was
set equal to TRUE in the call to whatif. The first element is the processed
observed covariate data matrix on which all whatif computations were
performed. The second element is the processed counterfactual data matrix.
A logical vector of length \(m\), where \(m\) is the number of
counterfactuals. Each element of the vector is TRUE if the corresponding
counterfactual is in the convex hull and FALSE otherwise.
An \(m\)-by-\(n\) numeric matrix, where \(m\) is the number of
counterfactuals and \(n\) is the number of data points (units). Only present
if return.distance was set equal to TRUE in the call to whatif. The
\([i, j]\)th entry of the matrix contains the distance between the \(i\)th
counterfactual and the \(j\)th data point.
A scalar. The geometric variability of the observed covariate data.
A numeric vector of length \(m\), where \(m\) is the number of
counterfactuals. The \(i\)th element contains the summary statistic for the
\(i\)th counterfactual. This summary statistic is the fraction of data points
with distances to the counterfactual less than the threshold set by the
argument nearby, which by default is the geometric variability of the
covariates.
A numeric matrix. By default, the matrix has dimension \(m\)-by-21, where
\(m\) is the number of counterfactuals; however, if you supplied your own
frequencies via the argument freq, the matrix has dimension \(m\)-by-\(f\),
where \(f\) is the length of freq. Each row of the matrix contains the
cumulative frequency distribution for the corresponding counterfactual
calculated using either the distance measure-specific default set of distance
values or the set that you supplied (see the discussion under the argument
freq). Hence, the \([i, j]\)th entry of the matrix is the fraction of data
points with distances to the \(i\)th counterfactual less than or equal to the
value represented by the \(j\)th column. The column names contain these
values.
This function is the primary tool for evaluating your counterfactuals. Specifically, it:
1. Determines whether or not your counterfactuals are in the convex hull of the observed covariate data.
2. Computes the distance of your counterfactuals from each of the \(n\) observed covariate data points. The default distance function used is Gower's non-parametric measure.
3. Computes a summary statistic for each counterfactual based on the distances in (2): the fraction of observed covariate data points with distances to your counterfactual less than a value you supply. By default, this value is taken to be the geometric variability of the observed data.
4. Computes the cumulative frequency distribution of each counterfactual for the distances in (2) using values that you supply. By default, Gower distances from 0 to 1 in increments of 0.05 are used.
King, Gary and Langche Zeng. 2006. "The Dangers of Extreme Counterfactuals." Political Analysis 14 (2). Available from https://gking.harvard.edu.
King, Gary and Langche Zeng. 2007. "When Can History Be Our Guide? The Pitfalls of Counterfactual Inference." International Studies Quarterly 51 (March). Available from https://gking.harvard.edu.
plot.whatif, summary.whatif, print.whatif, print.summary.whatif
# NOT RUN {
## Load the package providing whatif() (assumed here to be WhatIf)
library(WhatIf)

## Create example data sets and counterfactuals
set.seed(1)  # make the random draws reproducible
my.cfact <- matrix(rnorm(3*5), ncol = 5)
my.data <- matrix(rnorm(100*5), ncol = 5)

## Evaluate counterfactuals
my.result <- whatif(data = my.data, cfact = my.cfact, mc.cores = 1)

## Evaluate counterfactuals and supply own Gower distances for
## cumulative frequency distributions
my.result <- whatif(cfact = my.cfact, data = my.data,
                    freq = c(0, .25, .5, 1, 1.25, 1.5), mc.cores = 1)
# }