gendata.ep: Function To Simulate Ecological and Survey Data For Use in Testing And Analyzing Other Functions in Package

Description

This function generates simulated ecological data, i.e., data in the form of contingency tables in which the row and column totals but none of the internal cell counts are observed. At the user's option, data from simulated surveys of some of the 'units' (in voting parlance, 'precincts') that gave rise to the contingency tables are also produced.

Usage

gendata.ep(nprecincts = 175,
           nrowcat = 3,
           ncolcat = 3,
           colcatnames = c("Dem", "Rep", "Abs"),
           mu0 = c(-.6, -2.05, -1.7, -.2, -1.45, -1.45),
           rowcatnames = c("bla", "whi", "his", "asi"),
           alpha = c(.35, .45, .2, .1),
           housing.seg = 1,
           nprecincts.ep = 40,
           samplefrac.ep = 1/14,
           K0 = NULL,
           nu0 = 12,
           Psi0 = NULL,
           lambda = 1000,
           dispersion.low.lim = 1,
           dispersion.up.lim = 1,
           outfile=NULL,
           his.agg.bias.vec = c(0,0),
           HerfInvexp = 3.5,
           HerfNoInvexp = 3.5,
           HerfReasexp = 2)

Arguments

nprecincts

positive integer: The number of contingency tables (precincts) in the simulated dataset.

nrowcat

integer > 1: The number of rows in each of the contingency tables.

ncolcat

integer > 1: The number of columns in each of the contingency tables.

rowcatnames

character vector of length nrowcat: Names of rows in each contingency table.

colcatnames

character vector of length ncolcat: Names of columns in each contingency table.

alpha

vector of length nrowcat: initial parameters to a Dirichlet distribution used to generate each contingency table's row fractions.

housing.seg

scalar > 0: multiplied by alpha to generate the final parameters of the Dirichlet distribution used to generate each contingency table's row fractions.

mu0

vector of length (nrowcat * (ncolcat - 1)): The mean of the multivariate normal hyperprior at the top level of the hierarchical model from which the data are simulated. See Details.

K0

square matrix of dimension (nrowcat * (ncolcat - 1)): the covariance matrix of the multivariate normal hyperprior at the top level of the hierarchical model from which the data are simulated. See Details.

nu0

scalar > 0: the degrees of freedom for the Inv-Wishart hyperprior from which the \(\Sigma\) matrix will be drawn.

Psi0

square matrix of dimension (nrowcat * (ncolcat - 1)): scale matrix for the Inv-Wishart hyperprior from which the \(\Sigma\) matrix will be drawn.

lambda

scalar > 0: initial parameter of the Poisson distribution from which the number of voters in each precinct will be drawn.

dispersion.low.lim

scalar > 0 and no greater than dispersion.up.lim: lower limit of a draw from runif() that is multiplied by lambda to set the parameter of the Poisson distribution that determines the number of voters in each precinct.

dispersion.up.lim

scalar no less than dispersion.low.lim: upper limit of a draw from runif() that is multiplied by lambda to set the parameter of the Poisson distribution that determines the number of voters in each precinct.

outfile

string ending in ".Rdata": filepath and name of object; if non-NULL, the object returned by this function will be saved to the location specified by outfile.

his.agg.bias.vec

vector of length 2: only implemented for nrowcat = 3 and ncolcat = 3; if nonzero, induces aggregation bias in the simulated data. See Details.

nprecincts.ep

non-negative integer less than nprecincts: number of contingency tables (precincts) to be included in the simulated survey sample (ep for "exit poll").

samplefrac.ep

fraction (real number between 0 and 1): fraction of individual units (voters) within each sampled contingency table (precinct) included in the survey sample.

HerfInvexp

scalar: exponent used to generate inverted quasi-Herfindahl weights used to sample contingency tables (precincts) for inclusion in a sample survey. See Details.

HerfNoInvexp

scalar: same as HerfInvexp except the quasi-Herfindahl weights are not inverted. See Details.

HerfReasexp

scalar: same as HerfInvexp, for a separate sample survey. See Details.

Value

A list with the following elements.

GQdata

Matrix of dimension nprecincts by (nrowcat + ncolcat): The simulated (observed) ecological data, meaning the row and column totals in the contingency tables. May be passed as the data argument in Tune, Analyze, TuneWithExitPoll, and AnalyzeWithExitPoll.

EPInv

List of length 3: returnmat.ep, the first element in the list, is a matrix that may be passed as the exitpoll argument in TuneWithExitPoll and AnalyzeWithExitPoll. See Details. ObsData is a dataframe that may be used as the data argument in the survey package. sampprecincts.ep is a vector detailing the row numbers of GQdata (meaning the contingency tables) that were included in the EPInv survey (exit poll). See Details for an explanation of the weights used to select the contingency tables for inclusion in the EPInv survey (exit poll).

EPNoInv

List of length 3: Contains the same elements as EPInv. See Details for an explanation of weights used to select the contingency tables for inclusion in the EPNoInv survey (exit poll).

EPReas

List of length 3: Contains the same elements as EPInv. See Details for an explanation of weights used to select the contingency tables for inclusion in the EPReas survey (exit poll).

omega.matrix

Matrix of dimension nprecincts by (nrowcat * (ncolcat-1)): The matrix of draws from the multivariate normal distribution at the second level of the hierarchical model giving rise to GQdata. These values undergo an inverse-stacked multidimensional logistic transformation to produce contingency table row probability vectors.

interior.tables

List of length nprecincts: Each element of the list is a full (meaning all interior cells are filled in) contingency table.

mu

vector of length nrowcat * (ncolcat-1): the \(\mu\) vector drawn at the top level of the hierarchical model giving rise to GQdata. See Details.

Sigma

square matrix of dimension nrowcat * (ncolcat-1): the covariance matrix drawn at the top level of the hierarchical model giving rise to GQdata. See Details.

Sigma.diag

the output of diag(Sigma).

Sigma.corr

the output of cov2cor(Sigma).

sim.check.vec

vector: the true values of the parameters generated by Analyze and AnalyzeWithExitPoll in the same order as the parameters are produced by those two functions. This vector is useful in assessing the coverage of intervals from the posterior draws from Analyze and AnalyzeWithExitPoll.

Details

This function simulates data from the ecological inference model outlined in Greiner & Quinn (2009). At the user's option (by setting nprecincts.ep to an integer greater than 0), the function generates three survey samples from the simulated dataset. The specifics of the function's operation are as follows.

First, the function simulates the total number of individual units (voters) in each contingency table (precinct) from a Poisson distribution with parameter lambda * runif(1, dispersion.low.lim, dispersion.up.lim). Next, for each table, the function simulates the vector of fractions of units (voters) in each table (precinct) row. The fractions are simulated from a Dirichlet distribution with parameter vector housing.seg * alpha. The row fractions are multiplied by the total number of units (voters), and the resulting vector is rounded to produce contingency table row counts for each table.
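The first step can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's own code: the object names, the small number of precincts, and the dispersion limits are chosen for the example, and the Dirichlet draw is implemented by normalizing gamma variates rather than by whatever routine the package uses.

```r
set.seed(1)

nprecincts  <- 5                  # illustrative; the function default is 175
lambda      <- 1000
disp_low    <- 0.5                # illustrative; both limits default to 1
disp_up     <- 1.5
alpha       <- c(.35, .45, .2)    # one entry per row category
housing_seg <- 1

# Precinct sizes: Poisson draws whose rate is lambda times a uniform draw.
rates    <- lambda * runif(nprecincts, disp_low, disp_up)
n_voters <- rpois(nprecincts, rates)

# One Dirichlet(a) draw, obtained by normalizing independent gamma variates.
rdirichlet_one <- function(a) {
  g <- rgamma(length(a), shape = a, rate = 1)
  g / sum(g)
}

# Row fractions: one Dirichlet(housing_seg * alpha) draw per precinct.
row_fracs <- t(replicate(nprecincts, rdirichlet_one(housing_seg * alpha)))

# Row counts: fractions times precinct size, rounded to integers.
row_counts <- round(row_fracs * n_voters)
```

Because of the rounding, the row counts of a precinct can differ from the Poisson draw by a unit; the sketch makes no attempt to correct this.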

Next, a vector \(\mu\) is simulated from a multivariate normal with mean mu0 and covariance matrix K0. A covariance matrix \(\Sigma\) is simulated from an Inv-Wishart with nu0 degrees of freedom and scale matrix Psi0.
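A minimal base-R sketch of these two top-level draws is given below. The values of K0 and Psi0 are assumptions made for the example (both default to NULL in the function, and the values it then substitutes are not stated here), and the Inv-Wishart parameterization of the package may differ from the standard one used in the sketch.

```r
set.seed(2)

mu0  <- c(-.6, -2.05, -1.7, -.2, -1.45, -1.45)
d    <- length(mu0)        # nrowcat * (ncolcat - 1) = 6
nu0  <- 12
K0   <- diag(0.05, d)      # assumed value, for illustration only
Psi0 <- diag(d)            # assumed value, for illustration only

# mu ~ N(mu0, K0). chol() returns upper-triangular R with t(R) %*% R = K0,
# so t(R) %*% z has covariance K0 when z is standard normal.
mu <- as.vector(mu0 + t(chol(K0)) %*% rnorm(d))

# Sigma ~ Inv-Wishart(nu0, Psi0): if W ~ Wishart(nu0, solve(Psi0)),
# then solve(W) ~ Inv-Wishart(nu0, Psi0).
W     <- rWishart(1, df = nu0, Sigma = solve(Psi0))[, , 1]
Sigma <- solve(W)
```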

Next, nprecincts vectors are drawn from \(N(\mu, \Sigma)\). Each of these draws undergoes an inverse-stacked multidimensional logistic transformation to produce a set of nrowcat probability vectors (each of which sums to one) for nrowcat multinomial distributions, one for each row in that contingency table. Next, the nrowcat multinomial values, which represent the true (and in real life, unobserved) internal cell counts, are drawn from the relevant row counts and these probability vectors. The column totals are calculated via summation.
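The transformation and the multinomial draws for a single contingency table might look like the sketch below. Two assumptions are made that the text above does not state: the last column is taken as the reference category of each row's logistic transformation, and the elements of each draw are stacked row by row (the first ncolcat - 1 elements belong to the first row, and so on). The function name inv_stacked_logit, the row counts, and the use of mu0 as a stand-in for one multivariate normal draw are all hypothetical.

```r
set.seed(3)

nrowcat <- 3
ncolcat <- 3

# Map a stacked vector of log-odds to an nrowcat x ncolcat matrix of
# probabilities whose rows each sum to one (last column = reference).
inv_stacked_logit <- function(omega, nrowcat, ncolcat) {
  om <- matrix(omega, nrow = nrowcat, ncol = ncolcat - 1, byrow = TRUE)
  ex <- cbind(exp(om), 1)    # the reference column has log-odds 0
  ex / rowSums(ex)
}

# Stand-in for one draw from N(mu, Sigma).
omega <- c(-.6, -2.05, -1.7, -.2, -1.45, -1.45)
theta <- inv_stacked_logit(omega, nrowcat, ncolcat)

# Hypothetical row counts for this precinct.
row_counts <- c(300, 450, 250)

# One multinomial draw per row gives the interior cell counts.
cells <- t(sapply(seq_len(nrowcat), function(r) {
  as.vector(rmultinom(1, size = row_counts[r], prob = theta[r, ]))
}))

# Column totals follow by summation.
col_totals <- colSums(cells)
```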

If nprecincts.ep is greater than 0, three simulated surveys (exit polls) are drawn. All three select contingency tables (precincts) using weights that are a function of the composition of the row totals. Specifically, the row fractions are raised to a power q and then summed (when q = 2 this calculation is known in antitrust law as a Herfindahl index). For one of the three surveys (exit polls) gendata.ep generates, denoted EPNoInv, these quasi-Herfindahl indices are the weights. For the other two, denoted EPInv and EPReas, the sample weights are the reciprocals of these quasi-Herfindahl indices. The former method tends to weight contingency tables (precincts) in which one row dominates the table higher than contingency tables (precincts) in which row fractions are close to the same. In voting parlance, precincts in which one racial group dominates are more likely to be sampled than racially mixed precincts. The latter method, in which the sample weights are reciprocated, weights contingency tables in which row fractions are similar more highly; in voting parlance, mixed-race precincts are more likely to be sampled.

For example, suppose nrowcat = 3, HerfInvexp = 3.5, HerfReasexp = 2, and HerfNoInvexp = 3.5. Consider contingency table P1 with row counts (300, 300, 300) and contingency table P2 with row counts (950, 25, 25). Then:

Row fractions: The corresponding row fractions are (300/900, 300/900, 300/900) = (.33, .33, .33) and (950/1000, 25/1000, 25/1000) = (.95, .025, .025).

EPInv weights: EPInv would assign P1 and P2 weights as follows: \(1/(.33^{3.5} + .33^{3.5} + .33^{3.5}) = 16.1\) and \(1/(.95^{3.5} + .025^{3.5} + .025^{3.5}) = 1.2\).

EPReas weights: EPReas would assign weights as follows: \(1/(.33^{2} + .33^{2} + .33^{2}) = 3.1\) and \(1/(.95^{2} + .025^{2} + .025^{2}) = 1.1\).

EPNoInv weights: EPNoInv would assign weights as follows: \(.33^{3.5} + .33^{3.5} + .33^{3.5} = .062\) and \(.95^{3.5} + .025^{3.5} + .025^{3.5} = .84\).
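The arithmetic of this example can be checked with a few lines of R. The helper quasi_herf is a hypothetical name introduced for the sketch, and the row fractions are entered in the rounded form used above (.33 rather than 1/3); with exact thirds the P1 weights would be slightly different.

```r
# Quasi-Herfindahl index: row fractions raised to the power q, then summed.
quasi_herf <- function(fracs, q) sum(fracs^q)

p1 <- c(.33, .33, .33)      # P1, row counts (300, 300, 300), rounded
p2 <- c(.95, .025, .025)    # P2, row counts (950, 25, 25)

weights <- rbind(
  EPInv   = c(P1 = 1 / quasi_herf(p1, 3.5), P2 = 1 / quasi_herf(p2, 3.5)),
  EPReas  = c(P1 = 1 / quasi_herf(p1, 2),   P2 = 1 / quasi_herf(p2, 2)),
  EPNoInv = c(P1 = quasi_herf(p1, 3.5),     P2 = quasi_herf(p2, 3.5))
)

print(signif(weights, 2))

# One plausible use of such weights (an assumption, not necessarily the
# package's procedure): sample precincts with probability proportional
# to their weight.
set.seed(4)
sampled <- sample(colnames(weights), size = 1, prob = weights["EPInv", ])
```

Under EPInv and EPReas the mixed table P1 receives the larger weight, while under EPNoInv the dominated table P2 does.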

For each of the three simulated surveys (EPInv, EPReas, and EPNoInv), gendata.ep returns a list of length three. The first element of the list, returnmat.ep, is a matrix of dimension nprecincts by (nrowcat * ncolcat) suitable for passing to TuneWithExitPoll and AnalyzeWithExitPoll. Its rows are aligned with those of GQdata: the first row of returnmat.ep and the first row of GQdata both contain information from the same contingency table, the second rows likewise, and so on. Each row of returnmat.ep holds the counts from the sample of that contingency table in vectorized row-major format, as required for TuneWithExitPoll and AnalyzeWithExitPoll.
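To make "vectorized row-major format" concrete, the sketch below flattens one hypothetical table of sampled counts into a single row of nrowcat * ncolcat values. The counts and the element names are invented for illustration; the actual column names of returnmat.ep, and what the rows of unsampled precincts contain, are not specified here.

```r
# Hypothetical counts from the survey sample of one precinct.
tab <- matrix(c(12,  3, 5,
                 4, 20, 6,
                 2,  1, 7),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("bla", "whi", "his"),
                              c("Dem", "Rep", "Abs")))

# R stores matrices column by column, so transpose before flattening
# to obtain row-major order: all of row 1, then row 2, then row 3.
row_major <- as.vector(t(tab))

# Illustrative labels in the same row-major order.
names(row_major) <- as.vector(t(outer(rownames(tab), colnames(tab),
                                      paste, sep = "_")))

print(row_major)
```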

If nrowcat = ncolcat = 3, then the user may set his.agg.bias.vec to be nonzero. This will introduce aggregation bias into the data by making the probability vector of the second row of each contingency table a function of the fractional composition of the third row. In voting parlance, if the rows are black, white, and Hispanic, the white voting behavior will be a function of the percent Hispanic in each precinct. For example, if his.agg.bias.vec = c(1.7, -3), and if the fraction Hispanic in precinct i is \(X_{hi}\), then in the ith precinct \(\mu_i[3]\) is set to mu0[3] + \(1.7 \times X_{hi}\), while \(\mu_i[4]\) is set to mu0[4] + \(-3 \times X_{hi}\). This feature allows testing of the ecological inference model with aggregation bias.
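The adjustment in this example can be written out as below. This is only a sketch of the formula just given: the Hispanic fractions are hypothetical, and the object names are not those used inside gendata.ep.

```r
mu0              <- c(-.6, -2.05, -1.7, -.2, -1.45, -1.45)
his_agg_bias_vec <- c(1.7, -3)

# Hypothetical fraction Hispanic in three precincts.
x_his <- c(0.05, 0.20, 0.60)

# Start every precinct at mu0 (one row per precinct) ...
mu_i <- matrix(mu0, nrow = length(x_his), ncol = length(mu0), byrow = TRUE)

# ... then shift elements 3 and 4, which belong to the second (white) row
# of the table, by a multiple of the precinct's Hispanic fraction.
mu_i[, 3] <- mu0[3] + x_his * his_agg_bias_vec[1]
mu_i[, 4] <- mu0[4] + x_his * his_agg_bias_vec[2]

print(mu_i)
```

With these values, elements 3 and 4 move in opposite directions as the Hispanic fraction rises, so the white row's probabilities vary systematically across precincts.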

References

D. James Greiner & Kevin M. Quinn. 2009. "R x C Ecological Inference: Bounds, Correlations, Flexibility, and Transparency of Assumptions." J.R. Statist. Soc. A 172:67-81.

Examples


# NOT RUN {
SimData <- gendata.ep()    #  simulated data
FormulaString <- "Dem, Rep, Abs ~ bla, whi, his"
EPInvTune <-  TuneWithExitPoll(fstring = FormulaString,
                               data = SimData$GQdata,
                               exitpoll=SimData$EPInv$returnmat.ep,
                               num.iters = 10000,
                               num.runs = 15)
EPInvChain1 <- AnalyzeWithExitPoll(fstring = FormulaString,
                                   data = SimData$GQdata,
                                   exitpoll=SimData$EPInv$returnmat.ep,
                                   num.iters = 2000000,
                                   burnin = 200000,
                                   save.every = 2000,
                                   rho.vec = EPInvTune$rhos,
                                   print.every = 20000,
                                   debug = 1,
                                   keepTHETAS = 0,
                                   keepNNinternals = 0)
EPInvChain2 <- AnalyzeWithExitPoll(fstring = FormulaString,
                                   data = SimData$GQdata,
                                   exitpoll=SimData$EPInv$returnmat.ep,
                                   num.iters = 2000000,
                                   burnin = 200000,
                                   save.every = 2000,
                                   rho.vec = EPInvTune$rhos,
                                   print.every = 20000,
                                   debug = 1,
                                   keepTHETAS = 0,
                                   keepNNinternals = 0)
EPInvChain3 <- AnalyzeWithExitPoll(fstring = FormulaString,
                                   data = SimData$GQdata,
                                   exitpoll=SimData$EPInv$returnmat.ep,
                                   num.iters = 2000000,
                                   burnin = 200000,
                                   save.every = 2000,
                                   rho.vec = EPInvTune$rhos,
                                   print.every = 20000,
                                   debug = 1,
                                   keepTHETAS = 0,
                                   keepNNinternals = 0)
library(coda)              #  provides mcmc.list()
EPInv <- mcmc.list(EPInvChain1, EPInvChain2, EPInvChain3)
# }
