This function generates a random data.frame together with a missingness mechanism that is used to impose a missingness pattern on it. The function is intended primarily for use in simulations.
rdata.frame(N = 1000,
            restrictions = c("none", "MARish", "triangular", "stratified", "MCAR"),
            last_CPC = NA_real_, strong = FALSE, pr_miss = .25, Sigma = NULL,
            alpha = NULL, experiment = FALSE,
            treatment_cor = c(rep(0, n_full - 1), rep(NA, 2 * n_partial)),
            n_full = 1, n_partial = 1, n_cat = NULL,
            eta = 1, df = Inf, types = "continuous", estimate_CPCs = TRUE)
N: integer indicating the number of observations
restrictions: character string indicating what restrictions to impose on the missing data mechanism; see the Details section
last_CPC: a numeric scalar between \(-1\) and \(1\) exclusive, or NA_real_ (the default). If not NA_real_, then this value will be used to construct the correlation matrix from which the data are drawn. This option is useful if restrictions is "triangular" or "stratified", in which case the degree to which last_CPC is not zero causes a violation of the Missing-At-Random assumption that is confined to the last of the partially observed variables.
strong: integer among 0, 1, and 2 indicating how strong to make the instruments when there are multiple partially observed variables, in which case the missingness indicators for each partially observed variable can be used as instruments when predicting missingness on other partially observed variables. Only applies when restrictions = "triangular".
pr_miss: numeric scalar on the (0,1) interval or vector of length n_partial indicating the proportion of observations that are missing on partially observed variables
Sigma: either NULL (the default) or a correlation matrix of appropriate order for the variables (including the missingness indicators). By default, such a matrix is generated at random.
alpha: either NULL, NA, or a numeric vector of appropriate length that governs the skew of a multivariate skewed normal distribution; see rmsn. The appropriate length is n_full - 1 + 2 * n_partial iff none of the variable types is nominal. If some of the variable types are nominal, then the appropriate length is n_full - 1 + 2 * n_partial + sum(n_cat) - length(n_cat). If NULL, alpha is taken to be zero, in which case the data-generating process has no skew. If NA, alpha is drawn from rt with df degrees of freedom.
experiment: logical indicating whether to simulate a randomized experiment
treatment_cor: numeric vector of appropriate length indicating the correlations between the treatment variable and the other variables, which is only relevant if experiment = TRUE. The appropriate length is n_full - 1 + 2 * n_partial iff none of the variable types is nominal. If some of the variable types are nominal, then the appropriate length is n_full - 1 + 2 * n_partial + sum(n_cat) - length(n_cat). If treatment_cor is of length one and is zero, then it will be recycled to the appropriate length. The treatment variable should be uncorrelated with intended covariates and uncorrelated with missingness on intended covariates. If any elements of treatment_cor are NA, then those elements will be replaced with random draws. Note that the order of the random variables is: all fully observed variables, all partially observed but not nominal variables, all partially observed nominal variables, all missingness indicators for partially observed variables.
n_full: integer indicating the number of fully observed variables
n_partial: integer indicating the number of partially observed variables
n_cat: either NULL or an integer vector (possibly of length one) indicating the number of categories in each partially observed nominal or ordinal variable; see the Details section
eta: positive numeric scalar that serves as a hyperparameter in the data-generating process. The default value of 1 implies that the correlation matrix among the variables is jointly uniformly distributed, using essentially the same logic as in the clusterGeneration package.
df: positive numeric scalar indicating the degrees of freedom for the (possibly skewed) multivariate t distribution, which defaults to Inf, implying a (possibly skewed) multivariate normal distribution
types: a character vector (possibly of length one, in which case it is recycled) indicating the type for each fully observed and partially observed variable, which currently can be among "continuous", "count", "binary", "treatment" (which is binary), "ordinal", "nominal", "proportion", "positive". See the Details section. Unique abbreviations are acceptable.
estimate_CPCs: a logical indicating whether the canonical partial correlations between the partially observed variables and the latent missingnesses should be estimated. The default is TRUE, but considerable wall time can be saved by switching it to FALSE when there are many partially observed variables.
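The length rule stated for alpha and treatment_cor can be checked before calling the function. The following is only a sketch in base R: `expected_length` and its `any_nominal` argument are hypothetical names introduced here for illustration, not part of the package.

```r
# Hypothetical helper (not part of the package): the vector length described
# above for alpha and treatment_cor.
expected_length <- function(n_full, n_partial, n_cat = NULL, any_nominal = FALSE) {
  len <- n_full - 1 + 2 * n_partial
  if (any_nominal) {
    # each nominal variable contributes extra elements via its category count
    len <- len + sum(n_cat) - length(n_cat)
  }
  len
}

# No nominal variables: 2 - 1 + 2 * 2
expected_length(n_full = 2, n_partial = 2)

# The default treatment_cor from the usage has this same length
n_full <- 2; n_partial <- 2
default_cor <- c(rep(0, n_full - 1), rep(NA, 2 * n_partial))
length(default_cor) == expected_length(n_full, n_partial)

# Two nominal variables with 3 and 4 categories: 1 - 1 + 4 + 7 - 2
expected_length(n_full = 1, n_partial = 2, n_cat = c(3, 4), any_nominal = TRUE)
```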
A list with the following elements:
true: a data.frame containing no NA values
obs: a data.frame derived from the previous with some NA values that represents a dataset that could be observed
empirical_CPCs: a numeric vector of empirical Canonical Partial Correlations, which should differ only randomly from zero iff MAR = TRUE and the data-generating process is multivariate normal
L: a Cholesky factor of the correlation matrix used to generate the true data
In addition, if alpha is not NULL, then the following elements are also included:
alpha: the alpha vector utilized
sn_skewness: the skewness of the multivariate skewed normal distribution in the population; note that this value is only an approximation of the skewness when df < Inf
sn_kurtosis: the kurtosis of the multivariate skewed normal distribution in the population; note that this value is only an approximation of the kurtosis when df < Inf
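To make the role of the returned Cholesky factor concrete, here is a small base-R sketch. It assumes that L is lower triangular, so that the correlation matrix is recovered with tcrossprod(L); the 3 x 3 matrix below is made up for illustration and does not come from the function.

```r
# A made-up correlation matrix, for illustration only
Sigma <- matrix(c(1.0, 0.3, 0.2,
                  0.3, 1.0, 0.5,
                  0.2, 0.5, 1.0), nrow = 3, byrow = TRUE)

# chol() returns the upper-triangular factor, so transpose it
# to obtain a lower-triangular L with L %*% t(L) equal to Sigma
L <- t(chol(Sigma))

# tcrossprod(L) computes L %*% t(L) and recovers Sigma
isTRUE(all.equal(tcrossprod(L), Sigma))

# Correlated standard normal draws: each row of z %*% t(L) has correlation Sigma
set.seed(123)
z <- matrix(rnorm(1000 * 3), ncol = 3)
x <- z %*% t(L)
round(cor(x), 2)
```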
By default, the correlation matrix among the variables and missingness indicators is intended to be close to uniform, although it is often not possible to achieve exactly. If restrictions = "none", the data will be Not Missing At Random (NMAR). If restrictions = "MARish", the departure from Missing At Random (MAR) will be minimized via a call to optim, but generally will not fully achieve MAR. If restrictions = "triangular", the MAR assumption will hold but the missingness of each partially observed variable will only depend on the fully observed variables and the other latent missingness indicators. If restrictions = "stratified", the MAR assumption will hold but the missingness of each partially observed variable will only depend on the fully observed variables. If restrictions = "MCAR", the Missing Completely At Random (MCAR) assumption holds, which is much more restrictive than MAR.
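The difference between MCAR and missingness that depends only on fully observed variables can be illustrated with a rough base-R sketch. This is not the function's own algorithm: the variable names, the 0.8 loading, and the thresholding of a latent normal variable are all assumptions made for illustration.

```r
set.seed(1)
N <- 1000
pr_miss <- 0.25

x <- rnorm(N)             # fully observed variable
y <- 0.5 * x + rnorm(N)   # variable that will become partially observed

# MCAR: the latent missingness variable is independent of x and y
latent_mcar <- rnorm(N)

# Missingness driven only by the fully observed x (so MAR holds);
# the loading 0.8 is arbitrary, and the latent variable keeps unit variance
latent_mar <- 0.8 * x + sqrt(1 - 0.8^2) * rnorm(N)

# Observations whose latent value falls below this cutoff are set to NA,
# which yields roughly pr_miss missingness in both cases
cutoff <- qnorm(pr_miss)
y_mcar <- ifelse(latent_mcar < cutoff, NA, y)
y_mar  <- ifelse(latent_mar  < cutoff, NA, y)

mean(is.na(y_mcar))                 # close to pr_miss
mean(is.na(y_mar))                  # close to pr_miss
cor(x, as.numeric(is.na(y_mcar)))   # near zero under MCAR
cor(x, as.numeric(is.na(y_mar)))    # clearly negative: low x is more often missing
```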
There are some rules to follow, particularly when specifying types. First, if experiment = TRUE, there must be exactly one treatment variable (taken to be binary) and it must come first to ensure that the elements of treatment_cor are handled properly. Second, if there are any partially observed nominal variables, they must come last; this is to ensure that they are conditionally uncorrelated with each other. Third, fully observed nominal variables are not supported, but they can be made into ordinal variables and then converted to nominal after the fact. Fourth, including both ordinal and nominal partially observed variables is not supported yet. Finally, if any variable is specified as a count, it will not be exactly consistent with the data-generating process. Essentially, a count variable is constructed from a continuous variable by evaluating pt on it and passing that to qpois with an intensity parameter of 5. The other non-continuous variables are constructed via some transformation or discretization of a continuous variable.
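The count construction described above can be sketched in a few lines of base R. The draw from rt below stands in for one continuous variable and is an assumption made for illustration; inside the function the continuous variable comes from the multivariate data-generating process.

```r
set.seed(2)
df <- 5                    # degrees of freedom; an arbitrary choice for this sketch
z <- rt(1000, df = df)     # stand-in for one continuous variable

# Evaluate the t CDF to get values on (0, 1), then map them
# through the Poisson quantile function with intensity 5
u <- pt(z, df = df)
counts <- qpois(u, lambda = 5)

table(counts)
mean(counts)               # should be close to the intensity of 5
```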
If some partially observed variables are either ordinal or nominal (but not both), then the n_cat argument governs how many categories there are. If n_cat is NULL, then the number of categories defaults to three. If n_cat has length one, then that number of categories will be used for all categorical variables but must be greater than two. Otherwise, the length of n_cat must match the number of partially observed categorical variables and the number of categories for the \(i\)th such variable will be the \(i\)th element of n_cat.
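The n_cat rules can be written out as a small function. This is a sketch of the rules as stated, not the package's internal code; `resolve_n_cat` and `n_categorical` are hypothetical names.

```r
# Hypothetical helper (not part of the package): applies the n_cat rules above
# for n_categorical partially observed categorical variables.
resolve_n_cat <- function(n_cat, n_categorical) {
  if (is.null(n_cat)) {
    # default: three categories for every categorical variable
    return(rep(3L, n_categorical))
  }
  if (length(n_cat) == 1) {
    if (n_cat <= 2) stop("a length-one n_cat must be greater than two")
    # a single value is used for all categorical variables
    return(rep(as.integer(n_cat), n_categorical))
  }
  if (length(n_cat) != n_categorical) {
    stop("length of n_cat must match the number of categorical variables")
  }
  # otherwise the i-th element applies to the i-th categorical variable
  as.integer(n_cat)
}

resolve_n_cat(NULL, 2)       # three categories each
resolve_n_cat(4, 3)          # four categories for all three variables
resolve_n_cat(c(3, 5), 2)    # element-wise
```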
# NOT RUN {
library(mi)  # provides rdata.frame() and missing_data.frame()

# skewed multivariate t data with unrestricted (NMAR) missingness
rdf <- rdata.frame(n_partial = 2, df = 5, alpha = rnorm(5))
print(rdf$empirical_CPCs) # not zero

# MAR holds under the "triangular" restriction
rdf <- rdata.frame(n_partial = 2, restrictions = "triangular", alpha = NA)
print(rdf$empirical_CPCs) # only randomly different from zero
print(rdf$L == 0) # some are exactly zero by construction

# inspect the observed data
mdf <- missing_data.frame(rdf$obs)
show(mdf)
hist(mdf)
image(mdf)

# a randomized experiment
rdf <- rdata.frame(n_full = 2, n_partial = 2,
                   restrictions = "triangular", experiment = TRUE,
                   types = c("t", "ord", "con", "pos"),
                   treatment_cor = c(0, 0, NA, 0, NA))
# recover the correlation matrix from its Cholesky factor
Sigma <- tcrossprod(rdf$L)
rownames(Sigma) <- colnames(Sigma) <- c("treatment", "X_2", "y_1", "Y_2",
                                        "missing_y_1", "missing_Y_2")
print(round(Sigma, 3))
# }