Imputation is a family of statistical methods for replacing missing
values with estimates. Introduced by Rubin and Schenker (1986) and
Rubin (1987), Multiple Imputation (MI) is a family of imputation
methods that produces multiple estimates of each missing value, and
therefore reflects the variability of those estimates.
The Multiple Imputation Sequential Sampler (MISS) function performs
MI by determining the type of each variable, and therefore its
sampler, and then sequentially progresses through each variable in
the data set that has missing values. At each iteration, it updates
its prediction of those missing values given all other variables in
the data set.
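The sequential pass described above can be sketched in base R. This is a simplified illustration only: it substitutes an ordinary least-squares fit plus residual noise for the samplers that MISS uses, and the data and the column names (x1, x2, x3) are hypothetical.

```r
set.seed(7)
n <- 50
X <- cbind(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
X[sample(n, 5), "x1"] <- NA
X[sample(n, 5), "x2"] <- NA

miss <- is.na(X)                      # remember where the gaps were
targets <- which(colSums(miss) > 0)   # columns that need imputing

# Starting values: column means of the observed entries
for (j in targets) X[miss[, j], j] <- mean(X[!miss[, j], j])

for (iter in 1:20) {
  for (j in targets) {
    d <- as.data.frame(X)
    resp <- colnames(X)[j]
    f <- reformulate(setdiff(colnames(X), resp), response = resp)
    # Fit on rows where this variable was observed (stand-in for the sampler)
    fit <- lm(f, data = d[!miss[, j], ])
    pred <- predict(fit, newdata = d[miss[, j], , drop = FALSE])
    # Update the missing entries given the current state of all other variables
    X[miss[, j], j] <- pred + rnorm(length(pred), 0, summary(fit)$sigma)
  }
}
```

Each inner step conditions on the most recent imputations of the other variables, which is the sense in which the sampler is sequential.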
MI is best performed within a model, where it is called
full-likelihood imputation. Examples may be found in the "Examples"
vignette. However, sometimes it is impractical to impute within a
model when there are numerous missing values and a large number of
parameters are therefore added. As an alternative, MI may be
performed on the data set before the data is passed to the model,
such as in the IterativeQuadrature, LaplaceApproximation,
LaplacesDemon, or VariationalBayes function. This is less desirable, but
MISS is available for MCMC-based MI in this case.
Missing values are initially set to column means for continuous
variables, and are set to one for discrete variables.
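As a minimal sketch of this initialization, the base R helper below fills each gap with a starting value. The function init_missing is hypothetical and not part of the package, and its rule for deciding that a column is discrete (all observed values are whole numbers) is an assumption made for illustration.

```r
# Hypothetical helper: assign starting values to missing entries
init_missing <- function(X) {
  X <- as.matrix(X)
  for (j in seq_len(ncol(X))) {
    miss <- is.na(X[, j])
    if (!any(miss)) next
    obs <- X[!miss, j]
    # Assumption: whole-number columns are treated as discrete
    is_discrete <- all(obs == round(obs))
    X[miss, j] <- if (is_discrete) 1 else mean(obs)
  }
  X
}

X <- cbind(height = c(1.5, NA, 2.5, 2.0),
           smoker = c(0, 1, NA, 1))
X0 <- init_missing(X)
```

Here the missing height starts at the mean of the observed heights, and the missing smoker indicator starts at one.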
MISS uses the following methods and MCMC algorithms:
Missing values of continuous variables are estimated with a
ridge-stabilized linear regression Gibbs sampler.
Missing values of binary variables that have only 0 or 1 for values
are estimated with the binary robit (t-link logistic regression
model) Gibbs sampler of Albert and Chib (1993).
Missing values of discrete variables with 3 or more (ordered or
unordered) discrete values are considered continuous.
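The rules above amount to a two-way classification of columns, which might be sketched as follows. The helper classify_column and the example data are hypothetical; how MISS inspects columns internally is not shown here.

```r
# Hypothetical helper: pick a sampler type from the observed values
classify_column <- function(x) {
  obs <- x[!is.na(x)]
  # Only 0/1 values -> binary; everything else is treated as continuous
  if (length(obs) > 0 && all(obs %in% c(0, 1))) "binary" else "continuous"
}

X <- data.frame(income   = c(31.2, NA, 48.9),
                employed = c(1, 0, NA),
                rating   = c(1, 3, NA))
types <- vapply(X, classify_column, character(1))
```

Note that rating, although discrete, takes a value other than 0 or 1 and so falls into the continuous group.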
In the presence of big data, it is suggested that the user
sequentially read in batches of data that are small enough to be
manageable, and then apply the MISS function to each batch. Each batch
should be representative of the whole, of course.
It is common for multiple imputation functions to handle variable
transformations. MISS does not transform variables, but imputes what
it gets. For example, if a user has a variable that should be positive
only, then it is recommended here that the user log-transform the
variable, pass the data set to MISS, and when finished, exponentiate
both the observed and imputed values of that variable.
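The log-transform workflow can be sketched in base R as follows. The imputation step here is a stand-in (random draws around the observed mean on the log scale), used only so the example runs without the package; in practice the imputed values would come from MISS.

```r
set.seed(1)
x <- c(2.3, NA, 5.1, 0.8, NA, 3.4)   # positive-only variable with gaps
miss <- is.na(x)

z <- log(x)                           # transform; NA entries stay NA

# Stand-in for the imputation step (assumption, not the MISS sampler)
z[miss] <- rnorm(sum(miss), mean(z[!miss]), sd(z[!miss]))

x_imp <- exp(z)                       # back-transform observed and imputed values
```

Because the imputations are drawn on the log scale, every back-transformed value is positive, and the observed values are recovered up to floating-point error.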
The CenterScale function should also be considered to speed up
convergence.
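The idea of centering and scaling before imputation, and undoing it afterward, is illustrated below with the base R scale function. This is a stand-in chosen so the example is self-contained; CenterScale has its own arguments and defaults, which should be checked in its help page.

```r
x <- c(10, NA, 30, 20)

z <- scale(x)                          # centers and scales, ignoring NA
mu <- attr(z, "scaled:center")         # stored mean of observed values
sigma <- attr(z, "scaled:scale")       # stored standard deviation

# After imputation on the z scale, values are mapped back like this
x_back <- as.vector(z) * sigma + mu
```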
It is hoped that MISS is helpful, though it is not without limitation
and there are numerous alternatives outside of the LaplacesDemon
package. If MISS does not fulfill the needs of
the user, then the following packages are recommended: Amelia, mi, or
mice. MISS emphasizes MCMC more than these alternatives, though MISS is
not as extensive. When a data set does not have a simple structure,
such as merely continuous or binary or unordered discrete, the
LaplacesDemon function should be considered, where a
user can easily specify complicated structures such as multilevel,
spatial or temporal dependence, and more.
Matrix inversions are required in the Gibbs sampler. Matrix inversions
become more cumbersome as the number \(J\) of variables increases.
If a large number of iterations is used, then the user may consider
studying the imputations for approximate convergence with the
BMK.Diagnostic function, by supplying the transpose of
Fit$Imp. In the presence of numerous missing values, say more
than 100, the user may consider iterating through the study of the
imputations of 100 missing values at a time.
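Iterating through the imputations 100 at a time might look like the sketch below. The matrix Imp is simulated here as a stand-in for Fit$Imp, under the assumption that it holds one row per missing value and one column per iteration, so its transpose has one column per missing value. The diagnostic call itself is left as a comment so the example runs without the package.

```r
set.seed(42)
M <- 250                               # number of missing values (simulated)
n_iter <- 200                          # number of iterations (simulated)
Imp <- matrix(rnorm(M * n_iter), nrow = M, ncol = n_iter)

draws <- t(Imp)                        # iterations in rows, missing values in columns

# Split the column indices into groups of at most 100
chunks <- split(seq_len(M), ceiling(seq_len(M) / 100))

for (idx in chunks) {
  block <- draws[, idx, drop = FALSE]
  # With LaplacesDemon loaded, block could be passed on here,
  # for example: BMK.Diagnostic(block)
  cat("missing values", min(idx), "to", max(idx), ":",
      ncol(block), "columns\n")
}
```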