The mi function cannot be run in isolation. It is the most important step of a multi-step process to perform multiple imputation. The data must be specified as a missing_data.frame before mi is used to impute missing values for one or more missing_variables. An iterative algorithm is used in which each missing_variable is modeled (using fit_model) as a function of all the other missing_variables and their missingness patterns. This documentation outlines the technical uses of the mi function. For a more general discussion of how to use mi for multiple imputation, see mi-package.
mi(y, model, ...)
## Hidden arguments:
## n.iter = 30, n.chains = 4, max.minutes = Inf, seed = NA, verbose = TRUE,
## save_models = FALSE, parallel = .Platform$OS.type != "windows"
y
Typically an object that inherits from the missing_data.frame-class, although many methods are defined for subclasses of the missing_variable-class. Alternatively, if y = "parallel", the appropriate parallel backend will be registered but no imputation is performed. See the Details section.
model
Missing when y = "parallel" or when y inherits from the missing_data.frame-class; otherwise it should be the result of a call to fit_model.
Further arguments, the most important of which are the following (an illustrative call is sketched after this list):
n.iter
number of iterations to perform, defaulting to 30
n.chains
number of chains to use, ideally equal to the number of virtual cores available for use, and defaulting to 4
max.minutes
hard time limit that defaults to 20
seed
either NA, which is the default, or a pseudo-random number seed
verbose
logical scalar that is TRUE by default, indicating that the progress of the iterative algorithm should be printed to the screen, which does not work under Windows when the chains are executed in parallel
save_models
logical scalar that defaults to FALSE but, if TRUE, indicates that the models estimated on a frozen completed dataset should be saved. This option should be used if the user is interested in evaluating the quality of the models run after the last iteration of the mi algorithm, but saving these models consumes much more RAM.
debug
logical scalar indicating whether to run in debug mode, which forces the processing to be sequential and allows developers to capture errors within the chains
parallel
if TRUE, parallel processing is used, if available; if FALSE, sequential processing is used. Alternatively, this argument may be an object produced by makeCluster.
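As a concrete illustration, a call passing several of these arguments explicitly might look like the following sketch. The particular values chosen for n.iter, n.chains, and seed are arbitrary, and mdf is assumed to be a missing_data.frame built as in the example at the end of this page.

# Sketch only: mdf is assumed to be a missing_data.frame; the argument
# values are illustrative, not recommendations
imputations <- mi(mdf, n.iter = 60, n.chains = 4, seed = 123, verbose = TRUE)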
If model is missing and n.chains is positive, then the mi method will return an object of class "mi", which has the following slots:
the call to mi
a list of missing_data.frames, one for each chain
an integer vector that records how many iterations have been performed
There are a few methods for such an object, such as show, summary, dimnames, nrow, ncol, etc.
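For instance, assuming imputations is an object of class "mi" returned by a call such as the one in the example below, these methods can be invoked directly:

# Assumes imputations is an object of class "mi"
show(imputations)      # brief description of the object
summary(imputations)   # numerical summaries of the variables
dimnames(imputations)  # row and column names of the underlying data
nrow(imputations)      # number of observations
ncol(imputations)      # number of variables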
If mi is called on a missing_data.frame with model missing and a nonpositive n.chains, then the missing_data.frame will be returned after allocating storage. If model is not missing, then the mi method will impute missing values for the y argument and return it.
It is important to distinguish the two mi methods that are most relevant to users from the many mi methods that are less relevant. The primary mi method is that where y inherits from the missing_data.frame-class and model is omitted. This method “does” the imputation according to the additional arguments described under … above and returns an object of class "mi". Executing two or more independent chains is important for monitoring the convergence of each chain; see Rhats.
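A typical convergence check, assuming imputations is the object of class "mi" returned by such a call, is simply:

# Assumes imputations is an object of class "mi"
Rhats(imputations)  # values near 1 suggest the chains have converged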
If the chains have not converged in the number of iterations or amount of time specified, the second important mi method is that where y is an object of class "mi" and model is omitted; it continues a previous run of the iterative imputation algorithm. All the arguments described under … above remain applicable, except for n.chains and save_models, because these are established by the previous run that is being continued.
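For example, a run that has not yet converged could be continued for additional iterations along these lines (the choice of 20 further iterations is arbitrary):

# Assumes imputations is an object of class "mi" from an earlier call to mi()
imputations <- mi(imputations, n.iter = 20)  # picks up where the previous run stopped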
The numerous remaining methods are of less importance to users. One mi method is called when y = "parallel" and model is omitted. This method merely sets up the parallel backend so that the chains can be executed in parallel on the local machine. We use the mclapply function in the parallel package to implement parallel processing on non-Windows machines, and we use the snow package to implement parallel processing on Windows machines; we refer users to the documentation of these packages for more detail about parallel processing. Parallel processing is used by default on machines with multiple processors, but sequential processing can be requested with the parallel = FALSE option. If the user is not using a multicore computer, sequential processing is used instead of parallel processing.
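As a sketch, sequential processing can be forced on any machine, and the parallel backend can be registered on its own beforehand if desired:

# Force sequential processing regardless of the number of available cores
imputations <- mi(mdf, parallel = FALSE)
# Register the parallel backend without performing any imputation
mi("parallel")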
The first two mi methods described above in turn call a mi method where y inherits from the missing_data.frame-class and model is that which is returned by one of the fit_model-methods. These methods impute values for the originally missing values of a missing_variable given a fitted model, according to the imputation_method slot of the missing_variable in question. Advanced users could define new subclasses of the missing_variable-class, in which case it may be necessary to write such a mi method for the new class. It will almost certainly be necessary to add to the fit_model-methods. The existing mi and fit_model-methods should provide a template for doing so.
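To see which methods already exist and can serve as templates, the standard S4 tools can be used to list them:

# List the S4 methods currently defined for the mi and fit_model generics
showMethods("mi")
showMethods("fit_model")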
## Not run:
# STEP 0: Get data
data(CHAIN, package = "mi")
# STEP 1: Convert to a missing_data.frame
mdf <- missing_data.frame(CHAIN) # warnings about missingness patterns
show(mdf)
# STEP 2: change things
mdf <- change(mdf, y = "log_virus", what = "transformation", to = "identity")
# STEP 3: look deeper
summary(mdf)
# STEP 4: impute
imputations <- mi(mdf)
## End(Not run)