mice(data, m = 5,
method = vector("character",length=ncol(data)),
predictorMatrix = (1 - diag(1, ncol(data))),
visitSequence = (1:ncol(data))[apply(is.na(data),2,any)],
post = vector("character", length = ncol(data)),
defaultMethod = c("pmm","logreg","polyreg","polr"),
maxit = 5,
diagnostics = TRUE,
printFlag = TRUE,
seed = NA,
imputationMethod = NULL,
defaultImputationMethod = NULL,
data.init = NULL,
...
)
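data: A data frame or a matrix containing the incomplete data. Missing values are coded as NA.
m: Number of multiple imputations. The default is m=5.
method: Either a single string or a vector of strings with length ncol(data), specifying the elementary imputation method to be used for each column in data. If specified as a single string, the same method will be used for all columns.
predictorMatrix: A square matrix of size ncol(data) containing 0/1 data specifying the set of predictors to be used for each target column. Rows correspond to target variables (i.e. variables to be imputed), in the sequence as they appear in data. A value of 1 means that the column variable is used as a predictor for the target variable in that row. By default, all other columns are used as predictors.
visitSequence: A vector of integers specifying the column indices in the order in which they are visited during one pass through the data. The default visits the incomplete columns from left to right.
post: A vector of strings with length ncol(data), specifying expressions. Each string is parsed and executed within the sampler() function to postprocess imputed values. The default is to do nothing, indicated by a vector of empty strings.
defaultMethod: A vector of strings containing the default imputation methods: pmm for numeric columns, logreg for factors with two levels, polyreg for unordered factors with more than two levels, and polr for ordered factors with more than two levels.
maxit: A scalar giving the number of iterations. The default is 5.
diagnostics: A Boolean flag. If TRUE, diagnostic information will be appended to the value of the function. If FALSE, only the imputed data are saved. The default is TRUE.
printFlag: If TRUE, mice will print history on console. Use printFlag=FALSE for silent computation.
seed: An integer that is used as argument by set.seed() for offsetting the random number generator. Default is to leave the random number generator alone.
imputationMethod: Same as the method argument. Included for backwards compatibility.
defaultImputationMethod: Same as the defaultMethod argument. Included for backwards compatibility.
data.init: A data frame of the same size and type as data, without missing data, used to initialize imputations before the start of the iterative process. The default NULL implies that starting imputations are created by a simple random draw from the data.
...: Named arguments that are passed down to the elementary imputation functions.

Returns an object of class mids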
(multiply imputed data set) with components:
nmis: An array of length ncol(data) containing the number of missing observations per column.
imp: A list of ncol(data) components with the generated multiple imputations. Each part of the list is a nmis[j] by m matrix of imputed values for variable data[,j]. The component equals NULL for columns without missing data.
method: A vector of strings of length ncol(data) specifying the elementary imputation method per column.
predictorMatrix: A square matrix of size ncol(data) containing 0/1 data specifying the predictor set.
post: A vector of strings of length ncol(data) with commands for post-processing.
chainMean: The means of the imputed values, with factors entering through their as.integer() codes. Note that observed data are not present in this mean.
chainVar: A list with a structure similar to chainMean, containing the variances of the imputed values.
pad: A list describing the padded imputation model. Its members are pad$data (data padded with columns for factors), pad$predictorMatrix (predictor matrix for the padded data), pad$method (imputation methods applied to the padded data), the vector pad$visitSequence (the visit sequence applied to the padded data), pad$post (post-processing commands for padded data) and categories (a matrix containing descriptive information about the padding operation).
loggedEvents: A record of automatic removal actions. It is NULL if no action was made. At initialization the program does the following three actions: 1. a variable that contains missing values, that is not imputed and that is used as a predictor is removed, 2. a constant variable is removed, and 3. a collinear variable is removed. During iteration, the program does the following actions: 1. one or more variables that are linearly dependent are removed (for categorical data, a 'variable' corresponds to a dummy variable), and 2. proportional odds regression imputation that does not converge is replaced by polyreg. Column it is the iteration number at which the record was added, im is the imputation number, co is the column number in the data, dep is the name of the dependent variable, meth is the imputation method used, and out is a (possibly long) character vector with the names of the altered or removed predictors.

A separate univariate imputation model can be specified for each column. The default imputation method depends on the measurement level of the target column. In addition to these, several other methods are provided. Users can also write their own imputation functions, and call these from within the algorithm.
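How a default method might follow from the measurement level can be sketched in base R. The helper default_method_for() below is hypothetical and not part of mice; the mapping it uses (numeric, two-level factor, unordered factor, ordered factor) is an assumption based on the order of the defaultMethod values in the usage above.

```r
# Hypothetical helper (not part of mice): choose a default method name
# from the measurement level of a single column.
default_method_for <- function(y,
                               defaults = c("pmm", "logreg", "polyreg", "polr")) {
  if (is.numeric(y)) return(defaults[1])                    # numeric column
  if (is.factor(y) && nlevels(y) == 2) return(defaults[2])  # two-level factor
  if (is.ordered(y)) return(defaults[4])                    # ordered, > 2 levels
  if (is.factor(y)) return(defaults[3])                     # unordered, > 2 levels
  ""                                                        # otherwise: no method
}

# Toy data with one column per measurement level (made-up values).
toy <- data.frame(
  age    = c(23, NA, 41),
  sex    = factor(c("f", "m", NA)),
  edu    = factor(c("low", "mid", "high"),
                  levels = c("low", "mid", "high"), ordered = TRUE),
  region = factor(c("north", "south", "east"))
)

meth <- vapply(toy, default_method_for, character(1))
meth
```

The resulting named character vector has the same shape as the method argument, so a vector like this could be edited by hand before being passed to mice().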
The data may contain categorical variables that are used in
regressions on other variables. The algorithm creates dummy variables
for the categories of these variables, and imputes these from the
corresponding categorical variable. The extended model containing the
dummy variables is called the padded model. Its structure is stored in
the list component pad.
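The idea of padding can be illustrated with base R alone. The sketch below is not the code mice uses internally; it only shows, under the assumption of treatment contrasts, how a factor expands into 0/1 dummy columns that sit next to the original data. The toy values are made up.

```r
# Toy data with one categorical predictor (made-up values).
toy <- data.frame(
  bmi = c(22.1, 27.4, 30.2, 24.8),
  age = factor(c("20-39", "40-59", "60+", "20-39"))
)

# One 0/1 column per non-reference category; the intercept is dropped.
dummies <- model.matrix(~ age, data = toy)[, -1, drop = FALSE]

# "Padded" data: the original columns followed by the dummy columns.
padded <- cbind(toy, dummies)
padded
```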
Built-in elementary imputation methods include those named on this page: pmm, logreg, polyreg, polr, norm and sample.
The corresponding functions are coded in the mice library under the
names mice.impute.method, where method is a string with the name of the
elementary imputation method, for example norm. The method argument
specifies the methods to be used. For the j'th column, mice() calls the
first occurrence of paste("mice.impute.",method[j],sep="") in the search
path. The mechanism allows users to write customized imputation
functions, such as mice.impute.myfunc. To call it for all columns specify
method="myfunc". To call it only for, say, column 2 specify
method=c("norm","myfunc","logreg",...).
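A minimal sketch of a user-written method and of the name lookup described above. The argument names y, ry and x are an assumption about the calling convention (target column, observed-value indicator, predictor matrix), and the body simply draws from the observed values; it is an illustration, not a recommended imputation model.

```r
# User-written method: impute by drawing at random from the observed values.
# Assumed arguments: y = target column, ry = TRUE where y is observed,
# x = matrix of predictors (ignored in this sketch).
mice.impute.myfunc <- function(y, ry, x, ...) {
  observed <- y[ry]
  observed[sample.int(length(observed), size = sum(!ry), replace = TRUE)]
}

# Build the function name for column j and look it up by name.
method <- c("norm", "myfunc", "logreg")
j <- 2
f <- get(paste("mice.impute.", method[j], sep = ""), mode = "function")

# Call it directly on a toy column (made-up values).
y  <- c(5.1, NA, 4.7, NA, 6.0)
ry <- !is.na(y)
set.seed(1)
imputed <- f(y, ry, x = NULL)
imputed
```

The function returns one value per missing entry, which is the shape a column-wise imputation step needs.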
Passive imputation:
mice() supports a special built-in method, called passive imputation.
This method can be used to ensure that a data transform always depends
on the most recently generated imputations. In some cases, an imputation
model may need transformed data in addition to the original data (e.g.
log, quadratic, recodes, interaction, sum scores, and so on). Passive
imputation maintains consistency among different transformations of the
same data. Passive imputation is invoked if ~ is specified as the first
character of the string that specifies the elementary method. mice()
interprets the entire string, including the ~ character, as the formula
argument in a call to model.frame(formula, data[!r[,j],]). This provides
a simple mechanism for specifying deterministic dependencies among the
columns. For example, suppose that the missing entries in variables
data$height and data$weight are imputed. The body mass index (BMI) can
be calculated within mice by specifying the string "~I(weight/height^2)"
as the elementary imputation method for the target column data$bmi.
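The model.frame() call mentioned above can be tried in base R without running mice at all. The sketch below uses made-up data in which height and weight are already complete, takes r to be the matrix of observed-value indicators, and fills in the missing bmi entries from the passive formula.

```r
# Toy data: bmi is missing in rows 2 and 3 (made-up values).
data <- data.frame(
  height = c(1.70, 1.80, 1.60),
  weight = c(65, 81, 64),
  bmi    = c(22.5, NA, NA)
)
r <- !is.na(data)                    # TRUE where a value is observed
j <- which(names(data) == "bmi")     # index of the target column

# Evaluate the passive method string as a formula on the rows where the
# target column is missing.
passive <- "~I(weight/height^2)"
mf <- model.frame(as.formula(passive), data[!r[, j], ])

# Store the derived values in the missing entries of the target column.
data$bmi[!r[, j]] <- as.numeric(mf[[1]])
data
```

Only the rows with a missing bmi are touched; the observed value in row 1 is left as it was.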
Note that the ~ mechanism works only on those entries which have
missing values in the target column. You should make sure that the
combined observed and imputed parts of the target column make sense. An
easy way to create consistency is by coding all entries in the target as
NA, but for large data sets, this could be inefficient.
Note that you may also need to adapt the default predictorMatrix to
avoid linear dependencies among the predictors that could cause errors
like Error in solve.default() or Error: system is exactly singular.
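One possible adjustment, sketched here with hypothetical column names, starts from the default matrix in the usage above and stops a derived column from predicting the columns it is computed from. Which entries to zero depends on the data; this is only an illustration of the mechanics.

```r
# Hypothetical column names; bmi is assumed to be derived from height and weight.
vars <- c("height", "weight", "bmi", "age")

# Default: every column is predicted from all other columns.
pred <- 1 - diag(1, length(vars))
dimnames(pred) <- list(vars, vars)

# Rows are targets, columns are predictors: do not use bmi when imputing
# height or weight.
pred[c("height", "weight"), "bmi"] <- 0
pred
```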
Though not strictly needed, it is often useful to specify visitSequence
such that the column that is imputed by the ~ mechanism is visited each
time after one of its predictors was visited. In that way, the
deterministic relation between columns will always be synchronized.
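Such a visit sequence can be built in base R. The sketch assumes made-up data in which bmi is the passively imputed column and height and weight are its predictors; the final mice() call is left as a comment because the method vector shown there is only an example.

```r
# Toy data with three incomplete columns (made-up values).
data <- data.frame(
  height = c(1.70, NA, 1.60),
  weight = c(65, 81, NA),
  bmi    = c(22.5, NA, NA)
)

# Default from the usage: incomplete columns, left to right.
default_vis <- (1:ncol(data))[apply(is.na(data), 2, any)]

# Revisit the passive column directly after each of its predictors.
bmi_col <- which(names(data) == "bmi")
vis <- as.vector(rbind(setdiff(default_vis, bmi_col), bmi_col))
vis

# Hypothetical call, not run here:
# imp <- mice(data, method = c("pmm", "pmm", "~I(weight/height^2)"),
#             visitSequence = vis)
```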
Van Buuren, S., Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67.
Van Buuren, S. (2012). Flexible Imputation of Missing Data. Boca Raton, FL: Chapman & Hall/CRC Press.
Van Buuren, S., Brand, J.P.L., Groothuis-Oudshoorn, C.G.M., Rubin, D.B. (2006). Fully conditional specification in multivariate imputation. Journal of Statistical Computation and Simulation, 76(12), 1049-1064.
Van Buuren, S. (2007). Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research, 16(3), 219-242.
Van Buuren, S., Boshuizen, H.C., Knook, D.L. (1999). Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine, 18, 681-694.
Brand, J.P.L. (1999). Development, implementation and evaluation of multiple imputation strategies for the statistical analysis of incomplete data sets. Dissertation. Rotterdam: Erasmus University.
See also: complete, mids, with.mids, set.seed
library(mice)

# do default multiple imputation on a numeric matrix
imp <- mice(nhanes)
imp
# list the actual imputations for BMI
imp$imp$bmi
# first completed data matrix
complete(imp)
# imputation on mixed data with a different method per column
mice(nhanes2, method=c("sample","pmm","logreg","norm"))
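A further sketch, assuming the mice package is installed, that uses the seed and printFlag arguments to run the same imputation twice without console output and checks whether the first completed data sets agree. The small values of m and maxit are chosen only to keep the run short.

```r
# Runs only when the mice package is available.
if (requireNamespace("mice", quietly = TRUE)) {
  library(mice)
  # Same seed, silent computation.
  imp1 <- mice(nhanes, m = 2, maxit = 2, seed = 123, printFlag = FALSE)
  imp2 <- mice(nhanes, m = 2, maxit = 2, seed = 123, printFlag = FALSE)
  # Compare the first completed data sets of the two runs.
  same <- identical(complete(imp1), complete(imp2))
  print(same)
}
```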