MIPCA: Multiple Imputation with PCA

Description

MIPCA performs multiple imputation with a PCA model. It can be used as a preliminary step to perform multiple imputation in PCA.

Usage

MIPCA(X, ncp = 2, scale = TRUE, method = c("Regularized", "EM"), threshold = 1e-04,
    nboot = 100, method.mi = "Boot", Lstart = 1000, L = 100, verbose = FALSE)

Value

res.imputePCA: A matrix corresponding to the imputed dataset obtained with the function imputePCA (the completed dataset)
res.MI: A list of data frames corresponding to the nboot imputed data sets
call: the matched call
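
The returned components can be inspected as in the minimal sketch below. It assumes the missMDA package is installed and uses its `orange` example data set; `nboot = 20` and the seed are arbitrary choices made only to keep the run short, not recommended settings.

```r
library(missMDA)

data(orange)       # example data with missing values, shipped with missMDA
set.seed(123)      # the imputations are random draws (seed chosen arbitrarily)

# Default method.mi = "Boot"; nboot reduced from 100 for speed (illustrative only)
res.mi <- MIPCA(orange, ncp = 2, nboot = 20)

names(res.mi)                 # components of the returned list
dim(res.mi$res.imputePCA)     # single completed data set from imputePCA
length(res.mi$res.MI)         # number of imputed data sets (equals nboot)

# Check that observed cells are identical in every imputed data set
obs <- !is.na(as.matrix(orange))
same.observed <- sapply(res.mi$res.MI, function(d) {
  isTRUE(all.equal(as.matrix(d)[obs], as.matrix(orange)[obs]))
})
all(same.observed)

# Visualise the uncertainty due to missing values on the PCA maps
plot(res.mi)
```

Only the cells that were missing in `orange` differ between the elements of `res.MI`.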

Arguments

X: a data.frame with continuous variables containing missing values
ncp: integer corresponding to the number of components used to reconstruct the data with the PCA reconstruction formula
scale: logical. TRUE by default, which gives the same weight to each variable
method: "Regularized" by default or "EM"
threshold: the threshold used to assess convergence of the algorithm
nboot: the number of imputed datasets
method.mi: a string. If "Bayes", the uncertainty on the parameters of the imputation model is taken into account using a Bayesian treatment of PCA. The default is "Boot", which gives a multiple imputation that reflects this uncertainty using a bootstrap procedure. See details.
Lstart: number of iterations for the burn-in period (only used if method.mi="Bayes")
L: number of skipped iterations to keep one imputed data set after the burn-in period (only used if method.mi="Bayes")
verbose: use verbose=TRUE for screen printing of iteration numbers

Author

Francois Husson francois.husson@institut-agro.fr, Julie Josse julie.josse@polytechnique.edu and Vincent Audigier

Details

MIPCA generates nboot imputed datasets from a PCA model. The observed values are the same from one dataset to the next, whereas the imputed values change. The variation among the imputed values reflects the variability with which the missing values can be predicted. The multiple imputation is proper in the sense of Little and Rubin (2002), since it takes into account the variability of the parameters.

Two versions are available: multiple imputation using a parametric bootstrap (Josse and Husson, 2011) and multiple imputation using a Bayesian treatment of the PCA model (Audigier et al., 2015). The methods differ in the way the variability due to missing values is reflected. The method used is controlled by the method.mi argument.

By default, MIPCA uses the parametric bootstrap (method.mi="Boot"). This method is recommended for evaluating uncertainty in PCA (through confidence ellipses). Alternatively, the Bayesian method can be selected with method.mi="Bayes". It is based on an iterative algorithm that alternates between imputing the data set and drawing the PCA parameters from their posterior distribution. These steps are repeated Lstart times to reach convergence. Then, one imputed data set is kept every L iterations to ensure independence between the imputed values from one data set to another. The Bayesian method is recommended when a statistical method is to be applied to an incomplete data set.
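
As an illustration of the Bayesian variant followed by a statistical analysis, here is a minimal sketch. It assumes missMDA is installed. The simulated data, the variable names (`x1`, `x2`, `x3`, `y`), the regression model, and the short chain (`Lstart = 100`, `L = 10`, `nboot = 20`) are all hypothetical choices made for speed, not recommended settings. The estimates are pooled by hand with Rubin's rules.

```r
library(missMDA)

set.seed(1)
n <- 100
z <- rnorm(n)                        # latent factor inducing correlation
X <- data.frame(
  x1 = z + rnorm(n, sd = 0.5),
  x2 = z + rnorm(n, sd = 0.5),
  x3 = -z + rnorm(n, sd = 0.5),
  y  = 2 * z + rnorm(n, sd = 0.5)
)

# Remove 10 values per variable, in disjoint rows so no row is fully missing
X.miss <- X
X.miss$x1[1:10]  <- NA
X.miss$x2[11:20] <- NA
X.miss$x3[21:30] <- NA
X.miss$y[31:40]  <- NA

# Bayesian multiple imputation; short burn-in and thinning for illustration only
res.bayes <- MIPCA(X.miss, ncp = 2, method.mi = "Bayes",
                   nboot = 20, Lstart = 100, L = 10)

# Fit the same model on each imputed data set
fits <- lapply(res.bayes$res.MI, function(d) {
  d <- as.data.frame(d)
  names(d) <- names(X.miss)          # defensive: make sure column names are set
  lm(y ~ x1, data = d)
})

# Pool the slope of x1 with Rubin's rules
est   <- sapply(fits, function(f) coef(f)[["x1"]])
u     <- sapply(fits, function(f) vcov(f)["x1", "x1"])
m     <- length(fits)
q.bar <- mean(est)                          # pooled point estimate
t.var <- mean(u) + (1 + 1 / m) * var(est)   # within + between imputation variance
c(estimate = q.bar, std.error = sqrt(t.var))
```

The between-imputation term `var(est)` is what carries the extra uncertainty due to the missing values; analysing a single completed data set would omit it.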

References

Josse, J., Husson, F. (2011). Multiple Imputation in PCA. Advances in Data Analysis and Classification.

Audigier, V., Josse, J., Husson, F. (2015). Multiple imputation for continuous variables using a Bayesian principal component analysis. Journal of Statistical Computation and Simulation.

Little, R.J.A., Rubin, D.B. (2002). Statistical Analysis with Missing Data. Wiley Series in Probability and Statistics, New York.