emve: Extended Minimum Volume Ellipsoid (EMVE) in the presence of missing data

Description

Computes the Extended S-Estimate (ESE) version of the minimum volume ellipsoid (EMVE), which is used by default as the initial estimator in the Generalized S-Estimator (GSE) for missing data.

Usage

emve(x, maxits=5, sampling=c("uniform","cluster"), n.resample, n.sub.size, seed)

Value

An S4 object of class emve-class which is a subclass of the virtual class CovRobMissSc-class. The output S4 object contains the following slots:

`mu`	Estimated location. Can be accessed via `getLocation`.
`S`	Estimated scatter matrix. Can be accessed via `getScatter`.
`sc`	Estimated EMVE scale. Can be accessed via `getScale`.
`pmd`	Squared partial Mahalanobis distances. Can be accessed via `getDist`.
`pmd.adj`	Adjusted squared partial Mahalanobis distances. Can be accessed via `getDistAdj`.
`pu`	Dimension of the observed entries for each case. Can be accessed via `getDim`.
`call`	Object of class `"language"`. Not meant to be accessed.
`x`	Input data matrix. Not meant to be accessed.
`p`	Column dimension of input data matrix. Not meant to be accessed.
`estimator`	Character string of the name of the estimator used. Not meant to be accessed.
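
As an illustration of what the `pmd` and `pu` slots hold, the following is a minimal sketch in base R. It assumes the usual definition of the squared partial Mahalanobis distance, that is, the Mahalanobis distance restricted to the observed entries of each case. The helper `partial_mahalanobis` is hypothetical and not part of the package, and the adjustment behind `pmd.adj` is not reproduced here.

```r
# Hypothetical helper, not part of the package: squared Mahalanobis distance
# of each case, computed on its observed entries only.
partial_mahalanobis <- function(x, mu, S) {
  x <- as.matrix(x)
  n <- nrow(x)
  pmd <- numeric(n)
  pu <- integer(n)
  for (i in seq_len(n)) {
    o <- !is.na(x[i, ])                 # observed entries of case i
    pu[i] <- sum(o)                     # dimension of the observed part
    if (pu[i] == 0L) {                  # nothing observed: distance undefined
      pmd[i] <- NA_real_
      next
    }
    d <- x[i, o] - mu[o]
    pmd[i] <- drop(crossprod(d, solve(S[o, o, drop = FALSE], d)))
  }
  list(pmd = pmd, pu = pu)
}

x <- rbind(c(1, 2, NA), c(0, 0, 0), c(NA, NA, 3))
res <- partial_mahalanobis(x, mu = c(0, 0, 0), S = diag(3))
res$pmd   # 5 0 9
res$pu    # 2 3 1
```

For a case with no missing entries the result coincides with `mahalanobis(x, mu, S)` from base R.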

Arguments

x: a matrix or data frame. May contain missing values, but cannot contain columns with completely missing entries.
maxits: integer indicating the maximum number of iterations of Gaussian MLE calculation for each subsample. Default is 5.
sampling: which sampling scheme to use: 'uniform' or 'cluster' (see Leung and Zamar, 2016). Default is 'uniform'.
n.resample: integer indicating the number of subsamples. Default is 15 for clustering-based subsampling and 500 for uniform subsampling.
n.sub.size: integer indicating the size of each subsample. Default is 2(p+1)/a for clustering-based subsampling and (p+1)/a for uniform subsampling, where a is the proportion of non-missing cells.
seed: optional starting value for the random number generator. Default is seed = 1000.
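
To make the default subsample sizes concrete, here is a small sketch that evaluates the two formulas on simulated data and, if the GSE package is installed, passes the result to `emve` explicitly. The simulated data, the variable names, and the rounding with `ceiling` are assumptions made for illustration; the text above does not say how the package rounds the value.

```r
set.seed(1000)
n <- 100
p <- 4
x <- matrix(rnorm(n * p), n, p)

# Make one cell missing in each of 30 rows, so that no row and no column
# is entirely missing.
rows <- sample(n, 30)
x[cbind(rows, sample(p, 30, replace = TRUE))] <- NA

a <- mean(!is.na(x))                      # proportion of non-missing cells
n0_uniform <- ceiling((p + 1) / a)        # (p+1)/a; rounding up is an assumption
n0_cluster <- ceiling(2 * (p + 1) / a)    # 2(p+1)/a; rounding up is an assumption

# Only run the estimator when the GSE package is available.
if (requireNamespace("GSE", quietly = TRUE)) {
  fit <- GSE::emve(x, sampling = "uniform", n.resample = 500,
                   n.sub.size = n0_uniform, seed = 1000)
  print(GSE::getLocation(fit))
  print(GSE::getScatter(fit))
}
```

Here 30 of the 400 cells are missing, so a = 0.925 and the two sizes come out as 6 and 11.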

Author

Andy Leung andy.leung@stat.ubc.ca, Ruben H. Zamar, Mike Danilov, Victor J. Yohai

Details

This function computes EMVE as described in Danilov et al. (2012). Two subsampling schemes can be used for computing EMVE: uniform subsampling and the clustering-based subsampling described in Leung and Zamar (2016). For uniform subsampling, the number of subsamples must be large to ensure a high breakdown point. For clustering-based subsampling, the number of subsamples can be smaller. The subsample size \(n_0\) must be chosen to be larger than \(p\) to avoid singularity.
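
To give a feel for why uniform subsampling needs many subsamples, the sketch below uses the standard back-of-the-envelope argument for resampling estimators; it is not a formula taken from Danilov et al. (2012). It assumes each case in a subsample is contaminated independently with probability `eps`, and asks how many subsamples of size `n0` are needed so that at least one is free of contamination with probability `prob`. The function name and its arguments are hypothetical.

```r
# Hypothetical helper: number of uniform subsamples needed so that, with
# probability `prob`, at least one subsample of size `n0` contains no
# contaminated case when a fraction `eps` of the cases is contaminated.
n_subsamples_needed <- function(eps, n0, prob = 0.99) {
  p_clean <- (1 - eps)^n0                     # chance that one subsample is clean
  ceiling(log(1 - prob) / log(1 - p_clean))
}

n_subsamples_needed(eps = 0.2, n0 = 6)
n_subsamples_needed(eps = 0.2, n0 = 11)
```

The required count grows quickly with both the contamination fraction and the subsample size.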

The algorithm includes a concentration step in which the Gaussian MLE is computed for \(50\%\) of the data points using the classical EM algorithm and is then multiplied by a scalar factor. This step is repeated for each subsample. Because the computation becomes heavy as the number of subsamples increases, the maximum number of iterations of the classical EM algorithm (i.e. maxits) is set to 5 by default. Users are encouraged to refer to Danilov et al. (2012) for details about the algorithm and Rubin and Little (2002) for the classical EM algorithm for missing data.
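
For readers unfamiliar with the classical EM algorithm for a multivariate normal sample with missing values, the following is a minimal base R sketch capped at `maxits` iterations. It is not the package's implementation: the function `em_mvn` and its starting values are assumptions, and the selection of \(50\%\) of the data points and the scalar factor mentioned above are not shown.

```r
# Hypothetical sketch of the classical EM algorithm for the Gaussian MLE
# with missing values (every row must have at least one observed entry).
em_mvn <- function(x, maxits = 5) {
  x <- as.matrix(x)
  stopifnot(all(rowSums(!is.na(x)) > 0))
  n <- nrow(x)
  p <- ncol(x)
  # Starting values (an assumption): observed column means and variances.
  mu <- colMeans(x, na.rm = TRUE)
  S <- diag(apply(x, 2, var, na.rm = TRUE), p)
  for (it in seq_len(maxits)) {
    xhat <- x                 # data with missing cells filled in (E-step)
    C <- matrix(0, p, p)      # accumulated conditional covariances
    for (i in seq_len(n)) {
      m <- is.na(x[i, ])
      if (!any(m)) next
      o <- !m
      # Regression of the missing part on the observed part.
      B <- S[m, o, drop = FALSE] %*% solve(S[o, o, drop = FALSE])
      xhat[i, m] <- mu[m] + drop(B %*% (x[i, o] - mu[o]))
      C[m, m] <- C[m, m] + S[m, m, drop = FALSE] - B %*% S[o, m, drop = FALSE]
    }
    # M-step: update the mean and the (MLE, divisor n) covariance.
    mu <- colMeans(xhat)
    xc <- sweep(xhat, 2, mu)
    S <- (crossprod(xc) + C) / n
  }
  list(mu = mu, S = S)
}

set.seed(1)
x <- matrix(rnorm(200), 50, 4)
x_miss <- x
x_miss[cbind(1:10, rep(1:4, length.out = 10))] <- NA   # one NA in each of 10 rows
fit <- em_mvn(x_miss, maxits = 5)
fit$mu
fit$S
```

On complete data the sketch reduces to the sample mean and the covariance matrix with divisor n, which is a quick way to check it.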

References

Danilov, M., Yohai, V.J., Zamar, R.H. (2012). Robust Estimation of Multivariate Location and Scatter in the Presence of Missing Data. Journal of the American Statistical Association 107, 1178-1186.

Leung, A. and Zamar, R.H. (2016). Multivariate Location and Scatter Matrix Estimation Under Cellwise and Casewise Contamination. Submitted.

Rubin, D.B. and Little, R.J.A. (2002). Statistical analysis with missing data (2nd ed.). New York: Wiley.