Computes a robust and deterministic multivariate location and scatter
estimate with a high breakdown point, using the DetMCD
(Deterministic Minimum Covariance Determinant) algorithm.
DetMCD(X,h=NULL,alpha=0.75,scale_est="Auto",tol=1e-07)
a numeric matrix or data frame. Missing values (NaN's) and infinite values (Inf's) are allowed: observations (rows) with missing or infinite values will automatically be excluded from the computations.
(Possibly a vector of) numeric parameters controlling the size of the subsets over
which the determinant is minimized, i.e., alpha*n
observations are used for computing the determinant. Allowed
values are between 0.5 and 1 and the default is 0.75. Ignored if h!=NULL.
integer parameter controlling the size of the subsets over
which the determinant is minimized, i.e., h
observations are used for computing the determinant. Allowed
values are between [(n+p+1)/2] and n, and the default is NULL.
a character string specifying the
variance functional. Possible values are "qn", "tau" and "Auto".
The default, "Auto", uses the Qn
estimator for data with fewer than 1000 observations and the
tau-scale for data sets with more observations. The Qn estimator ("qn")
or the tau-scale ("tau") can also be requested explicitly.
a small positive numeric value to be used for determining numerical 0.
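The interplay between the subset-size and scale arguments described above can be sketched in base R. The helper `resolve_args` is hypothetical and not part of the package: it only restates the documented rules (lower bound on h, alpha ignored when h is given, the "Auto" switch at 1000 observations). The rounding applied to alpha*n is an assumption and may differ from what the package does internally.

```r
# Hypothetical helper, not part of DetMCD: restates the documented rules only.
resolve_args <- function(n, p, h = NULL, alpha = 0.75, scale_est = "Auto") {
  h_min <- floor((n + p + 1) / 2)          # smallest allowed subset size
  if (is.null(h)) {
    stopifnot(all(alpha >= 0.5), all(alpha <= 1))
    # alpha * n rule from the text; the rounding used here is an assumption
    h <- pmin(n, pmax(h_min, ceiling(alpha * n)))
  } else {
    # alpha is ignored when h is supplied
    stopifnot(all(h >= h_min), all(h <= n))
  }
  if (scale_est == "Auto") {
    # Qn for fewer than 1000 observations, tau-scale otherwise
    scale_est <- if (n < 1000) "qn" else "tau"
  }
  list(h_min = h_min, h = h, scale_est = scale_est)
}

# A vector of alpha values yields one subset size per value
res <- resolve_args(n = 101, p = 5, alpha = c(0.5, 0.75, 1))
```

Clamping to the lower bound matters for small alpha: with n = 101 and p = 5, alpha = 0.5 would give fewer than [(n+p+1)/2] points without it.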
A list with components:
The raw MCD location of the data.
The raw MCD covariance matrix (multiplied by a consistency factor).
The determinant of the raw MCD covariance matrix.
The robust distance of each observation to the raw MCD center, relative to the raw MCD scatter estimate.
Weights based on the estimated raw covariance matrix 'raw.cov' and the estimated raw location 'raw.center' of the data. These weights determine which observations are used to compute the final MCD estimates.
The robust location of the data, obtained after reweighting.
The robust covariance matrix, obtained after reweighting.
The number of observations that have determined the MCD estimator, i.e. the value of h.
The identifier of the initial shape estimate which led to the optimal result.
The subset of h points whose covariance matrix has minimal determinant.
The final vector of weights.
The robust distance of each observation to the final, reweighted MCD center of the data, relative to the reweighted MCD scatter of the data. These distances allow us to easily identify the outliers.
The Mahalanobis distance of each observation (distance from the classical center of the data, relative to the classical shape of the data).
Same as the X in the call to DetMCD, without rows containing missing or infinite values.
The vector of values of alpha used in the algorithm.
The vector of scale estimators used in the estimates (one of tau2
or qn).
DetMCD computes the MCD estimator of a multivariate data set in a deterministic way. This estimator is given by the subset of h observations with the smallest covariance determinant. The MCD location estimate is then the mean of those h points, and the MCD scatter estimate is their covariance matrix. The default value of h is roughly 0.75n (where n is the total number of observations), but the user may choose any allowed value between roughly n/2 and n. Based on the raw estimates, weights are assigned to the observations such that outliers get zero weight. The reweighted MCD estimator is then given by the mean and covariance matrix of the cases with non-zero weight.
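The reweighting step can be illustrated with a minimal base R sketch. It does not call DetMCD: the raw fit is replaced by a stand-in (the mean and covariance of the rows known to be clean), and the chi-square cutoff at the 0.975 quantile is an assumption chosen for illustration, not a value taken from this page.

```r
set.seed(1)
n <- 100; p <- 3
x <- matrix(rnorm(n * p), ncol = p)
x[1:10, ] <- x[1:10, ] + 8            # plant 10 shifted outliers

# Stand-in for the raw MCD fit: mean and covariance of the clean rows.
# (Assumption for illustration; DetMCD would estimate these from an h-subset.)
raw_center <- colMeans(x[11:n, ])
raw_cov    <- cov(x[11:n, ])

# Squared distances of all observations to the raw fit
d2 <- mahalanobis(x, raw_center, raw_cov)

# Zero weight for observations beyond a chi-square cutoff (0.975 is assumed)
w <- as.numeric(d2 <= qchisq(0.975, df = p))

# Reweighted estimates: mean and covariance of the cases with non-zero weight
rew_center <- colMeans(x[w == 1, , drop = FALSE])
rew_cov    <- cov(x[w == 1, , drop = FALSE])
```

Because the weights are hard 0/1 values, the reweighted estimates are ordinary sample moments of the retained rows, which is why they are typically more efficient than the raw h-subset estimates.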
To compute the MCD estimator, six initial robust h-subsets are constructed based on robust transformations of variables or robust and fast-to-compute estimators of multivariate location and shape. Then C-steps are applied on these h-subsets until convergence. Note that the resulting algorithm is not fully affine equivariant, but it is often faster than the FAST-MCD algorithm, which is affine equivariant. Note also that this function cannot handle exact fit situations: if the raw covariance matrix is singular, the program is stopped. In that case, it is recommended to apply the FastMCD function.
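A C-step (concentration step) can be sketched in a few lines of base R. The functions `c_step` and `run_c_steps` are hypothetical names for this illustration, and the starting subset is drawn at random rather than built from one of the six deterministic initial estimates, so this shows the iteration only, not the DetMCD algorithm itself.

```r
# One C-step: refit on the current subset, then keep the h closest observations.
c_step <- function(x, idx) {
  h      <- length(idx)
  center <- colMeans(x[idx, , drop = FALSE])
  scatter <- cov(x[idx, , drop = FALSE])
  d2 <- mahalanobis(x, center, scatter)
  sort(order(d2)[seq_len(h)])
}

# Iterate C-steps until the subset no longer changes (or max_iter is reached).
run_c_steps <- function(x, idx, max_iter = 100) {
  for (i in seq_len(max_iter)) {
    new_idx <- c_step(x, idx)
    if (identical(new_idx, idx)) break
    idx <- new_idx
  }
  idx
}

set.seed(2)
n <- 60; p <- 2; h <- 45
x <- matrix(rnorm(n * p), ncol = p)
x[1:8, ] <- x[1:8, ] + 6                 # plant a few outliers

start <- sort(sample(n, h))              # arbitrary start (assumption, see above)
final <- run_c_steps(x, start)

# Each C-step cannot increase the covariance determinant of the subset
det_start <- det(cov(x[start, ]))
det_final <- det(cov(x[final, ]))
```

The singular-covariance case mentioned above would surface here as an error from `mahalanobis`, since it has to invert the subset covariance matrix.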
The MCD method is intended for continuous variables, and assumes that the number of observations n is at least 5 times the number of variables p. If p is too large relative to n, it would be better to first reduce p by variable selection or robust principal components (see the function PcaHubert).
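A pre-check of this rule of thumb might look as follows. The helper `check_mcd_dims` is hypothetical and not part of the package. It counts only rows without missing or infinite values, on the assumption that n should be measured after such rows are excluded.

```r
# Hypothetical pre-check, not part of DetMCD.
check_mcd_dims <- function(X, ratio = 5) {
  X    <- as.matrix(X)
  keep <- apply(is.finite(X), 1, all)     # rows without NA, NaN or Inf
  n    <- sum(keep)
  p    <- ncol(X)
  list(n = n, p = p, ok = n >= ratio * p, X = X[keep, , drop = FALSE])
}

# Six rows, two of which contain a missing or infinite value
X <- matrix(c(1, 2, NA, 4, 5, 6, Inf, 8, 9, 10, 11, 12), ncol = 2)
chk <- check_mcd_dims(X)
```

When `chk$ok` is FALSE, reducing the number of variables first, as suggested above, is the safer route.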
Hubert, M., Rousseeuw, P.J. and Verdonck, T. (2012), "A deterministic algorithm for robust location and scatter", Journal of Computational and Graphical Statistics, Volume 21, Number 3, Pages 618--637.
Verboven, S., Hubert, M. (2010). Matlab library LIBRA, Wiley Interdisciplinary Reviews: Computational Statistics, 2, 509--515.
# NOT RUN {
## generate data
set.seed(1234) # for reproducibility
alpha<-0.5
n<-101
p<-5
# generate correlated data
D<-diag(rchisq(p,df=1))
W<-matrix(0.9,p,p)
diag(W)<-1
W<-D%*%W%*%t(D)
V<-chol(W)
x<-matrix(rnorm(n*p),ncol=p)
x<-scale(x)%*%V
result<-DetMCD(x,scale_est="tau",alpha=alpha)
plot(result, which = "dd")
# compare to robustbase:
library(robustbase) # provides covMcd()
result<-DetMCD(x,scale_est="qn",alpha=alpha)
resultsRR<-covMcd(x,nsamp='deterministic',scalefn=qn,alpha=alpha)
# should be the same:
result$crit
resultsRR$crit
# example with several values of alpha:
alphas<-seq(0.5,1,length.out=6)
results<-DetMCD(x,scale_est="qn",alpha=alphas)
plot(results, h.val = 2, which = "dd")
# }