This function aims to detect cellwise outliers in the data. These are entries in the data matrix that are substantially higher or lower than expected given the other cells in the same column and the other cells in the same row, taking the relations between the columns into account. Note that this function first calls checkDataSet and analyzes the remaining cleaned data.
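The idea of a cell that only looks deviating once the relations between columns are taken into account can be illustrated with a toy sketch in base R. This is not the DDC algorithm: the two simulated columns, the median-of-ratios slope, and the cutoff \(\sqrt{\chi^2_{1,0.99}}\) are assumptions chosen for illustration only.

```r
set.seed(1)
n   <- 500
rho <- 0.9
x1  <- rnorm(n)
x2  <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)

# Plant one pair whose values are unremarkable within their own columns
# but inconsistent with the strong positive correlation.
x1[5] <- 1.5
x2[5] <- -1.5

# Robust standardization of each column by median and MAD.
z1 <- (x1 - median(x1)) / mad(x1)
z2 <- (x2 - median(x2)) / mad(x2)

# Illustrative cutoff (assumption): square root of the 0.99 chi-square quantile.
cutoff <- sqrt(qchisq(0.99, df = 1))

# Toy robust slope through the origin: median of the pairwise ratios.
slope     <- median(z2 / z1)
resid     <- z2 - slope * z1          # residual after predicting z2 from z1
std_resid <- resid / mad(resid)       # robustly standardized residual

marginal_flag   <- abs(z2[5]) > cutoff         # column-only view
relational_flag <- abs(std_resid[5]) > cutoff  # view using the other column
print(c(marginal = marginal_flag, relational = relational_flag))
```

Looking at the second column alone does not single out row 5, whereas the residual from the prediction based on the first column does.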
DDC(X, DDCpars = list())
A list with components:
DDCpars
The list of options used.
colInAnalysis
The column indices of the columns used in the analysis.
rowInAnalysis
The row indices of the rows used in the analysis.
namesNotNumeric
The names of the variables which are not numeric.
namesCaseNumber
The name of the variable(s) which contained the case numbers and was therefore removed.
namesNAcol
Names of the columns left out due to too many NAs.
namesNArow
Names of the rows left out due to too many NAs.
namesDiscrete
Names of the discrete variables.
namesZeroScale
Names of the variables with zero scale.
remX
Cleaned data after checkDataSet.
locX
Estimated location of X.
scaleX
Estimated scales of X.
Z
Standardized remX.
nbngbrs
Number of neighbors used in estimation.
ngbrs
Indicates neighbors of each column, i.e. the columns most correlated with it.
robcors
Robust correlations.
robslopes
Robust slopes.
deshrinkage
The deshrinkage factor used for every connected (i.e. non-standalone) column of X.
Xest
Predicted X.
scalestres
Scale estimate of the residuals X - Xest.
stdResid
Residuals of the original X minus the estimated Xest, standardized by column.
indcells
Indices of the cells which were flagged in the analysis.
Ti
Outlyingness value of each row.
medTi
Median of the Ti values.
madTi
MAD (median absolute deviation) of the Ti values.
indrows
Indices of the rows which were flagged in the analysis.
indNAs
Indices of all NA cells.
indall
Indices of all cells which were flagged in the analysis plus all cells in flagged rows plus the indices of the NA cells.
Ximp
Imputed X.
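As a minimal sketch of how the cell indices might be turned into readable row and column labels, the following base R code uses a small made-up matrix in place of stdResid. It assumes that indcells holds column-major linear indices (as returned by which() on a matrix) and that the cutoff is \(\sqrt{\chi^2_{1,\,tolProb}}\); both should be checked against the installed package version.

```r
# Hypothetical stand-in for the stdResid component of a DDC result.
stdResid <- matrix(c(0.3, -4.1, 0.8,
                     1.2,  0.1, 5.0),
                   nrow = 3,
                   dimnames = list(paste0("r", 1:3), c("a", "b")))

# Assumed cutoff for tolProb = 0.99.
cutoff <- sqrt(qchisq(0.99, df = 1))

# Linear (column-major) indices of the cells exceeding the cutoff.
indcells <- which(abs(stdResid) > cutoff)

# Convert linear indices to (row, column) positions and attach labels.
pos <- arrayInd(indcells, dim(stdResid))
flagged <- data.frame(
  row      = rownames(stdResid)[pos[, 1]],
  column   = colnames(stdResid)[pos[, 2]],
  stdResid = stdResid[indcells]
)
print(flagged)
```

With an actual result, the same arrayInd() call could be applied to the indcells component together with dim() of remX.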
X
is the input data, and must be an \(n\) by \(d\) matrix or a data frame.
DDCpars
A list of available options:
fracNA
Only consider columns and rows with a fraction of NAs (missing values) below this value. Defaults to \(0.5\).
numDiscrete
A column that takes on numDiscrete or fewer values will be considered discrete and not used in the analysis. Defaults to \(3\).
precScale
Only consider columns whose scale is larger than precScale. Here scale is measured by the median absolute deviation. Defaults to \(10^{-12}\).
cleanNAfirst
If "columns", first columns then rows are checked for NAs. If "rows", first rows then columns are checked for NAs. "automatic" checks columns first if \(d \geq 5n\) and rows first otherwise. Defaults to "automatic".
tolProb
Tolerance probability, with default \(0.99\), which
determines the cutoff values for flagging outliers in
several steps of the algorithm.
corrlim
When trying to estimate \(z_{ij}\) from other variables \(h\), we
will only use variables \(h\) with \(|\rho_{j,h}| \ge corrlim\).
Variables \(j\) without any correlated variables \(h\) satisfying
this are considered standalone, and treated on their own. Defaults to \(0.5\).
combinRule
The operation to combine estimates of \(z_{ij}\) coming from other variables \(h\): can be "mean", "median", "wmean" (weighted mean) or "wmedian" (weighted median). Defaults to "wmean".
returnBigXimp
If TRUE, the imputed data matrix Ximp in the output will include the rows and columns that were not part of the analysis (and can still contain NAs). Defaults to FALSE.
silent
If TRUE, statements tracking the algorithm's progress will not be printed. Defaults to FALSE.
nLocScale
When estimating location or scale from more than nLocScale data values, the computation is based on a random sample of size nLocScale to save time. When nLocScale = 0 all values are used. Defaults to 25000.
fastDDC
Whether to use the fastDDC option or not. The fastDDC algorithm uses approximations that make it possible to handle high-dimensional data. Defaults to TRUE for \(d > 750\) and FALSE otherwise.
standType
The location and scale estimators used for robust standardization. Should be one of "1stepM", "mcd" or "wrap". See estLocScale for more info. Only used when fastDDC = FALSE. Defaults to "1stepM".
corrType
The correlation estimator used to find the neighboring variables. Must be one of "wrap" (wrapping correlation), "rank" (Spearman correlation) or "gkwls" (Gnanadesikan-Kettenring correlation followed by weighting). Only used when fastDDC = FALSE. Defaults to "gkwls".
transFun
The transformation function used to compute the robust correlations when fastDDC = TRUE. Can be "wrap" or "rank". Defaults to "wrap".
nbngbrs
When fastDDC = TRUE, each column is predicted from at most nbngbrs columns correlated to it. Defaults to 100.
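The options above are passed as a named list. The sketch below builds such a list with a few of the documented defaults made explicit, and restates the documented "automatic" rule for cleanNAfirst as a small helper. The helper name clean_order and the simulated data are hypothetical, introduced only for illustration; the DDC call runs only if the cellWise package is installed.

```r
# A few documented options, set explicitly to their documented defaults
# (except silent, which is switched on to suppress progress messages).
pars <- list(fracNA = 0.5,
             numDiscrete = 3,
             tolProb = 0.99,
             corrlim = 0.5,
             combinRule = "wmean",
             silent = TRUE)

# Hypothetical helper restating the documented "automatic" rule:
# columns are checked first if d >= 5n, rows first otherwise.
clean_order <- function(n, d) if (d >= 5 * n) "columns" else "rows"
print(clean_order(n = 50, d = 20))

if (requireNamespace("cellWise", quietly = TRUE)) {
  set.seed(1)
  common <- rnorm(100)
  # Five correlated columns sharing a common factor (made-up data).
  X <- sapply(1:5, function(j) common + 0.3 * rnorm(100))
  X[3, 2] <- 10                       # plant one deviating cell
  fit <- cellWise::DDC(X, DDCpars = pars)
  print(fit$DDCpars$tolProb)          # options actually used
  print(head(fit$indcells))           # indices of flagged cells
}
```

Options left out of the list keep their defaults, so in practice only the entries that differ from the defaults need to be supplied.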
Raymaekers J., Rousseeuw P.J., Van den Bossche W.
Rousseeuw, P.J., Van den Bossche, W. (2018). Detecting Deviating Data Cells. Technometrics, 60(2), 135-145.
Raymaekers, J., Rousseeuw, P.J. (2019). Fast robust correlation for high dimensional data. Technometrics, 63(2), 184-198.
checkDataSet, cellMap
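library(cellWise)
library(MASS); set.seed(12345)
n <- 50; d <- 20
# Equicorrelated covariance matrix: 0.9 off the diagonal, 1 on the diagonal
A <- matrix(0.9, d, d); diag(A) <- 1
x <- mvrnorm(n, rep(0, d), A)
# Contaminate the data: 50 random cells each set to NA, 10 and -10
x[sample(1:(n * d), 50, FALSE)] <- NA
x[sample(1:(n * d), 50, FALSE)] <- 10
x[sample(1:(n * d), 50, FALSE)] <- -10
# Prepend a column of case numbers
x <- cbind(1:n, x)
DDCx <- DDC(x)
# Draw a cell map of the standardized residuals
cellMap(DDCx$stdResid)
# For more examples, we refer to the vignette:
if (FALSE) {
vignette("DDC_examples")
}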