The function implements a simple, automatic outlier detection method, suitable for high-dimensional data, that treats each class independently and uses a statistically principled threshold for outliers. The algorithm can detect both mislabeled and abnormal samples without reference to other classes.
OutlierPCDist(x, ...)
# S3 method for default
OutlierPCDist(x, grouping, control, k, explvar, trace=FALSE, ...)
# S3 method for formula
OutlierPCDist(formula, data, ..., subset, na.action)
An S4 object of class OutlierPCDist, which is a subclass of the virtual class Outlier.
formula: a formula with no response variable, referring only to numeric variables.
data: an optional data frame (or similar: see model.frame) containing the variables in the formula.
subset: an optional vector used to select rows (observations) of the data matrix x.
na.action: a function which indicates what should happen when the data contain NAs. The default is set by the na.action setting of options, and is na.fail if that is unset. The 'factory-fresh' default is na.omit.
...: arguments passed to or from other methods.
x: a matrix or data frame.
grouping: grouping variable: a factor specifying the class for each observation.
control: a control object (S4) for one of the available control classes, e.g. CovControlMcd-class, CovControlOgk-class, CovControlSest-class, etc., containing estimation options. The class of this object defines which estimator will be used. Alternatively, a character string can be specified which names the estimator - one of auto, sde, mcd, ogk, m, mve, sfast, surreal, bisquare, rocke. If 'auto' is specified or the argument is missing, the function will select the estimator (see below for details). See also the sketch after this argument list.
k: number of components to select for PCA. If missing, the number of components will be calculated automatically.
explvar: minimal explained variance to be used for calculation of the number of components in PCA. If explvar is not provided, automatic dimensionality selection using profile likelihood, as proposed by Zhu and Ghodsi (2006), will be used (also illustrated in the sketch after this argument list).
trace: whether to print intermediate results. Default is trace = FALSE.
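The calls below sketch how the control, k and explvar arguments could be supplied; the argument values are purely illustrative, and the passing of these arguments through the formula interface is assumed here rather than prescribed.
data(hemophilia)
# let the function choose the estimator and the number of components
OutlierPCDist(gr~., data=hemophilia)
# name an estimator by a character string, or pass a control object
OutlierPCDist(gr~., data=hemophilia, control="mcd")
OutlierPCDist(gr~., data=hemophilia, control=CovControlMcd(alpha=0.75))
# fix the number of PCA components, or require 95% explained variance
OutlierPCDist(gr~., data=hemophilia, k=2)
OutlierPCDist(gr~., data=hemophilia, explvar=0.95)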
Valentin Todorov <valentin.todorov@chello.at>
If the data set consists of two or more classes (specified by the grouping variable grouping), the proposed method iterates through the classes present in the data, separates each class from the rest and identifies the outliers relative to this class, thus treating both types of outliers, the mislabeled and the abnormal samples, in a homogeneous way.
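The following is a minimal, illustrative sketch of this per-class scheme, not the code used by OutlierPCDist: each class is taken in turn, reduced by classical PCA, and its observations are flagged with Mahalanobis distances against a chi-square cutoff. The helper flag_by_class and the 0.99 variance threshold are made up for illustration only.
flag_by_class <- function(x, grouping, quantile=0.975) {
    flags <- rep(TRUE, nrow(x))                 # TRUE = regular observation
    for (g in levels(grouping)) {
        idx  <- which(grouping == g)
        pc   <- prcomp(x[idx, , drop=FALSE], scale.=TRUE)
        evar <- cumsum(pc$sdev^2) / sum(pc$sdev^2)
        k    <- max(1, sum(evar < 0.99))        # components kept for this class
        sc   <- pc$x[, 1:k, drop=FALSE]
        md2  <- mahalanobis(sc, colMeans(sc), cov(sc))
        flags[idx] <- md2 <= qchisq(quantile, df=k)   # flag relative to this class only
    }
    flags
}
data(hemophilia)
x  <- hemophilia[, c("AHFactivity", "AHFantigen")]
gr <- as.factor(hemophilia$gr)
table(flag_by_class(x, gr), gr)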
The first step of the algorithm is dimensionality reduction using (classical) PCA. The number of components to select can be provided by the user but, if missing, the number of components will be calculated either using the provided minimal explained variance or by the automatic dimensionality selection using profile likelihood, as proposed by Zhu and Ghodsi (2006).
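As an illustration of that selection step (a sketch of the Zhu and Ghodsi (2006) idea, not the internal implementation; the helper name zg_dimension is made up and USArrests is used only as a convenient base-R data set), one can split the ordered eigenvalues at every candidate dimension, model the two groups as normal with a common variance, and keep the split with the largest profile log-likelihood:
zg_dimension <- function(eigenvalues) {
    d  <- sort(eigenvalues, decreasing=TRUE)
    p  <- length(d)
    ll <- rep(NA_real_, p - 1)
    for (q in seq_len(p - 1)) {
        g1 <- d[1:q]
        g2 <- d[(q + 1):p]
        # pooled variance of the two groups (guarded against zero)
        s2 <- max((sum((g1 - mean(g1))^2) + sum((g2 - mean(g2))^2)) / p,
                  .Machine$double.eps)
        ll[q] <- sum(dnorm(g1, mean(g1), sqrt(s2), log=TRUE)) +
                 sum(dnorm(g2, mean(g2), sqrt(s2), log=TRUE))
    }
    which.max(ll)                               # estimated number of components
}
# eigenvalues of a classical PCA of a generic numeric data set
ev <- prcomp(USArrests, scale.=TRUE)$sdev^2
zg_dimension(ev)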
Shieh, A.D. and Hung, Y.S. (2009). Detecting Outlier Samples in Microarray Data. Statistical Applications in Genetics and Molecular Biology, 8.
Zhu, M. and Ghodsi, A. (2006). Automatic dimensionality selection from the scree plot via the use of profile likelihood. Computational Statistics & Data Analysis, 51, 918--930.
Filzmoser, P. and Todorov, V. (2013). Robust tools for the imperfect world. Information Sciences, 245, 4--20. doi:10.1016/j.ins.2012.10.017.
OutlierPCDist, Outlier
data(hemophilia)
obj <- OutlierPCDist(gr~., data=hemophilia)
obj
getDistance(obj) # returns an array of distances
getClassLabels(obj, 1) # returns an array of indices for a given class
getCutoff(obj) # returns an array of cutoff values (for each class, usually equal)
getFlag(obj) # returns a 0/1 array of flags
plot(obj, class=2) # standard plot function