outlyingness: Stahel-Donoho outlyingness of points relative to a dataset

Description

Computes the Stahel-Donoho outlyingness (SDO) of \(p\)-dimensional points z relative to a \(p\)-dimensional dataset x. For each multivariate point \(z_i\), its outlyingness relative to x is defined as its maximal univariate Stahel-Donoho outlyingness measured over all directions. To obtain the univariate Stahel-Donoho outlyingness in the direction \(v\), the dataset x is projected on \(v\), and the robustly standardized distance of \(v'z_i\) to the robust center of the projected data points x\(v\) is computed.

Usage

outlyingness(x, z = NULL, options = list())

Value

A list with components:

outlyingnessX: Vector of length \(n\) giving the outlyingness of the observations in x.
outlyingnessZ: Vector of length \(m\) giving the outlyingness of the points in z.
cutoff: Points whose outlyingness exceeds this cutoff can be considered as outliers with respect to x.
flagX: Observations of x whose outlyingness exceeds the cutoff value receive a flag FALSE, regular observations receive a flag TRUE.
flagZ: Points of z whose outlyingness exceeds the cutoff value receive a flag equal to FALSE, otherwise they receive a flag TRUE.
singularSubsets: When the input parameter type is equal to "Affine", the number of \(p\)-subsets that span a subspace of dimension smaller than \(p-1\). In such a case the orthogonal direction can not be uniquely determined. This is an indication that the data are not in general position. When the input parameter type is equal to "Rotation" it is possible that two randomly selected points of the data coincide due to ties in the data. In such a case this value signals how many times this happens.
dimension: When the data x are lying in a lower dimensional subspace, the dimension of this subspace.
hyperplane: When the data x are lying in a lower dimensional subspace, a direction orthogonal to this subspace. When a direction \(v\) is found such that the robust scale of \(xv\) is equal to zero, this equals \(v\).
inSubspace: When a direction \(v\) is found such that the robust scale of \(xv\) is zero, the observations from x which belong to the hyperplane orthogonal to \(v\) receive a value TRUE. The other observations receive a value FALSE.

Arguments

x

An \(n\) by \(p\) data matrix.

z

An optional \(m\) by \(p\) matrix containing rowwise the points \(z_i\) for which to compute the outlyingness. If z is not specified, it is set equal to x.

options

A list of available options:

type
Determines the desired type of invariance and should be one of "Affine", "Rotation" or "Shift". When the option "Affine" is used, the directions \(v\) are orthogonal to hyperplanes spanned by \(p\) observations from x. When the option "Rotation" is used, the directions pass by two randomly selected observations from x. With the option "Shift", directions are randomly generated.
Defaults to "Affine".
ndir
Determines the number of directions \(v\) by setting ndir to a specific number or to "all". In the latter case, an exhaustive search over all possible directions (according to type) is performed. If ndir is larger than the number of possible directions, the algorithm will automatically use this setting.
Defaults to \(250p\) when type="Affine", to 5000 when type="Rotation" and to 12500 when type="Shift".
stand
Determines how to robustly standardize the projected data: "MedMad" uses the median and the MAD, "unimcd" uses the univariate MCD of location and scale.
Defaults to "MedMad".
centered
When the data matrix x is already centered, no robust center should be computed in each direction. In that case, centered should be set to TRUE.
Defaults to FALSE.
h
When the input argument stand is equal to unimcd, the parameter h controls the number of data points that define the MCD estimator (see covMcd in the robustbase package). This value should lie between \([n/2]+1\) and \(n\).
Defaults to \([n/2]+1\).
seed
A strictly positive integer specifying the seed to be used by the C++ code.
Defaults to \(10\).

Author

P. Segaert using C++ code by K. Vakili and P. Segaert.

Details

The Stahel-Donoho outlyingness has been introduced by Stahel (1981) and Donoho (1982). It is mostly suited to measure the degree of outlyingness of multivariate points with respect to a data cloud from an elliptical distribution.

Depending on the dimension \(p\), different approximate algorithms are implemented. The affine invariant algorithm can only be used when \(n > p\). It draws ndir times at random \(p\) observations from x and considers the direction orthogonal to the hyperplane spanned by these \(p\) observations. At most \(p\) out of \(n\) directions can be considered. The orthogonal invariant version can be applied to high-dimensional data. It draws ndir times at random 2 observations from x and considers the direction through these observations. Here, at most 2 out of \(n\) directions can be considered. Finally, the shift invariant version randomly draws ndir vectors from the unit sphere.

The resulting Stahel-Donoho outlyingness values are invariant to affine transformations, rotations and shifts respectively provided that the seed is kept fixed at different runs of the algorithm. Note that the SDO values are guaranteed to increase when more directions are considered provided the seed is kept fixed, as this ensures that the random directions are generated in a fixed order.

An observation from x and z is flagged as an outlier if its SDO exceeds a cutoff value. This cutoff value is determined using the procedure in Rousseeuw et al. (2018). First, the logarithm of the SDO values is taken to render their distribution more symmetric, after which a normal approximation yields a cutoff on these values. The cutoff is then transformed back by applying the exponential function.

It is first checked whether the data lie in a subspace of dimension smaller than \(p\). If so, a warning is given, as well as the dimension of the subspace and a direction which is orthogonal to it. Moreover, from the definition of the Stahel-Donoho outlyingness it follows that the outlyingness is ill-defined when the robust scale of the data projected on the direction \(v\) equals zero. In this case the algorithm will stop and give a warning. The returned values then include the direction \(v\) as well as an indicator specifying which of the observations of x belong to the hyperplane orthogonal to \(v\).

References

Stahel W.A. (1981). Robuste Schatzungen: infinitesimale Optimalitat und Schatzungen von Kovarianzmatrizen. PhD Thesis, ETH Zurich.

Donoho D.L. (1982). Breakdown properties of multivariate location estimators. Ph.D. Qualifying paper, Dept. Statistics, Harvard University, Boston.

Maronna R.A., Yohai V. (1995). The behavior of the Stahel-Donoho robust multivariate estimator. Journal of the American Statistical Association, 90, 330--341.

Rousseeuw P.J., Raymaekers J., Hubert M., (2018). A measure of directional outlyingness with applications to image data and video. Journal of Computational and Graphical Statistics, 27, 345--359.

Examples

Run this code

# Compute the outlyingness of a simple two-dimensional dataset.
# Outliers are plotted in red.

if (requireNamespace("robustbase", quietly = TRUE)) {
    BivData <- log(robustbase::Animals2)
} else {
  BivData <- matrix(rnorm(120), ncol = 2)
  BivData <- rbind(BivData, matrix(c(6,6,6,-2), ncol = 2))
}

Result <- outlyingness(x = BivData)
IndOutliers <- which(!Result$flagX)
plot(BivData)
points(BivData[IndOutliers,], col = "red")

# The number of directions may be specified through
# the option list. The resulting outlyingness is
# monotone increasing in the number of directions.
Result1 <- outlyingness(x = BivData,
                        options = list(ndir = 50)
                        )
Result2 <- outlyingness(x = BivData,
                        options = list(ndir = 100)
                        )
which(Result2$outlyingnessX - Result1$outlyingnessX < 0)
# This is however not the case when the seed is changed
Result1 <- outlyingness(x = BivData,
                        options = list(ndir = 50)
                        )
Result2 <- outlyingness(x = BivData,
                        options = list(ndir = 100,
                                       seed = 950)
                        )
plot(Result2$outlyingnessX - Result1$outlyingnessX,
     xlab = "Index", ylab = "Difference in outlyingness")

# We can also consider directions through two data
# points. If the sample is small enough one may opt
# to search over all choose(n,2) directions.
# Note that the computational load increases considerably
# as n becomes larger.
Result <- outlyingness(x = BivData,
                       options = list(type = "Rotation",
                                      ndir = "all")
                       )
IndOutliers <- which(!Result$flagX)
plot(BivData)
points(BivData[IndOutliers,], col = "red")

# Alternatively one may consider randomly generated directions.
Result <- outlyingness(x = BivData,
                       options = list(type = "Shift",
                                      ndir = 1000)
                      )
IndOutliers <- which(!Result$flagX)
plot(BivData)
points(BivData[IndOutliers,], col = "red")

# The default option of using the MAD for the scale may be
# changed to using the univariate mcd.
Result <- outlyingness(x = BivData,
                       options = list(type = "Affine",
                                      ndir = 1000,
                                      stand = "unimcd",
                                      h = 0.75*nrow(BivData))
                      )
IndOutliers <- which(!Result$flagX)
plot(BivData)
points(BivData[IndOutliers,], col = "red")

Run the code above in your browser using DataLab