Checks for and locates influential observations (i.e., "outliers") via several distance and/or clustering methods. If several methods are selected, the returned "Outlier" vector is a composite outlier score, computed as the average of the binary (0 or 1) results of each method. It represents the probability that each observation is classified as an outlier by at least one method. The default decision rule classifies as outliers those observations whose composite outlier score is greater than or equal to 0.5 (i.e., observations that were classified as outliers by at least half of the methods). See the Details section below for a description of the methods.
check_outliers(x, ...)

# S3 method for default
check_outliers(
  x,
  method = c("cook", "pareto"),
  threshold = NULL,
  ID = NULL,
  verbose = TRUE,
  ...
)

# S3 method for numeric
check_outliers(x, method = "zscore_robust", threshold = NULL, ...)

# S3 method for data.frame
check_outliers(x, method = "mahalanobis", threshold = NULL, ID = NULL, ...)

# S3 method for performance_simres
check_outliers(
  x,
  type = "default",
  iterations = 100,
  alternative = "two.sided",
  ...
)
A logical vector of the detected outliers with a nice printing method: a check (message) on whether outliers were detected or not. The information on the distance measure and whether or not an observation is considered an outlier can be recovered with the as.data.frame function. Note that the function will (silently) return a vector of FALSE for non-supported data types such as character strings.
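For example, the per-observation information can be recovered like this (a minimal sketch using the built-in mtcars data):

out <- check_outliers(lm(mpg ~ wt + hp, data = mtcars))
out                 # prints whether outliers were detected
as.data.frame(out)  # distance measures and outlier status per observation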
A model, a data.frame, a performance_simres object returned by simulate_residuals(), or a DHARMa object.
When method = "ics"
, further arguments in ...
are passed
down to ICSOutlier::ics.outlier()
. When method = "mahalanobis"
,
they are passed down to stats::mahalanobis()
. percentage_central
can
be specified when method = "mcd"
. For objects of class performance_simres
or DHARMa
, further arguments are passed down to DHARMa::testOutliers()
.
The outlier detection method(s). Can be "all" or some of "cook", "pareto", "zscore", "zscore_robust", "iqr", "ci", "eti", "hdi", "bci", "mahalanobis", "mahalanobis_robust", "mcd", "ics", "optics" or "lof".
A list containing the threshold values for each method (e.g. list('mahalanobis' = 7, 'cook' = 1)), above which an observation is considered an outlier. If NULL, default values will be used (see 'Details'). If a numeric value is given, it will be used as the threshold for any of the methods run.
Optional, to report an ID column along with the row number.
Toggle warnings.
Type of method to test for outliers. Can be one of "default", "binomial" or "bootstrap". Only applies when x is an object returned by simulate_residuals() or of class DHARMa. See 'Details' in ?DHARMa::testOutliers for a detailed description of the types.
Number of simulations to run.
A character string specifying the alternative hypothesis.
Cook's Distance:
Among outlier detection methods, Cook's distance and leverage are less common than the basic Mahalanobis distance, but are still used. Cook's distance estimates the variations in regression coefficients after removing each observation, one by one (Cook, 1977). Since Cook's distance is in the metric of an F distribution with p and n-p degrees of freedom, the median point of the quantile distribution can be used as a cut-off (Bollen, 1985). A common approximation or heuristic is to use 4 divided by the number of observations, which usually corresponds to a lower threshold (i.e., more outliers are detected). This only works for frequentist models. For Bayesian models, see pareto.
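For instance, a minimal sketch checking a frequentist regression model fitted to the built-in mtcars data:

model <- lm(mpg ~ wt + hp, data = mtcars)
check_outliers(model, method = "cook")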
Pareto:
The reliability and approximate convergence of Bayesian models can be assessed using the estimates for the shape parameter k of the generalized Pareto distribution. If the estimated tail shape parameter k exceeds 0.5, the user should be warned, although in practice the authors of the loo package observed good performance for values of k up to 0.7 (the default threshold used by performance).
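A sketch for a Bayesian model, assuming the rstanarm package is installed:

model <- rstanarm::stan_glm(mpg ~ wt, data = mtcars, refresh = 0)
check_outliers(model, method = "pareto")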
Z-scores ("zscore", "zscore_robust")
:
The Z-score, or standard score, is a way of describing a data point as
deviance from a central value, in terms of standard deviations from the mean
("zscore"
) or, as it is here the case ("zscore_robust"
) by
default (Iglewicz, 1993), in terms of Median Absolute Deviation (MAD) from
the median (which are robust measures of dispersion and centrality). The
default threshold to classify outliers is 1.959 (threshold = list("zscore" = 1.959)
),
corresponding to the 2.5% (qnorm(0.975)
) most extreme observations
(assuming the data is normally distributed). Importantly, the Z-score
method is univariate: it is computed column by column. If a data frame is
passed, the Z-score is calculated for each variable separately, and the
maximum (absolute) Z-score is kept for each observations. Thus, all
observations that are extreme on at least one variable might be detected
as outliers. Thus, this method is not suited for high dimensional data
(with many columns), returning too liberal results (detecting many outliers).
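For instance, a quick univariate check on a numeric vector:

x <- c(rnorm(100), 10)  # 100 standard-normal draws plus one extreme value
check_outliers(x, method = "zscore_robust")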
IQR ("iqr")
:
Using the IQR (interquartile range) is a robust method developed by John
Tukey, which often appears in box-and-whisker plots (e.g., in
ggplot2::geom_boxplot). The interquartile range is the range between the first
and the third quartiles. Tukey considered as outliers any data point that
fell outside of either 1.5 times (the default threshold is 1.7) the IQR below
the first or above the third quartile. Similar to the Z-score method, this is
a univariate method for outliers detection, returning outliers detected for
at least one column, and might thus not be suited to high dimensional data.
The distance score for the IQR is the absolute deviation from the median of
the upper and lower IQR thresholds. Then, this value is divided by the IQR
threshold, to “standardize” it and facilitate interpretation.
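For example, applying Tukey's classic 1.5 multiplier instead of the default 1.7:

check_outliers(mtcars$mpg, method = "iqr", threshold = 1.5)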
CI ("ci", "eti", "hdi", "bci")
:
Another univariate method is to compute, for each variable, some sort of
"confidence" interval and consider as outliers values lying beyond the edges
of that interval. By default, "ci"
computes the Equal-Tailed Interval
("eti"
), but other types of intervals are available, such as Highest
Density Interval ("hdi"
) or the Bias Corrected and Accelerated
Interval ("bci"
). The default threshold is 0.95
, considering
as outliers all observations that are outside the 95% CI on any of the
variable. See bayestestR::ci()
for more details
about the intervals. The distance score for the CI methods is the absolute
deviation from the median of the upper and lower CI thresholds. Then, this
value is divided by the difference between the upper and lower CI bounds
divided by two, to “standardize” it and facilitate interpretation.
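For instance, using a 90% Highest Density Interval:

check_outliers(mtcars$wt, method = "hdi", threshold = 0.90)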
Mahalanobis Distance:
Mahalanobis distance (Mahalanobis, 1930) is often used for multivariate outlier detection as this distance takes into account the shape of the observations. The default threshold is often arbitrarily set to some deviation (in terms of SD or MAD) from the mean (or median) of the Mahalanobis distance. However, as the Mahalanobis distance can be approximated by a Chi squared distribution (Rousseeuw and Van Zomeren, 1990), we can use the alpha quantile of the chi-square distribution with k degrees of freedom (k being the number of columns). By default, the alpha threshold is set to 0.025 (corresponding to the 2.5% most extreme observations; Cabana, 2019). This criterion is a natural extension of the median plus or minus a coefficient times the MAD method (Leys et al., 2013).
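For example, on several columns of a data frame:

check_outliers(mtcars[, c("mpg", "disp", "hp", "wt")], method = "mahalanobis")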
Robust Mahalanobis Distance:
A robust version of Mahalanobis distance using an Orthogonalized Gnanadesikan-Kettenring pairwise estimator (Gnanadesikan and Kettenring, 1972). Requires the bigutilsr package. See the bigutilsr::dist_ogk() function.
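A sketch, assuming the bigutilsr package is installed:

check_outliers(mtcars[, c("mpg", "disp", "hp", "wt")], method = "mahalanobis_robust")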
Minimum Covariance Determinant (MCD):
Another robust version of Mahalanobis. Leys et al. (2018) argue that Mahalanobis distance is not a robust way to determine outliers, as it uses the means and covariances of all the data - including the outliers - to determine individual difference scores. The Minimum Covariance Determinant calculates the mean and covariance matrix based on the most central subset of the data (by default, 66%), and is deemed to be a more robust method of identifying and removing outliers than regular Mahalanobis distance.
This method has a percentage_central argument that allows specifying the breakdown point (0.75, the default, is recommended by Leys et al. 2018, but a commonly used alternative is 0.50).
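For example, with a 50% breakdown point:

check_outliers(
  mtcars[, c("mpg", "disp", "hp", "wt")],
  method = "mcd",
  percentage_central = 0.50
)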
Invariant Coordinate Selection (ICS):
Outliers are detected using ICS, which by default uses an alpha threshold of 0.025 (corresponding to the 2.5% most extreme observations) as a cut-off value for outlier classification. Refer to the help-file of ICSOutlier::ics.outlier() to get more details about this procedure. Note that method = "ics" requires both ICS and ICSOutlier to be installed, and that it takes some time to compute the results. You can speed up computation time using parallel computing. Set the number of cores to use with options(mc.cores = 4) (for example).
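A sketch, assuming ICS and ICSOutlier are installed (this can take a while):

options(mc.cores = 2)  # parallelize to speed up computation
check_outliers(mtcars[, c("mpg", "disp", "hp", "wt")], method = "ics")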
OPTICS:
The Ordering Points To Identify the Clustering Structure (OPTICS) algorithm (Ankerst et al., 1999) uses concepts similar to DBSCAN (an unsupervised clustering technique that can be used for outlier detection). The threshold argument is passed as minPts, which corresponds to the minimum size of a cluster. By default, this size is set at 2 times the number of columns (Sander et al., 1998). Compared to the other techniques, which will always detect several outliers (as these are usually defined as a percentage of extreme values), this algorithm functions in a different manner and won't always detect outliers. Note that method = "optics" requires the dbscan package to be installed, and that it takes some time to compute the results.
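A sketch, assuming the dbscan package is installed (here the threshold is passed as minPts, the minimum cluster size):

check_outliers(mtcars[, c("mpg", "disp", "hp", "wt")], method = "optics")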
Local Outlier Factor:
Based on a K nearest neighbors algorithm, LOF compares the local density of a point to the local densities of its neighbors instead of computing a distance from the center (Breunig et al., 2000). Points that have a substantially lower density than their neighbors are considered outliers. A LOF score of approximately 1 indicates that density around the point is comparable to that of its neighbors. Scores significantly larger than 1 indicate outliers. The default threshold of 0.025 will classify as outliers the observations located at qnorm(1 - 0.025) * SD of the log-transformed LOF distance. Requires the dbscan package.
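A sketch, likewise assuming the dbscan package is installed:

check_outliers(mtcars[, c("mpg", "disp", "hp", "wt")], method = "lof")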
The approach for detecting outliers based on simulated residuals differs from the traditional methods and may not detect outliers as expected. Literally, this approach compares observed to simulated values. However, we do not know the deviation of the observed data from the model expectation, and thus, the term "outlier" should be taken with a grain of salt. It refers to "simulation outliers". Basically, the comparison tests whether an observed data point is outside the simulated range. It is strongly recommended to read the related documentation in the DHARMa package, e.g. ?DHARMa::testOutliers.
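A sketch, assuming the DHARMa package is installed (using the built-in warpbreaks data):

model <- glm(breaks ~ wool + tension, family = poisson, data = warpbreaks)
simres <- simulate_residuals(model)
check_outliers(simres, type = "binomial")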
Default thresholds are currently specified as follows:
list(
  zscore = stats::qnorm(p = 1 - 0.001 / 2),
  zscore_robust = stats::qnorm(p = 1 - 0.001 / 2),
  iqr = 1.7,
  ci = 1 - 0.001,
  eti = 1 - 0.001,
  hdi = 1 - 0.001,
  bci = 1 - 0.001,
  cook = stats::qf(0.5, ncol(x), nrow(x) - ncol(x)),
  pareto = 0.7,
  mahalanobis = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
  mahalanobis_robust = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
  mcd = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
  ics = 0.001,
  optics = 2 * ncol(x),
  lof = 0.001
)
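These defaults can be overridden per method via the threshold argument, for example:

check_outliers(
  mtcars[, c("mpg", "disp", "hp", "wt")],
  method = "mahalanobis",
  threshold = list(mahalanobis = stats::qchisq(p = 0.975, df = 4))
)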
For meta-analysis models (e.g. objects of class rma from the metafor package or metagen from package meta), studies are defined as outliers when their confidence interval lies outside the confidence interval of the pooled effect.
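A sketch, assuming the metafor package is installed (using its bundled BCG dataset):

dat <- metafor::escalc(
  measure = "RR",
  ai = tpos, bi = tneg, ci = cpos, di = cneg,
  data = metafor::dat.bcg
)
model <- metafor::rma(yi, vi, data = dat)
check_outliers(model)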
Outliers can be defined as particularly influential observations. Most methods rely on the computation of some distance metric, and observations greater than a certain threshold are considered outliers. Importantly, outlier detection methods are meant to provide information for the researcher to consider, rather than an automated procedure whose mindless application substitutes for thinking.
An example sentence for reporting the usage of the composite method could be:
"Based on a composite outlier score (see the 'check_outliers' function in the 'performance' R package; Lüdecke et al., 2021) obtained via the joint application of multiple outliers detection algorithms (Z-scores, Iglewicz, 1993; Interquartile range (IQR); Mahalanobis distance, Cabana, 2019; Robust Mahalanobis distance, Gnanadesikan and Kettenring, 1972; Minimum Covariance Determinant, Leys et al., 2018; Invariant Coordinate Selection, Archimbaud et al., 2018; OPTICS, Ankerst et al., 1999; Isolation Forest, Liu et al. 2008; and Local Outlier Factor, Breunig et al., 2000), we excluded n participants that were classified as outliers by at least half of the methods used."
Archimbaud, A., Nordhausen, K., and Ruiz-Gazen, A. (2018). ICS for multivariate outlier detection with application to quality control. Computational Statistics and Data Analysis, 128, 184-199. https://doi.org/10.1016/j.csda.2018.06.011
Gnanadesikan, R., and Kettenring, J. R. (1972). Robust estimates, residuals, and outlier detection with multiresponse data. Biometrics, 81-124.
Bollen, K. A., and Jackman, R. W. (1985). Regression diagnostics: An expository treatment of outliers and influential cases. Sociological Methods and Research, 13(4), 510-542.
Cabana, E., Lillo, R. E., and Laniado, H. (2019). Multivariate outlier detection based on a robust Mahalanobis distance with shrinkage estimators. arXiv preprint arXiv:1904.02596.
Cook, R. D. (1977). Detection of influential observation in linear regression. Technometrics, 19(1), 15-18.
Iglewicz, B., and Hoaglin, D. C. (1993). How to detect and handle outliers (Vol. 16). Asq Press.
Leys, C., Klein, O., Dominicy, Y., and Ley, C. (2018). Detecting multivariate outliers: Use a robust variant of Mahalanobis distance. Journal of Experimental Social Psychology, 74, 150-156.
Liu, F. T., Ting, K. M., and Zhou, Z. H. (2008, December). Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining (pp. 413-422). IEEE.
Lüdecke, D., Ben-Shachar, M. S., Patil, I., Waggoner, P., and Makowski, D. (2021). performance: An R package for assessment, comparison and testing of statistical models. Journal of Open Source Software, 6(60), 3139. https://doi.org/10.21105/joss.03139
Thériault, R., Ben-Shachar, M. S., Patil, I., Lüdecke, D., Wiernik, B. M., and Makowski, D. (2023). Check your outliers! An introduction to identifying statistical outliers in R with easystats. Behavior Research Methods, 1-11. https://doi.org/10.3758/s13428-024-02356-w
Rousseeuw, P. J., and Van Zomeren, B. C. (1990). Unmasking multivariate outliers and leverage points. Journal of the American Statistical association, 85(411), 633-639.
see::plot.see_check_outliers() for options to customize the plot.
Other functions to check model assumptions and assess model quality:
check_autocorrelation(), check_collinearity(), check_convergence(), check_heteroscedasticity(), check_homogeneity(), check_model(), check_overdispersion(), check_predictions(), check_singularity(), check_zeroinflation()