quickcode (version 1.0.6)

detect_outlier: Detect Outliers in a Numeric Vector

Description

This function identifies outliers in a numeric vector using either the interquartile range (IQR) method or the z-score method. The IQR method defines outliers as values below Q1 - multiplier * IQR or above Q3 + multiplier * IQR, where Q1 and Q3 are the first and third quartiles. The z-score method identifies outliers as values with an absolute z-score exceeding a specified threshold.
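The IQR rule described above can be sketched in a few lines of base R. This is an illustration, not the package's source: the helper name `iqr_fences` is hypothetical, and it assumes quartiles from the default `quantile()` algorithm (type 7), which `detect_outlier()` may or may not use internally.

```r
# Hypothetical helper, not part of quickcode: lower and upper IQR fences.
iqr_fences <- function(x, multiplier = 1.5) {
  # Assumption: quartiles come from quantile()'s default (type 7) algorithm.
  q <- stats::quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]  # the interquartile range
  c(lower = q[1] - multiplier * spread, upper = q[2] + multiplier * spread)
}

x <- c(1:10, 20, 40)
moderate <- iqr_fences(x, multiplier = 1.5)
extreme  <- iqr_fences(x, multiplier = 3)

# Values outside the fences are treated as outliers.
x[x < moderate[["lower"]] | x > moderate[["upper"]]]  # 20 40
x[x < extreme[["lower"]]  | x > extreme[["upper"]]]   # 40
```

Under these assumptions, widening the multiplier from 1.5 to 3 moves the upper fence from 17.5 to 25.75, so only the most extreme value remains flagged.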

Usage

detect_outlier(
  x,
  method = "iqr",
  multiplier = 1.5,
  z_threshold = 3,
  na.rm = TRUE,
  groups = NULL,
  summary = FALSE
)

iqr_outlier(x, multiplier)

zscore_outlier(x, z_threshold)

zscore_outlier2(x, z_threshold)

Value

If `groups = NULL` (default), a list with the following components:

- `outliers`: A numeric vector of the outlier values.
- `indices`: An integer vector of the indices where outliers occur in the input vector.
- `bounds` (if `method = "iqr"`): A named numeric vector with the `lower` and `upper` bounds for outliers.
- `threshold` (if `method = "zscore"`): A named numeric vector with the `lower` and `upper` z-score thresholds.
- `is_outlier`: A logical vector of the same length as `x`, where `TRUE` indicates an outlier.
- `summary` (if `summary = TRUE`): A list with summary statistics including the mean, median, standard deviation (for z-score), quartiles (for IQR), and number of outliers.

If `groups` is provided, a named list where each element corresponds to a unique group, containing the same components as above but computed separately for that group’s values.

Arguments

x

A numeric vector in which to detect outliers.

method

A character string specifying the outlier detection method. Options are `"iqr"` (default) for the interquartile range method or `"zscore"` for the z-score method.

multiplier

A positive numeric value specifying the multiplier for the IQR method. Default is `1.5`, typically used for moderate outliers; `3` is common for extreme outliers. Ignored if `method = "zscore"`.

z_threshold

A positive numeric value specifying the z-score threshold for the `method = "zscore"` option. Default is `3`, meaning values with an absolute z-score greater than 3 are flagged as outliers. Ignored if `method = "iqr"`.

na.rm

A logical value indicating whether to remove `NA` values before computation. Default is `TRUE`. If `FALSE` and `NA` values are present, the function stops with an error.

groups

An optional vector of group names or labels corresponding to each value in `x`. If provided, must be the same length as `x`. Outlier detection is performed separately for each unique group, and results are returned as a nested list. Default is `NULL` (no grouping).

summary

A logical value indicating whether to include a summary in the output. Default is `FALSE`. If `TRUE`, the output list includes a `summary` element with descriptive statistics and outlier counts, either overall or by group if `groups` is provided.
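Per-group detection, as the `groups` argument describes, amounts to splitting `x` by label and applying the rule to each piece. The following is a minimal base-R sketch under stated assumptions: `flag_iqr` is a hypothetical helper, quartiles use the default `quantile()` type, and the list returned by `detect_outlier()` is richer than the plain vectors shown here.

```r
x2 <- c(1, 2, 3, 100, 5, 6, 7, 200)
groups2 <- c("A", "A", "A", "A", "B", "B", "B", "B")

# Hypothetical helper, not part of quickcode: logical IQR flags for one vector.
flag_iqr <- function(v, multiplier = 1.5) {
  q <- stats::quantile(v, probs = c(0.25, 0.75), names = FALSE)
  spread <- q[2] - q[1]
  v < q[1] - multiplier * spread | v > q[2] + multiplier * spread
}

# split() yields one numeric vector per group label; each is checked
# against fences computed from that group's values only.
by_group <- lapply(split(x2, groups2), function(v) v[flag_iqr(v)])
by_group$A  # 100
by_group$B  # 200
```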

Details

The function returns a list containing the outliers, their indices, detection bounds or thresholds, and a logical vector indicating which elements are outliers. If a grouping vector is provided via `groups`, outlier detection is performed separately for each group, and results are returned as a nested list by group. If `na.rm = TRUE` (default), missing values (`NA`) are removed before computation. If `na.rm = FALSE` and `NA` values are present, the function stops with an error. The function also stops for non-numeric input, insufficient valid data, or mismatched group lengths.
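The error conditions listed above can be pictured as a set of pre-checks. The sketch below is hypothetical: the function name `check_input` and the error messages are invented for illustration and are not taken from the package, and it does not cover the identical-values case.

```r
# Hypothetical pre-checks mirroring the documented error conditions.
check_input <- function(x, groups = NULL, na.rm = TRUE) {
  if (!is.numeric(x)) {
    stop("x must be numeric")
  }
  if (!is.null(groups) && length(groups) != length(x)) {
    stop("groups must be the same length as x")
  }
  if (!na.rm && anyNA(x)) {
    stop("x contains NA values and na.rm = FALSE")
  }
  # Count usable (non-NA) values overall, or per group when groups is given.
  counts <- if (is.null(groups)) sum(!is.na(x)) else tapply(!is.na(x), groups, sum)
  if (any(counts < 2)) {
    stop("at least two non-NA values are required")
  }
  invisible(TRUE)
}

check_input(c(1, 2, NA, 100, 5, 6, 7, 200, 1000),
            groups = c("G1", "G1", "G1", "G1", "G2", "G2", "G2", "G2", "G1"))
```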

With `na.rm = TRUE`, the function requires at least two non-`NA` values overall (if `groups = NULL`) or per group (if `groups` is provided) to compute meaningful statistics. If all values in a group are identical or there are insufficient data points, an error is thrown for that group. The IQR method is robust to non-normal data, while the z-score method assumes approximate normality and is sensitive to extreme values.
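The sensitivity of the z-score method to extreme values can be seen with a few lines of base R. This is a sketch, not the package's code: the helper `z_flags` is hypothetical, and it assumes the sample standard deviation from `sd()`, which may differ from what `detect_outlier()` uses internally. Because an extreme value inflates both the mean and the standard deviation, it can mask itself in a short vector.

```r
# Hypothetical helper, not part of quickcode: TRUE where |z| exceeds the threshold.
z_flags <- function(x, z_threshold = 3) {
  # Assumption: z-scores use the sample standard deviation (n - 1 denominator).
  z <- (x - mean(x)) / stats::sd(x)
  abs(z) > z_threshold
}

# Short vector: the extremes inflate sd(), so nothing crosses 2.5.
y <- c(-10, 1, 2, 3, 4, 5, 20)
any(z_flags(y, z_threshold = 2.5))  # FALSE

# Longer vector: the single extreme value stands out clearly.
long <- c(rep(0, 20), 100)
which(z_flags(long))  # 21
```

With n values and the sample standard deviation, no absolute z-score can exceed (n - 1) / sqrt(n), which is about 2.27 for n = 7. Under that assumption, a threshold of 2.5 cannot flag anything in a seven-element vector, so small groups may call for a lower threshold or the IQR method.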

Examples

# Example 1: Basic IQR method without groups
x <- c(1, 2, 3, 4, 100)
detect_outlier(x)

# IQR method with summary
detect_outlier(x, summary = TRUE)

# Z-score method with custom threshold
y <- c(-10, 1, 2, 3, 4, 5, 20)
detect_outlier(y, method = "zscore", z_threshold = 2.5)

# Handling missing values
z <- c(1, 2, NA, 4, 5, 100)
detect_outlier(z, method = "iqr", multiplier = 3)

# Example 2: IQR method with groups
x2 <- c(1, 2, 3, 100, 5, 6, 7, 200)
groups2 <- c("A", "A", "A", "A", "B", "B", "B", "B")
detect_outlier(x2, groups = groups2)

# Example 3: Z-score method with groups and summary
x3 <- c(-10, 1, 2, 20, 3, 4, 5, 50)
groups3 <- c("X", "X", "X", "X", "Y", "Y", "Y", "Y")
detect_outlier(x3, method = "zscore", z_threshold = 2, groups = groups3, summary = TRUE)

# Example 4: IQR method with groups and NA values
x4 <- c(1, 2, NA, 100, 5, 6, 7, 200, 1000)
groups4 <- c("G1", "G1", "G1", "G1", "G2", "G2", "G2", "G2", "G1")
detect_outlier(x4, groups = groups4)

# Error cases
if (FALSE) {
  detect_outlier(c("a", "b"))  # Non-numeric input
  detect_outlier(c(1), groups = c("A"))  # Insufficient data
  detect_outlier(c(1, 2), groups = c("A"))  # Mismatched group length
  detect_outlier(c(1, NA, 3), na.rm = FALSE)  # NA with na.rm = FALSE
}
