
dataPreparation (version 0.4.3)

remove_sd_outlier: Standard deviation outlier filtering

Description

Remove rows containing outliers, based on standard deviation thresholds. For each selected column, only values between mean - sd * n_sigmas and mean + sd * n_sigmas are kept.
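The threshold rule can be illustrated with a few lines of base R. This is only a sketch of the rule as described here, not the package's source code: the sample vector x is made up, and treating the bounds as inclusive is an assumption.

```r
# Illustration of the threshold rule only; not the package implementation.
x <- c(rep(10, 20), 100)   # twenty ordinary values and one extreme value
n_sigmas <- 3

# Bounds as described: mean - sd * n_sigmas and mean + sd * n_sigmas
lower <- mean(x) - sd(x) * n_sigmas
upper <- mean(x) + sd(x) * n_sigmas

# Keep only values inside the bounds (inclusive bounds are an assumption)
kept <- x[x >= lower & x <= upper]
length(kept)   # 20: the value 100 falls above the upper bound
```

Note that the extreme value itself inflates both the mean and the standard deviation, so with very small samples an outlier can end up inside its own bounds.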

Usage

remove_sd_outlier(dataSet, cols = "auto", n_sigmas = 3, verbose = TRUE)

Arguments

dataSet

Matrix, data.frame or data.table

cols

Names of the numeric column(s) of dataSet to filter. To filter all numeric columns, set it to "auto". (character, default to "auto")

n_sigmas

Number of standard deviations from the mean within which values are kept (integer, default to 3)

verbose

Should the algorithm talk? (logical, default to TRUE)

Value

The same data set with fewer rows, edited by reference. If you don't want to edit by reference, please provide dataSet = copy(dataSet).

Details

Filtering is done column by column: extreme values from the first element of cols are removed, then extreme values from the second element of cols, and so on. If filtering is performed on too many columns, there is a high risk that a lot of rows will be dropped.
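The cumulative effect of column-by-column filtering can be sketched in base R. The helper keep_within_sd below is hypothetical and written only for this illustration; it is not part of dataPreparation, and recomputing the mean and standard deviation on the rows that remain after each step is an assumption about how the sequence behaves.

```r
# Hypothetical helper for illustration only; not part of dataPreparation.
keep_within_sd <- function(df, col, n_sigmas = 3) {
  x <- df[[col]]
  # Keep rows whose value lies within n_sigmas standard deviations of the mean
  df[abs(x - mean(x)) <= n_sigmas * sd(x), , drop = FALSE]
}

df <- data.frame(
  a = c(rep(10, 20), 100, 10),  # row 21 is extreme in column a
  b = c(rep(5, 21), 500)        # row 22 is extreme in column b
)

# Filter one column after the other; each pass can drop different rows
for (col in c("a", "b")) {
  df <- keep_within_sd(df, col)
}
nrow(df)   # 20: one row dropped for column a, another for column b
```

Each column removes its own rows, so the losses add up as more columns are filtered.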

Examples

# NOT RUN {
# Given
library(data.table)
library(dataPreparation)
col_vals <- runif(1000)
col_mean <- mean(col_vals)
col_sd <- sd(col_vals)
extrem_val <- col_mean + 6 * col_sd
dataSet <- data.table(num_col = c(col_vals, extrem_val))

# When
dataSet <- remove_sd_outlier(dataSet, cols = "auto", n_sigmas = 3, verbose = TRUE)

# Then the extreme value is no longer in the data set
extrem_val %in% dataSet[["num_col"]] # Is FALSE
# }
