
dataPreparation (version 0.4.3)

remove_rare_categorical: Filter rare categoricals

Description

Filter out rows that contain rare values in categorical columns.

Usage

remove_rare_categorical(
  dataSet,
  cols = "auto",
  threshold = 0.01,
  verbose = TRUE
)

Arguments

dataSet

Matrix, data.frame or data.table

cols

Name(s) of the column(s) of dataSet to transform. To transform all columns, set it to "auto". (character, default to "auto")

threshold

Share of occurrences below which a value is considered rare; rows holding such a value are removed (numeric, default to 0.01)

verbose

Should the algorithm talk? (logical, default to TRUE)

Value

Same dataset with fewer rows, edited by reference. If you don't want to edit by reference, please set dataSet = copy(dataSet).
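The sketch below illustrates what "by reference" means for a data.table in general, and why copy() protects the original object. It uses only data.table itself; the names original, alias and independent are made up for the illustration, and the := update merely stands in for any by-reference modification.

```r
library(data.table)

# Illustrative objects (names are assumptions, not part of the package)
original <- data.table(cat_col = c("A", "A", "B"))

alias <- original              # plain assignment: both names point to the same table
independent <- copy(original)  # copy(): a separate table that can change on its own

# A by-reference update through the alias also shows up in `original`
alias[, flag := TRUE]

"flag" %in% names(original)     # TRUE: the original was modified
"flag" %in% names(independent)  # FALSE: the copy was left untouched
```

Following the same idea, passing copy(dataSet) to remove_rare_categorical and assigning the result to a new name should leave dataSet itself as it was.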

Details

Filtering is done column by column: rare values from the first element of cols are removed, then rare values from the second element of cols, and so on. If filtering is performed on too many columns, there is a high risk that a lot of rows will be dropped.
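A minimal sketch of this sequential behaviour, written with data.table only. It is not the package's implementation: filter_rare_sketch is a hypothetical helper, and it assumes that shares are recomputed on the rows remaining after each column and that values with a share strictly below threshold are dropped.

```r
library(data.table)

# Hypothetical helper, NOT the dataPreparation implementation.
# Assumption: a value is rare when its share of the remaining rows is below `threshold`.
filter_rare_sketch <- function(dt, cols, threshold = 0.01) {
  for (col in cols) {
    values <- dt[[col]]
    # Share of each distinct value among the rows still present (NA is not counted)
    shares <- table(values) / length(values)
    keep_values <- names(shares)[shares >= threshold]
    # Rows whose value is rare (or NA) are dropped before the next column is processed
    keep_rows <- values %in% keep_values
    dt <- dt[keep_rows]
  }
  dt
}

# 100 rows, with one rare value in each column, located on different rows
dt <- data.table(
  col_a = c(rep("A", 98), "rare_a", "A"),
  col_b = c("rare_b", rep("B", 99))
)

filtered <- filter_rare_sketch(dt, cols = c("col_a", "col_b"), threshold = 0.05)
nrow(filtered)  # 98: each column removed one row
```

Because each column removes its own rows, the losses add up, which is why filtering on many columns can shrink the dataset quickly.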

Examples

# NOT RUN {
# Given a set with rare "C"
library(data.table)
dataSet <- data.table(cat_col = c(sample(c("A", "B"), 1000, replace=TRUE), "C"))

# When calling function
dataSet <- remove_rare_categorical(dataSet, cols = "cat_col",  
                                   threshold = 0.01, verbose = TRUE)
                                   
# Then there are no "C"
unique(dataSet[["cat_col"]])
# }
