themis

themis contains extra steps for the recipes package for dealing with unbalanced data. The package is named after Themis, the ancient Greek goddess who is typically depicted holding a balance.

Installation

You can install the released version of themis from CRAN with:

install.packages("themis")

Install the development version from GitHub with:

# install.packages("remotes")
remotes::install_github("tidymodels/themis")

Example

The following is an example of using the SMOTE algorithm to deal with unbalanced data:

library(recipes)
library(modeldata)
library(themis)

data("credit_data")

credit_data0 <- credit_data %>%
  filter(!is.na(Job))

count(credit_data0, Job)
#>         Job    n
#> 1     fixed 2805
#> 2 freelance 1024
#> 3    others  171
#> 4   partime  452

ds_rec <- recipe(Job ~ Time + Age + Expenses, data = credit_data0) %>%
  step_impute_mean(all_predictors()) %>%
  step_smote(Job, over_ratio = 0.25) %>%
  prep()

ds_rec %>%
  bake(new_data = NULL) %>%
  count(Job)
#> # A tibble: 4 × 2
#>   Job           n
#>   <fct>     <int>
#> 1 fixed      2805
#> 2 freelance  1024
#> 3 others      701
#> 4 partime     701
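
For intuition about what SMOTE does to a minority class, here is a simplified base R sketch of the core idea: pick a minority point, pick one of its nearest minority neighbours, and place a synthetic point somewhere on the line segment between them. The helper smote_sketch() and its arguments are hypothetical names introduced for illustration. This is not the themis implementation, and it handles numeric predictors for a single class only.

```r
# Simplified SMOTE-style interpolation for ONE minority class.
# `smote_sketch()` is a hypothetical helper, not part of themis.
smote_sketch <- function(x, n_new, k = 5) {
  x <- as.matrix(x)
  stopifnot(nrow(x) > k)
  d <- as.matrix(dist(x))   # pairwise Euclidean distances
  diag(d) <- Inf            # a point is not its own neighbour
  synthetic <- matrix(NA_real_, nrow = n_new, ncol = ncol(x),
                      dimnames = list(NULL, colnames(x)))
  for (i in seq_len(n_new)) {
    base <- sample(nrow(x), 1)           # random minority point
    nn <- order(d[base, ])[seq_len(k)]   # its k nearest minority neighbours
    mate <- nn[sample.int(k, 1)]         # one neighbour chosen at random
    gap <- runif(1)                      # position along the segment, in [0, 1]
    synthetic[i, ] <- x[base, ] + gap * (x[mate, ] - x[base, ])
  }
  synthetic
}

set.seed(1)
minority <- cbind(a = rnorm(20), b = rnorm(20))
new_points <- smote_sketch(minority, n_new = 30)
dim(new_points)
```

Because every synthetic point is a convex combination of two existing minority points, the new rows stay within the range already spanned by the minority class.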

Methods

Below is some unbalanced data, used for the examples later.

example_data <- data.frame(class = letters[rep(1:5, 1:5 * 10)],
                           x = rnorm(150))

library(ggplot2)

example_data %>%
  ggplot(aes(class)) +
  geom_bar()

Upsample / Over-sampling

The following methods all share the tuning parameter over_ratio, which is the ratio of the minority-to-majority frequencies.

| Name | Function | Multi-class |
|------|----------|-------------|
| Random minority over-sampling with replacement | step_upsample() | ✔ |
| Synthetic Minority Over-sampling Technique | step_smote() | ✔ |
| Borderline SMOTE-1 | step_bsmote(method = 1) | ✔ |
| Borderline SMOTE-2 | step_bsmote(method = 2) | ✔ |
| Adaptive synthetic sampling approach for imbalanced learning | step_adasyn() | ✔ |
| Generation of synthetic data by Randomly Over Sampling Examples | step_rose() | |

By setting over_ratio = 1, you bring the number of samples in every minority class up to 100% of the majority class.

recipe(~., example_data) %>%
  step_upsample(class, over_ratio = 1) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  ggplot(aes(class)) +
  geom_bar()

By setting over_ratio = 0.5, any minority class with fewer samples than 50% of the majority class is upsampled to 50% of the majority class.

recipe(~., example_data) %>%
  step_upsample(class, over_ratio = 0.5) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  ggplot(aes(class)) +
  geom_bar()
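
The effect of over_ratio on the class counts can be worked out by hand. The sketch below uses a hypothetical helper, upsample_targets(), which is not part of themis. It assumes the target size is the majority count times over_ratio, rounded down, which is consistent with the 701 rows in the SMOTE example (2805 × 0.25 = 701.25).

```r
# Hypothetical helper (not part of themis): class counts after up-sampling.
upsample_targets <- function(counts, over_ratio) {
  target <- floor(max(counts) * over_ratio)  # assumed rounding rule
  pmax(counts, target)                       # classes above the target are untouched
}

# Class sizes of example_data: 10, 20, 30, 40 and 50 rows.
counts <- c(a = 10, b = 20, c = 30, d = 40, e = 50)

upsample_targets(counts, over_ratio = 1)    # every class ends at 50
upsample_targets(counts, over_ratio = 0.5)  # a and b rise to 25; c, d, e unchanged
```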

Downsample / Under-sampling

Most of the following methods share the tuning parameter under_ratio, which is the ratio of the majority-to-minority frequencies.

| Name | Function | Multi-class | under_ratio |
|------|----------|-------------|-------------|
| Random majority under-sampling with replacement | step_downsample() | ✔ | ✔ |
| NearMiss-1 | step_nearmiss() | ✔ | ✔ |
| Extraction of majority-minority Tomek links | step_tomek() | | |

By setting under_ratio = 1, you bring the number of samples in every majority class down to 100% of the minority class.

recipe(~., example_data) %>%
  step_downsample(class, under_ratio = 1) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  ggplot(aes(class)) +
  geom_bar()

By setting under_ratio = 2, any majority class with more than 200% of the samples of the minority class is downsampled to 200% of the minority class.

recipe(~., example_data) %>%
  step_downsample(class, under_ratio = 2) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  ggplot(aes(class)) +
  geom_bar()
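
The same arithmetic applies in the other direction. In this sketch, downsample_targets() is a hypothetical helper, not part of themis, and it assumes the cap is the minority count times under_ratio, rounded down.

```r
# Hypothetical helper (not part of themis): class counts after down-sampling.
downsample_targets <- function(counts, under_ratio) {
  cap <- floor(min(counts) * under_ratio)  # assumed rounding rule
  pmin(counts, cap)                        # classes below the cap are untouched
}

# Class sizes of example_data: 10, 20, 30, 40 and 50 rows.
counts <- c(a = 10, b = 20, c = 30, d = 40, e = 50)

downsample_targets(counts, under_ratio = 1)  # every class ends at 10
downsample_targets(counts, under_ratio = 2)  # a stays at 10; b to e end at 20
```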

Contributing

This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Version: 1.0.0
License: MIT + file LICENSE

Last Published: July 2nd, 2022

Functions in themis (1.0.0)

required_pkgs.step_adasyn: S3 methods for tracking which additional packages are needed for steps.
smotenc: SMOTENC Algorithm
nearmiss: Remove Points Near Other Classes
step_bsmote: Apply borderline-SMOTE Algorithm
smote: SMOTE Algorithm
adasyn: Adaptive Synthetic Algorithm
bsmote: borderline-SMOTE Algorithm
step_adasyn: Apply Adaptive Synthetic Algorithm
reexports: Objects exported from other packages
circle_example: Synthetic Dataset With a Circle
tidy.step_adasyn: Tidy the Result of a Recipe
themis-package: themis: Extra Recipes Steps for Dealing with Unbalanced Data
step_upsample: Up-Sample a Data Set Based on a Factor Variable
step_downsample: Down-Sample a Data Set Based on a Factor Variable
step_smotenc: Apply SMOTENC algorithm
step_rose: Apply ROSE Algorithm
tomek: Remove Tomek's links
step_nearmiss: Remove Points Near Other Classes
step_tomek: Remove Tomek’s Links
step_smote: Apply SMOTE Algorithm
tunable.step_adasyn: tunable methods for themis