UBL (version 0.0.9)

ImbC: Synthetic Imbalanced Data Set for a Multi-class Task

Description

Synthetic imbalanced data set for a multi-class task. The data set has a numeric feature ("X1"), a nominal feature ("X2") and a target class named "Class". The three classes of the problem ("normal", "rare1" and "rare2") are assigned according to the rules described below. These rules depend on the two features ("X1" and "X2").

Usage

data(ImbC)

Format

The data set has one continuous feature (X1) and one nominal feature (X2). The target class (denoted as Class) has three possible values ("normal", "rare1" and "rare2"). Classes "rare1" and "rare2" are the minority classes: examples of class "rare1" make up 1% of the data and those of class "rare2" make up 13.1%. The remaining class, "normal", is the majority class and accounts for 85.9% of the data. ImbC has 1000 examples: 10 of class "rare1", 131 of class "rare2" and 859 of class "normal".
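The percentages quoted above follow directly from the class counts. A minimal sketch in base R that redoes this arithmetic; the vector name counts is only illustrative, and the counts are typed in from the text rather than read from the data:

```r
# Class counts as stated in the Format section (typed in by hand, not read from ImbC)
counts <- c(normal = 859, rare1 = 10, rare2 = 131)

# Share of each class, in percent, rounded to one decimal place
pct <- round(100 * counts / sum(counts), 1)
print(pct)

# Ratio between the majority class and each minority class
print(round(counts[["normal"]] / counts[c("rare1", "rare2")], 1))
```

With the UBL package loaded, the same counts should be recoverable from the data itself by calling table(ImbC$Class) after data(ImbC).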

ImbC data has been simulated as follows:

- X1 \(\sim \mathbf{N}(0, 4)\)

- X2 labels "cat", "fish" and "dog" were randomly distributed, with the restriction of having frequencies of 30%, 30% and 40%, respectively.

- To obtain the target variable Class, we defined the following sets:

  • \(S_1=\{(X1, X2) : X1 > 9 \wedge (X2 \in \{"cat", "dog"\})\}\)

  • \(S_2=\{(X1, X2) : X1 > 7 \wedge X2 = "fish" \}\)

  • \(S_3=\{(X1, X2) :-1 < X1 < 0.5\}\)

  • \(S_4=\{(X1, X2) : X1 < -7 \wedge X2 = "fish"\}\)

- The following conditions define the target variable distribution of the ImbC synthetic data set:

  • Assign class label "rare1" to: a random sample of 90% of set \(S_1\) and a random sample of 40% of set \(S_2\)

  • Assign class label "rare2" to: a random sample of 80% of set \(S_3\) and a random sample of 70% of set \(S_4\)

  • Assign class label "normal" to the remaining examples.
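The simulation rules above can be sketched in base R as follows. This is an illustration under stated assumptions, not the authors' original generation script: the second parameter of N(0, 4) is taken to be the standard deviation, the seed is arbitrary, sample sizes are rounded to the nearest integer, and the helper take() and the object name sim are hypothetical. The result will therefore not reproduce ImbC row for row.

```r
set.seed(42)  # arbitrary seed (assumption), only for reproducibility
n <- 1000

# X1 ~ N(0, 4); assumption: 4 is the standard deviation
X1 <- rnorm(n, mean = 0, sd = 4)

# X2: exactly 30% "cat", 30% "fish" and 40% "dog", in random order
X2 <- sample(rep(c("cat", "fish", "dog"), times = round(n * c(0.3, 0.3, 0.4))))

# Row indices of the four sets S1..S4
S1 <- which(X1 > 9 & X2 %in% c("cat", "dog"))
S2 <- which(X1 > 7 & X2 == "fish")
S3 <- which(X1 > -1 & X1 < 0.5)
S4 <- which(X1 < -7 & X2 == "fish")

# Hypothetical helper: random sample of a fraction of an index set.
# Sampling positions with sample.int() avoids the sample(x) pitfall
# when x has length one, and also works for empty sets.
take <- function(idx, frac) {
  idx[sample.int(length(idx), size = round(frac * length(idx)))]
}

# Assign the class labels; everything not sampled stays "normal"
Class <- rep("normal", n)
Class[c(take(S1, 0.9), take(S2, 0.4))] <- "rare1"
Class[c(take(S3, 0.8), take(S4, 0.7))] <- "rare2"

sim <- data.frame(
  X1 = X1,
  X2 = factor(X2),
  Class = factor(Class, levels = c("normal", "rare1", "rare2"))
)

# Class counts of the simulated data
print(table(sim$Class))
```

The four sets are pairwise disjoint (S1 and S2 differ in X2, and the X1 ranges of S3 and S4 do not overlap with each other or with S1 and S2), so the order of the two assignments does not matter.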

Author

Paula Branco paobranco@gmail.com, Rita Ribeiro rpribeiro@dcc.fc.up.pt and Luis Torgo ltorgo@dcc.fc.up.pt

Examples

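library(UBL)      # provides the ImbC data set
library(ggplot2)
data(ImbC)
summary(ImbC)
# Jittered scatter plot of X1 against X2, colored by class
ggplot(data = ImbC, aes(x = X2, y = X1, color = Class)) + geom_jitter()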
