OSSClassif: One-sided selection strategy for handling multiclass imbalanced problems.

Description

This function performs an adapted one-sided selection strategy for multiclass imbalanced problems.

Usage

OSSClassif(form, dat, dist = "Euclidean", p = 2, Cl = "smaller", start = "CNN")

Value

The function returns a data frame with the new data set resulting from the application of the selected OSS strategy.

Arguments

form: A formula describing the prediction problem.
dat: A data frame containing the original imbalanced data set.
dist: A character string indicating which distance metric to use when determining the k nearest neighbors. Defaults to "Euclidean". See Details.
p: A number indicating the value of p for the "p-norm" distance. Only needs to be set when dist is "p-norm". See Details.
Cl: A character vector naming the most important classes. Defaults to "smaller", in which case the important classes are determined automatically as those with a frequency below (nr.examples)/(nr.classes).
start: A string indicating which strategy (CNN or Tomek links) is performed first. The available options are "CNN" and "Tomek". With "CNN" (the default), the CNN strategy is applied first and Tomek links are applied afterwards. With "Tomek", the order is reversed: Tomek links first, then the CNN strategy.
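
The rule behind Cl = "smaller" can be illustrated with a small base R sketch. The helper smaller_classes() is hypothetical and is not part of UBL; it only mirrors the threshold stated above, (nr.examples)/(nr.classes), and the package's internal code may differ.

```r
# Hypothetical helper (not part of UBL): names the classes whose frequency
# falls below (number of examples) / (number of classes).
smaller_classes <- function(y) {
  y <- droplevels(as.factor(y))
  counts <- table(y)
  threshold <- length(y) / nlevels(y)
  names(counts)[as.vector(counts < threshold)]
}

# Toy target: 12 examples and 3 classes, so the threshold is 12 / 3 = 4.
y <- factor(c(rep("a", 7), rep("b", 3), rep("c", 2)))
small <- smaller_classes(y)
print(small)
```

On this toy target, classes "b" and "c" fall below the threshold of 4 and would be the ones treated as important.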

Author

Paula Branco paobranco@gmail.com, Rita Ribeiro rpribeiro@dcc.fc.up.pt and Luis Torgo ltorgo@dcc.fc.up.pt

Details

dist parameter:

The parameter dist allows the user to define the distance metric to be used in the neighbors computation. Although the default is the Euclidean distance, other metrics are available. This allows the computation of distances in data sets with, for instance, both nominal and numeric features. The options available for the distance functions are as follows:

- for data with only numeric features: "Manhattan", "Euclidean", "Canberra", "Chebyshev", "p-norm";

- for data with only nominal features: "Overlap";

- for dealing with both nominal and numeric features: "HEOM", "HVDM".
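
To give an idea of how a mixed-type distance combines the two kinds of features, here is a minimal base R sketch of HEOM (Heterogeneous Euclidean-Overlap Metric) in its usual textbook form: range-normalized absolute difference for numeric features, overlap (0 or 1) for nominal ones. The function heom_dist() is hypothetical, ignores missing values, and is not the UBL implementation, which may differ in its details.

```r
# Hypothetical helper (not part of UBL): HEOM distance between rows i and j
# of a data frame that mixes numeric and factor columns.
heom_dist <- function(df, i, j) {
  per_feature <- vapply(names(df), function(col) {
    v <- df[[col]]
    if (is.numeric(v)) {
      # Numeric feature: absolute difference scaled by the feature range.
      rng <- diff(range(v))
      if (rng == 0) 0 else abs(v[i] - v[j]) / rng
    } else {
      # Nominal feature (overlap): 0 when the values match, 1 otherwise.
      as.numeric(as.character(v[i]) != as.character(v[j]))
    }
  }, numeric(1))
  sqrt(sum(per_feature^2))
}

toy <- data.frame(
  size   = c(1, 3, 5),
  colour = factor(c("red", "red", "blue"))
)
d12 <- heom_dist(toy, 1, 2)  # only size differs: |1 - 3| / 4
d13 <- heom_dist(toy, 1, 3)  # both features differ maximally
print(c(d12, d13))
```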

When "p-norm" is selected for the dist parameter, the value of parameter p must also be defined, since it sets which p-norm is used. For instance, if p is set to 1, the 1-norm (Manhattan distance) is used, and if p is set to 2, the 2-norm (Euclidean distance) is applied. For more details on the distance functions implemented in the UBL package, please see the package vignettes.
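
The relationship between p and the named distances can be checked with a short base R sketch. The helper p_norm_dist() is hypothetical and is not a UBL function; it simply writes out the standard p-norm formula.

```r
# Hypothetical helper (not part of UBL): p-norm (Minkowski) distance
# between two numeric vectors of equal length.
p_norm_dist <- function(x, y, p = 2) {
  stopifnot(length(x) == length(y), p >= 1)
  sum(abs(x - y)^p)^(1 / p)
}

a <- c(1, 2, 3)
b <- c(4, 6, 3)

d1 <- p_norm_dist(a, b, p = 1)  # coincides with the Manhattan distance
d2 <- p_norm_dist(a, b, p = 2)  # coincides with the Euclidean distance

# Cross-check against the distances computed by stats::dist.
manhattan <- as.numeric(dist(rbind(a, b), method = "manhattan"))
euclidean <- as.numeric(dist(rbind(a, b), method = "euclidean"))
print(c(d1, manhattan, d2, euclidean))
```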

References

Kubat, M. & Matwin, S. (1997). Addressing the Curse of Imbalanced Training Sets: One-Sided Selection. Proc. of the 14th Int. Conf. on Machine Learning, Morgan Kaufmann, 179-186.

Batista, G. E.; Prati, R. C. & Monard, M. C. (2004). A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter, ACM, 6, 20-29.

Examples

  if (FALSE) {
    library(UBL)
    if (requireNamespace("DMwR2", quietly = TRUE)) {
      data(algae, package = "DMwR2")
      # keep only the rows without missing values
      clean.algae <- data.frame(algae[complete.cases(algae), ])

      # important classes chosen by the user
      alg1 <- OSSClassif(season ~ ., clean.algae, dist = "HVDM",
                         Cl = c("spring", "summer"))
      alg2 <- OSSClassif(season ~ ., clean.algae, dist = "HEOM",
                         Cl = c("spring", "summer"), start = "Tomek")

      # default Cl = "smaller", with each of the two possible orders
      alg3 <- OSSClassif(season ~ ., clean.algae, dist = "HVDM", start = "CNN")
      alg4 <- OSSClassif(season ~ ., clean.algae, dist = "HVDM", start = "Tomek")

      # a single important class
      alg5 <- OSSClassif(season ~ ., clean.algae, dist = "HEOM", Cl = "winter")

      # class distribution of each resulting data set
      summary(alg1$season)
      summary(alg2$season)
      summary(alg3$season)
      summary(alg4$season)
      summary(alg5$season)
    }
  }
