optbin: Optimal Binning function

Description

Discretizes all numerical data in a data frame into categorical bins whose cut points are optimally aligned with the target categories; the binned columns are returned as factors. When building a OneR model this can result in fewer rules with enhanced accuracy.

Usage

optbin(x, ...)
# S3 method for formula
optbin(formula, data, method = c("logreg", "infogain",
  "naive"), na.omit = TRUE, ...)
# S3 method for data.frame
optbin(x, method = c("logreg", "infogain", "naive"),
  na.omit = TRUE, ...)

Arguments

x

data frame with the last column containing the target variable.

...

arguments passed to or from other methods.

formula

formula; the argument data is additionally required.

data

data frame which contains the data, only needed when using the formula interface.

method

character string specifying the method for optimal binning, see 'Details'; can be abbreviated.

na.omit

logical value indicating whether instances with missing values should be removed.

Value

A data frame with the target variable being in the last column.

Methods (by class)

formula: method for formulas.
data.frame: method for data frames.

Details

The cut points are calculated by pairwise logistic regressions (method "logreg"), information gain (method "infogain") or as the means of the expected values of the respective classes (method "naive"). The function is likely to give unsatisfactory results when the distributions of the respective classes are not (linearly) separable. Method "naive" should only be used when the distributions are (approximately) normal, although in this case "logreg" should give comparable results, so "logreg" is the preferable (and therefore default) method.
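To make the "naive" idea concrete, the following is a minimal base R sketch, not the package's internal code. It assumes that "means of the expected values" can be read as the midpoints between neighbouring class means; the helper name naive_cuts is hypothetical.

```r
# Illustrative sketch only: NOT the OneR implementation.
# Assumption: a cut point is the midpoint between two neighbouring class means.
naive_cuts <- function(x, target) {
  class_means <- sort(tapply(x, target, mean))  # mean of x within each class
  # Midpoint between each pair of neighbouring class means
  unname((head(class_means, -1) + tail(class_means, -1)) / 2)
}

cuts <- naive_cuts(iris$Petal.Length, iris$Species)

# Apply the cut points to obtain a factor with one bin per class
binned <- cut(iris$Petal.Length, breaks = c(-Inf, cuts, Inf))
table(binned, iris$Species)
```

With k classes this yields k - 1 cut points, so each bin is centred on one class mean, which is why the approach only makes sense for roughly symmetric, well-separated class distributions.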

Method "infogain" is an entropy-based method which calculates cut points based on information gain. The idea is that uncertainty is minimized by making the resulting bins as pure as possible. This is the standard method of many decision tree algorithms.
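The information gain criterion can be sketched in base R as follows. This is a simplified illustration that searches for a single binary cut point; the helper names entropy, info_gain and best_cut are hypothetical, and the package's own search (which produces several cut points) may differ.

```r
# Illustrative sketch only: NOT the OneR implementation.
# Shannon entropy (in bits) of a class vector
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]                     # drop empty classes to avoid log2(0)
  -sum(p * log2(p))
}

# Information gain of splitting x at `cut` with respect to classes y
info_gain <- function(x, y, cut) {
  left  <- y[x <= cut]
  right <- y[x > cut]
  entropy(y) -
    (length(left)  / length(y)) * entropy(left) -
    (length(right) / length(y)) * entropy(right)
}

# Try every midpoint between neighbouring distinct values; keep the best one
best_cut <- function(x, y) {
  v <- sort(unique(x))
  candidates <- (head(v, -1) + tail(v, -1)) / 2
  gains <- vapply(candidates, function(cp) info_gain(x, y, cp), numeric(1))
  candidates[which.max(gains)]
}

best <- best_cut(iris$Petal.Length, iris$Species)
best
```

A pure bin has entropy 0, so the gain is largest for the cut that leaves the least class mixing on both sides.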

Character strings and logicals are coerced into factors. Matrices are coerced into data frames. If the target is numeric it is turned into a factor with the number of levels equal to the number of distinct values, and a warning is given.

When "na.omit = FALSE" an additional level "NA" is added to each factor with missing values. If the target contains unused factor levels (e.g. due to subsetting) these are ignored and a warning is given.
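The two situations described above can be reproduced with base R alone. This sketch only shows what an extra missing-value level and unused factor levels look like; whether optbin uses addNA or droplevels internally is an assumption.

```r
# Illustrative sketch only, using base R helpers (assumed, not confirmed,
# to resemble what optbin does internally).

# A factor with a missing value gains an extra level for NA
x <- factor(c("low", "high", NA, "low"))
x_na <- addNA(x)
levels(x_na)

# Subsetting keeps unused levels in the target ...
sub <- iris[iris$Species != "setosa", ]
nlevels(sub$Species)

# ... which can be dropped explicitly before binning
sub$Species <- droplevels(sub$Species)
nlevels(sub$Species)
```

Dropping unused levels yourself before calling optbin should avoid the corresponding warning.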

References

https://github.com/vonjd/OneR

Examples

library(OneR)

## Without optimal binning
data <- iris
model <- OneR(data, verbose = TRUE)
summary(model)

## With optimal binning
data_opt <- optbin(iris)
model_opt <- OneR(data_opt, verbose = TRUE)
summary(model_opt)

## The same with the formula interface:
data_opt <- optbin(Species ~ ., data = iris)
model_opt <- OneR(data_opt, verbose = TRUE)
summary(model_opt)
