OneR (version 2.2)

optbin: Optimal Binning function

Description

Discretizes all numerical data in a data frame into categorical bins whose cut points are optimally aligned with the target categories; each numeric column is thereby converted into a factor. When building a OneR model, this can result in fewer rules with enhanced accuracy.

Usage

optbin(x, ...)

# S3 method for formula
optbin(formula, data, method = c("logreg", "infogain", "naive"), na.omit = TRUE, ...)

# S3 method for data.frame
optbin(x, method = c("logreg", "infogain", "naive"), na.omit = TRUE, ...)

Arguments

x
data frame with the last column containing the target variable.
...
arguments passed to or from other methods.
formula
formula, additionally the argument data is needed.
data
data frame which contains the data, only needed when using the formula interface.
method
character string specifying the method for optimal binning, see 'Details'; can be abbreviated.
na.omit
logical value whether instances with missing values should be removed.

Value

A data frame in which the numeric columns have been replaced by factors (bins), with the target variable in the last column.

Methods (by class)

  • formula: method for formulas.

  • data.frame: method for data frames.

Details

The cut points are calculated by pairwise logistic regressions (method "logreg"), by information gain (method "infogain"), or as the means of the expected values of the respective classes (method "naive"). The function is likely to give unsatisfactory results when the distributions of the respective classes are not (linearly) separable. Method "naive" should only be used when the distributions are (approximately) normal; even then, "logreg" should give comparable results, which is why it is the preferred (and therefore default) method.
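To make the difference between "naive" and "logreg" concrete, here is a minimal base R sketch for a single predictor and two classes. It is not the OneR source code. It assumes that the "naive" cut is the midpoint of the two class means and that the "logreg" cut sits where the fitted probability equals 0.5; the names `cut_naive` and `cut_logreg` are hypothetical.

```r
# Illustrative sketch only; assumptions are stated in the text above.
# Two overlapping classes of iris and one numeric predictor.
d <- subset(iris, Species %in% c("versicolor", "virginica"))
d$Species <- droplevels(d$Species)

# "naive"-style cut point: midpoint between the two class means.
class_means <- tapply(d$Petal.Length, d$Species, mean)
cut_naive <- mean(class_means)

# "logreg"-style cut point: the x value where the fitted probability is 0.5,
# i.e. where intercept + slope * x = 0.
fit <- glm(Species ~ Petal.Length, data = d, family = binomial)
cut_logreg <- unname(-coef(fit)[1] / coef(fit)[2])

cut_naive
cut_logreg
```

When the class distributions are roughly normal with similar spread, the two values should land close to each other, which matches the remark that "logreg" gives comparable results in that case.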

Method "infogain" is an entropy based method which calculates cut points based on information gain. The idea is that uncertainty is minimized by making the resulting bins as pure as possible. This method is the standard method of many decision tree algorithms.
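The entropy idea can be sketched in base R as follows. This only illustrates a single binary split chosen by information gain; it is not the OneR implementation, which may search for cut points differently. The helpers `entropy` and `info_gain` are hypothetical names introduced here.

```r
# Illustrative sketch only; not the OneR source code.

# Shannon entropy (in bits) of a vector of class labels.
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]                 # drop empty classes to avoid log2(0)
  -sum(p * log2(p))
}

# Information gain of splitting numeric x at `cut` with respect to labels y.
info_gain <- function(x, y, cut) {
  left <- y[x <= cut]
  right <- y[x > cut]
  w_left <- length(left) / length(y)
  entropy(y) - (w_left * entropy(left) + (1 - w_left) * entropy(right))
}

x <- iris$Petal.Length
y <- iris$Species

# Candidate cuts: midpoints between consecutive distinct sorted values,
# so both sides of every candidate split are non-empty.
v <- sort(unique(x))
candidates <- (head(v, -1) + tail(v, -1)) / 2

gains <- sapply(candidates, function(cp) info_gain(x, y, cp))
best_cut <- candidates[which.max(gains)]
best_cut
```

The candidate with the highest gain yields the purest pair of bins, which is the sense in which uncertainty is minimized.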

Character strings and logical values are coerced into factors. Matrices are coerced into data frames. If the target is numeric, it is turned into a factor with as many levels as there are distinct values, and a warning is given.
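The coercions described above can be mimicked in base R as a rough sketch. This is not what optbin runs internally; the example data frame `df` and its column names are made up for illustration.

```r
# Illustrative sketch only; not the OneR source code.
# A matrix is first turned into a data frame.
m <- matrix(1:6, ncol = 2)
m_df <- as.data.frame(m)

# Hypothetical example data with character, logical and numeric columns.
df <- data.frame(
  size   = c(1.5, 2.5, 3.5, 4.5),
  colour = c("red", "blue", "red", "blue"),
  flag   = c(TRUE, FALSE, TRUE, TRUE),
  target = c(0, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Character and logical columns become factors.
to_factor <- vapply(df, function(col) is.character(col) || is.logical(col),
                    logical(1))
df[to_factor] <- lapply(df[to_factor], factor)

# A numeric target becomes a factor with one level per distinct value.
df$target <- factor(df$target)

str(df)
```

Note that a numeric target with many distinct values would produce a factor with many levels, which is presumably why a warning is issued in that case.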

When na.omit = FALSE, an additional level "NA" is added to each factor with missing values. If the target contains unused factor levels (e.g. due to subsetting), these are ignored and a warning is given.
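Both behaviours can be illustrated with base R alone. The sketch below only shows what an explicit "NA" level and dropped unused levels look like; it is not the code optbin uses, and the break at 3 is an arbitrary value chosen for the example.

```r
# Illustrative sketch only; not the OneR source code.

# A numeric vector with a missing value, binned at an arbitrary break.
x <- c(1.2, 3.4, NA, 5.6, 2.1)
binned <- cut(x, breaks = c(-Inf, 3, Inf))

# Turn the missing value into an explicit "NA" level instead of dropping it.
binned_chr <- as.character(binned)
binned_chr[is.na(binned_chr)] <- "NA"
binned_na <- factor(binned_chr, levels = c(levels(binned), "NA"))
levels(binned_na)

# Subsetting a factor keeps unused levels until they are dropped.
target <- iris$Species[iris$Species != "setosa"]
nlevels(target)              # still counts the unused "setosa" level
nlevels(droplevels(target))  # unused level removed
```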

References

https://github.com/vonjd/OneR

See Also

OneR, bin

Examples

library(OneR)

data <- iris # without optimal binning
model <- OneR(data, verbose = TRUE)
summary(model)

data_opt <- optbin(iris) # with optimal binning
model_opt <- OneR(data_opt, verbose = TRUE)
summary(model_opt)

## The same with the formula interface:
data_opt <- optbin(Species ~ ., data = iris)
model_opt <- OneR(data_opt, verbose = TRUE)
summary(model_opt)
