recipes (version 0.1.1)

discretize: Discretize Numeric Variables

Description

discretize converts a numeric vector into a factor with bins having approximately the same number of data points (based on a training set).

Usage

discretize(x, ...)

# S3 method for default
discretize(x, ...)

# S3 method for numeric
discretize(x, cuts = 4, labels = NULL, prefix = "bin", keep_na = TRUE,
  infs = TRUE, min_unique = 10, ...)

# S3 method for discretize
predict(object, newdata, ...)

step_discretize(recipe, ..., role = NA, trained = FALSE, objects = NULL, options = list())

Arguments

x

A numeric vector

...

For discretize: options to pass to stats::quantile() that should not include x or probs. For step_discretize, the dots specify one or more selector functions to choose which variables are affected by the step. See selections() for more details. For the tidy method, these are not currently used.

cuts

An integer defining how many cuts to make of the data.

labels

A character vector defining the factor levels that will be in the new factor (from smallest to largest). This should have length cuts+1 and should not include a level for missing values (see keep_na below).

prefix

A single character value to be used as a prefix for the factor levels (e.g. bin1, bin2, ...). If the string is not a valid R name, it is coerced to one.

keep_na

A logical for whether a factor level should be created to identify missing values in x.

infs

A logical indicating whether the smallest and largest cut point should be infinite.

min_unique

An integer defining the minimum sample size required for binning. If (the number of unique values)/(cuts+1) is less than min_unique, no discretization takes place.

object

An object of class discretize.

newdata

A new numeric object to be binned.

recipe

A recipe object. The step will be added to the sequence of operations for this recipe.

role

Not used by this step since no new variables are created.

trained

A logical to indicate if the quantities for preprocessing have been estimated.

objects

The discretize() objects are stored here once the recipe has been trained by prep.recipe().

options

A list of options to pass to discretize(). A default is set for the argument x. Note that using the options prefix and labels when more than one variable is being transformed might be problematic, as all variables inherit those values.
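
The min_unique rule and the prefix coercion described in the arguments above can be checked by hand. The following is a minimal base R sketch, not the package's code: the data are made up, and make.names() is assumed here as the way an invalid prefix is turned into a valid R name.

```r
# Illustrative sketch in base R; not the recipes implementation.
x <- rep(1:30, times = 3)   # hypothetical data: 90 values, 30 of them unique
cuts <- 4
min_unique <- 10

# Rule from the argument description: unique values / (cuts + 1) vs. min_unique
ratio <- length(unique(x)) / (cuts + 1)   # 30 / 5 = 6
would_discretize <- ratio >= min_unique   # FALSE, so no binning would occur
would_discretize

# Assumption: an invalid prefix is coerced in the manner of make.names()
safe_prefix <- make.names("maybe a bad idea to bin")
safe_prefix
```

With these numbers the ratio is 6, which is below the default min_unique of 10, so the rule says the variable would be left as it is.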

Value

discretize returns an object of class discretize and predict.discretize returns a factor vector. step_discretize returns an updated version of recipe with the new step added to the sequence of existing steps (if any). The tidy method returns a tibble with columns terms (the selectors or variables selected) and value (the breaks).

Details

discretize estimates the cut points from x using percentiles. For example, if cuts = 3, the function estimates the quartiles of x and uses these as the cut points. If cuts = 2, the bins are defined as being above or below the median of x.
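
As a rough illustration of percentile-based cut points, the idea can be sketched in base R with stats::quantile() and cut(). This is not the package's implementation: the simulated data, the variable names, and n_bins (a hypothetical count of equal-frequency bins, deliberately not tied to the cuts argument) are assumptions for illustration.

```r
# Illustrative sketch in base R; not the recipes implementation.
set.seed(1)
x <- rnorm(200)   # hypothetical training data

n_bins <- 4       # hypothetical: number of equal-frequency bins
breaks <- stats::quantile(x, probs = seq(0, 1, length.out = n_bins + 1))

# Open the outer bins so values outside the training range still get a bin
breaks[1] <- -Inf
breaks[length(breaks)] <- Inf

binned <- cut(x, breaks = breaks, labels = paste0("bin", seq_len(n_bins)))
table(binned)
```

Because the break points are sample quantiles, each bin receives roughly the same share of the training values.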

The predict method can then be used to turn numeric vectors into factor vectors.

If keep_na = TRUE, a suffix of "_missing" is used as a factor level (see the examples below).
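
The effect of a dedicated missing-value level can be mimicked in base R. In this sketch the break points are made up, and both the level name "bin_missing" (prefix plus "_missing") and its position among the levels are assumptions, not confirmed package behavior.

```r
# Illustrative sketch in base R; not the recipes implementation.
x_new  <- c(0.2, NA, 1.5)   # hypothetical new data with one missing value
breaks <- c(-Inf, 1, Inf)   # hypothetical cut points from a training set

binned <- as.character(cut(x_new, breaks = breaks, labels = c("bin1", "bin2")))

# Assumption: the missing level is named prefix + "_missing" and listed first
binned[is.na(x_new)] <- "bin_missing"
binned <- factor(binned, levels = c("bin_missing", "bin1", "bin2"))
table(binned)
```

The missing value ends up in its own level instead of remaining NA, so downstream models can treat missingness as a category.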

If infs = FALSE and a new value is greater than the largest value of x, a missing value will result.
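
The reason a missing value appears can be seen with base R's cut(): a value beyond the last finite break falls in no interval. The sketch below uses made-up data and plain cut() calls; it illustrates the mechanism and is not the package's code.

```r
# Illustrative sketch in base R; not the recipes implementation.
x_train <- c(1, 2, 3, 4, 5, 6, 7, 8)   # hypothetical training data
finite_breaks <- stats::quantile(x_train, probs = c(0, 0.5, 1))   # 1, 4.5, 8

new_vals <- c(5, 100)

# Finite outer breaks: 100 lies above the largest break and becomes NA
finite_result <- cut(new_vals, breaks = finite_breaks, include.lowest = TRUE)
finite_result

# Infinite outer breaks: every new value lands in some bin
open_breaks <- finite_breaks
open_breaks[c(1, length(open_breaks))] <- c(-Inf, Inf)
open_result <- cut(new_vals, breaks = open_breaks, include.lowest = TRUE)
open_result
```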

Examples

# NOT RUN {
library(recipes)
data(biomass)

biomass_tr <- biomass[biomass$dataset == "Training",]
biomass_te <- biomass[biomass$dataset == "Testing",]

median(biomass_tr$carbon)
discretize(biomass_tr$carbon, cuts = 2)
discretize(biomass_tr$carbon, cuts = 2, infs = FALSE)
discretize(biomass_tr$carbon, cuts = 2, infs = FALSE, keep_na = FALSE)
discretize(biomass_tr$carbon, cuts = 2, prefix = "maybe a bad idea to bin")

carbon_binned <- discretize(biomass_tr$carbon)
table(predict(carbon_binned, biomass_tr$carbon))

carbon_no_infs <- discretize(biomass_tr$carbon, infs = FALSE)
predict(carbon_no_infs, c(50, 100))

rec <- recipe(HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
              data = biomass_tr)
rec <- rec %>% step_discretize(carbon, hydrogen)
rec <- prep(rec, biomass_tr)
binned_te <- bake(rec, biomass_te)
table(binned_te$carbon)
# }
