simple_bin: Discretize variables in your training and test datasets
Description
Applies simple equal-width or equal-height (equal-frequency) binning to
columns of a training dataset, and then optionally bins the same columns of a
test set using the same cutpoints.
Usage
simple_bin(train, test = NULL, exclude_vars = NULL, include_vars = NULL, bins, type = "height", na_include = TRUE)
Arguments
train
training set
test
test set
exclude_vars
variables to exclude (e.g. the target, or the row ID)
include_vars
if you only want certain variables binned, you may specify them
directly instead of excluding all other variables
bins
single number specifying the number of bins to create on each variable,
or a named list specifying cut-points for each variable
type
if bins is given as a number, this determines whether to create bins
containing an equal number of observations ("height") or bins of equal
width ("width")
na_include
logical. Give missing values their own bin?
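The difference between the two values of type can be illustrated with a small base-R sketch. This is not the implementation of simple_bin; the vector x, the variable names, and the use of quantile() and seq() to derive cutpoints are assumptions made for illustration only.

```r
# Illustrative base-R sketch; not the simple_bin() implementation.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
n_bins <- 4

# Equal-width: split the observed range into n_bins intervals of equal length.
width_breaks <- seq(min(x), max(x), length.out = n_bins + 1)

# Equal-height: place cutpoints at evenly spaced quantiles so that each bin
# holds roughly the same number of observations. unique() guards against
# duplicated cutpoints when the data contain many ties.
height_breaks <- unique(
  quantile(x, probs = seq(0, 1, length.out = n_bins + 1), names = FALSE)
)

width_bins  <- cut(x, breaks = width_breaks,  include.lowest = TRUE)
height_bins <- cut(x, breaks = height_breaks, include.lowest = TRUE)

table(width_bins)   # the outlier pushes nearly all values into the first bin
table(height_bins)  # counts are roughly balanced across bins
```

With skewed data such as this, equal-width bins tend to be dominated by outliers, which is one reason equal-height binning may be the more useful choice in practice.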
Value
If test is not NULL, a list containing two tbl_df objects (the binned
training and test sets), with the selected columns replaced by their binned
values and all other columns unchanged.
If test is NULL, only the binned training set is returned.
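The idea of reusing training cutpoints on a test set can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the code of simple_bin: the data frames, the column names id and x, and the choice to widen the outer cutpoints to -Inf and Inf are all hypothetical.

```r
# Illustrative base-R sketch; not the simple_bin() implementation.
train <- data.frame(id = 1:6, x = c(1, 2, 3, 4, 5, 6))
test  <- data.frame(id = 7:9, x = c(0, 3.5, 10))

# Cutpoints are derived from the training column only.
breaks <- quantile(train$x, probs = c(0, 0.5, 1), names = FALSE)

# Assumption of this sketch: widen the outer edges so that test values
# outside the training range still fall into the first or last bin.
breaks[1] <- -Inf
breaks[length(breaks)] <- Inf

# Replace x by its binned version in both sets; id is left unchanged.
binned <- list(
  train = transform(train, x = cut(x, breaks = breaks)),
  test  = transform(test,  x = cut(x, breaks = breaks))
)

str(binned)
```

Because both columns are cut with the same breaks, the resulting factors share identical levels, which keeps the training and test sets compatible for downstream modelling.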
Details
This function was built as a convenience, to automate the process of binning
continuous variables into discrete levels, and also to provide a simple,
interpretable, unambiguous method of dealing with missing values in data
science problems.
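One way to give missing values their own bin, as na_include = TRUE describes, is shown in the base-R sketch below. It is only an illustration of the idea: the vector x, the cutpoints, and the "Missing" label are assumptions, and simple_bin may handle and label missing values differently.

```r
# Illustrative base-R sketch; not the simple_bin() implementation.
x <- c(2.5, NA, 7.1, 4.0, NA)

# cut() leaves missing inputs as NA in the resulting factor.
binned <- cut(x, breaks = c(0, 5, 10))

# addNA() turns NA into an explicit factor level, so it is counted like
# any other bin.
with_na_bin <- addNA(binned)

# Give the NA level a readable label (the label is a hypothetical choice).
levels(with_na_bin)[is.na(levels(with_na_bin))] <- "Missing"

table(with_na_bin)
```

Treating missingness as a level of its own avoids imputation and lets a model learn whether the absence of a value is itself informative.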