auto_grouping: Reduce cardinality in categorical variable by automatic grouping

Description

Reduces the cardinality of a categorical input variable by grouping its categories according to a target variable (currently binary only). The grouping is based on the accuracy and representativity of each category, for both the input and the target variable, and a cluster model is used to create the new groups. Full documentation can be found at: https://livebook.datascienceheroes.com/data-preparation.html#high_cardinality_predictive_modeling
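
The idea can be illustrated with a minimal base-R sketch. This is not the package's implementation: it assumes that "accuracy" can be approximated by each category's target rate and "representativity" by each category's share of rows. The toy data frame `toy`, its values, and the column names `target_rate`, `share` and `group` are hypothetical.

```r
# Toy data: 6 countries with a binary target (hypothetical values)
toy <- data.frame(
  country = rep(c("A", "B", "C", "D", "E", "F"), times = c(50, 40, 5, 3, 1, 1)),
  has_flu = c(rep(c(1, 0), c(5, 45)),  # A: 5 positives out of 50
              rep(c(1, 0), c(4, 36)),  # B: 4 positives out of 40
              rep(c(1, 0), c(3, 2)),   # C: 3 positives out of 5
              rep(c(1, 0), c(2, 1)),   # D: 2 positives out of 3
              1,                       # E: single positive row
              0)                       # F: single negative row
)

# One row per category: target rate and share of total rows
counts <- table(toy$country)
profile <- data.frame(
  country     = names(counts),
  target_rate = as.numeric(tapply(toy$has_flu, toy$country, mean)),
  share       = as.numeric(counts) / nrow(toy)
)

# Cluster the categories on the standardized profile (assumed features)
set.seed(999)
fit <- kmeans(scale(profile[, c("target_rate", "share")]), centers = 3, nstart = 10)
profile$group <- paste0("group_", fit$cluster)
profile
```

Categories with a similar target rate and a similar share of rows tend to land in the same group, which is the kind of regrouping the function automates.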

Usage

auto_grouping(data, input, target, n_groups, model = "kmeans", seed = 999)

Arguments

data

source data frame

input

string with the name of the categorical variable whose categories will be grouped

target

string with the name of the target variable (binary) used to optimize the regrouping

n_groups

number of groups to create for the new category derived from input; normally between 3 and 10.

model

clustering model used to create the groups. Supported models: "kmeans" (default) or "hclust" (hierarchical clustering).

seed

optional; random seed used internally by k-means. Changing this value will change the model.
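
To show why a seed matters for k-means, here is a small base-R sketch of the two clustering approaches named in the model argument. It does not call auto_grouping; the `profile` data frame and its values are hypothetical, and how the package prepares its clustering input is not shown here.

```r
# Hypothetical per-category profile: target rate and share of rows
profile <- data.frame(
  target_rate = c(0.10, 0.10, 0.60, 0.67, 1.00, 0.00),
  share       = c(0.50, 0.40, 0.05, 0.03, 0.01, 0.01),
  row.names   = c("A", "B", "C", "D", "E", "F")
)
x <- scale(profile)

# k-means starts from randomly chosen centers, so fix the seed to reproduce a run
set.seed(999)
km_1 <- kmeans(x, centers = 3)$cluster
set.seed(999)
km_2 <- kmeans(x, centers = 3)$cluster
identical(km_1, km_2)  # same seed, same assignment

# Hierarchical clustering has no random start; cut the tree into 3 groups
hc <- cutree(hclust(dist(x)), k = 3)
hc
```

With a different seed, k-means may start from different centers and can end up with a different grouping, whereas the hierarchical result depends only on the data and the linkage method.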

Value

A list containing 3 elements: recateg_results, which contains the description of the target variable for the new groups; df_equivalence, a data frame containing each input category and its new category; and fit_cluster, the cluster model used to do the regrouping.
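
A short sketch of how the returned list might be used, based on the element names above. It assumes the funModeling package (which provides auto_grouping and the data_country sample data) is installed, and that df_equivalence keeps the input column under its original name, "country", so it can serve as a join key. The names `res` and `data_country_2` are illustrative.

```r
library(funModeling)  # provides auto_grouping() and the data_country sample data

res <- auto_grouping(data = data_country, input = "country",
                     target = "has_flu", n_groups = 8)

names(res)                # the three elements of the returned list
res$recateg_results       # description of has_flu for each new group
head(res$df_equivalence)  # original country and its new category

# Attach the new category to the original data.
# Assumption: df_equivalence contains a column named "country".
data_country_2 <- merge(data_country, res$df_equivalence, by = "country")
```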

Examples

# NOT RUN {
# Reduce the number of country categories based on the has_flu variable
auto_grouping(data = data_country, input = "country", target = "has_flu", n_groups = 8)
# }