decision_tree: General Interface for Decision Tree Models

Description

decision_tree() generates a specification of a model before fitting, which allows the same model to be created using different packages in R or via Spark. The main arguments for the model are:

cost_complexity: The cost/complexity parameter (a.k.a. Cp) used by CART models (rpart only).
tree_depth: The maximum depth of a tree (rpart and spark only).
min_n: The minimum number of data points in a node that are required for the node to be split further.

These arguments are converted to their specific names at the time that the model is fit. Other options and arguments can be set using set_engine(). If left to their defaults here (NULL), the values are taken from the underlying model functions. If parameters need to be modified, update() can be used in lieu of recreating the object from scratch.
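
As a minimal sketch of the distinction between main arguments and engine options, the following sets two main arguments by their parsnip names and passes one extra option through set_engine(). The choice of maxsurrogate (an rpart.control() option) is only an illustrative assumption; any option accepted by the underlying fit function could be used instead.

```r
library(parsnip)

# Main arguments use the standardized parsnip names.
spec <- decision_tree(mode = "classification", tree_depth = 4, min_n = 10) %>%
  # Engine-specific options are passed through set_engine().
  # `maxsurrogate` is an rpart.control() option chosen for illustration.
  set_engine("rpart", maxsurrogate = 0)

# translate() shows the call template that would be used at fit time,
# with the main arguments converted to the engine's own names.
translate(spec)
```

Nothing is fit at this point; the specification only records what should be passed to the engine later.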

Usage

decision_tree(
  mode = "unknown",
  cost_complexity = NULL,
  tree_depth = NULL,
  min_n = NULL
)
# S3 method for decision_tree
update(
  object,
  parameters = NULL,
  cost_complexity = NULL,
  tree_depth = NULL,
  min_n = NULL,
  fresh = FALSE,
  ...
)

Arguments

mode

A single character string for the type of model. Possible values for this model are "unknown", "regression", or "classification".

cost_complexity

A positive number for the cost/complexity parameter (a.k.a. Cp) used by CART models (rpart only).

tree_depth

An integer for the maximum depth of the tree.

min_n

An integer for the minimum number of data points in a node that are required for the node to be split further.

object

A decision tree model specification.

parameters

A 1-row tibble or named list with main parameters to update. If the individual arguments are used, these will supersede the values in parameters. Also, using engine arguments in this object will result in an error.

fresh

A logical for whether the arguments should be modified in place or replaced wholesale.

...

Not used for update().
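
The effect of parameters and fresh can be illustrated with a short sketch. The object names below are hypothetical, and rlang::eval_tidy() is used only to inspect the stored argument values; that inspection relies on how the specification stores its arguments, which is an assumption about the object's internal layout.

```r
library(parsnip)

base <- decision_tree(mode = "regression", tree_depth = 5, min_n = 20)

# Default (fresh = FALSE): only the supplied argument changes;
# min_n keeps its previous value.
changed <- update(base, tree_depth = 3)

# fresh = TRUE: the arguments are replaced wholesale, so any argument
# not supplied here goes back to its default (NULL).
replaced <- update(base, tree_depth = 3, fresh = TRUE)

# Main parameters can also be supplied as a named list.
from_list <- update(base, parameters = list(min_n = 10))

# Inspect the stored values (assumes arguments live in `$args`).
rlang::eval_tidy(changed$args$min_n)
rlang::eval_tidy(replaced$args$min_n)
rlang::eval_tidy(from_list$args$min_n)
```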

Engine Details

Engines may have pre-set default arguments when executing the model fit call. For this type of model, the templates of the fit calls are shown below:

rpart

decision_tree() %>% 
  set_engine("rpart") %>% 
  set_mode("regression") %>% 
  translate()

## Decision Tree Model Specification (regression)
## 
## Computational engine: rpart 
## 
## Model fit template:
## rpart::rpart(formula = missing_arg(), data = missing_arg(), weights = missing_arg())

decision_tree() %>% 
  set_engine("rpart") %>% 
  set_mode("classification") %>% 
  translate()

## Decision Tree Model Specification (classification)
## 
## Computational engine: rpart 
## 
## Model fit template:
## rpart::rpart(formula = missing_arg(), data = missing_arg(), weights = missing_arg())

Note that rpart::rpart() does not require factor predictors to be converted to indicator variables. fit() does not affect the encoding of the predictor values (i.e., factors stay factors) for this model.

C5.0

decision_tree() %>% 
  set_engine("C5.0") %>% 
  set_mode("classification") %>% 
  translate()

## Decision Tree Model Specification (classification)
## 
## Computational engine: C5.0 
## 
## Model fit template:
## parsnip::C5.0_train(x = missing_arg(), y = missing_arg(), weights = missing_arg(), 
##     trials = 1)

Note that C50::C5.0() does not require factor predictors to be converted to indicator variables. fit() does not affect the encoding of the predictor values (i.e., factors stay factors) for this model.

spark

decision_tree() %>% 
  set_engine("spark") %>% 
  set_mode("regression") %>% 
  translate()

## Decision Tree Model Specification (regression)
## 
## Computational engine: spark 
## 
## Model fit template:
## sparklyr::ml_decision_tree_regressor(x = missing_arg(), formula = missing_arg(), 
##     seed = sample.int(10^5, 1))

decision_tree() %>% 
  set_engine("spark") %>% 
  set_mode("classification") %>% 
  translate()

## Decision Tree Model Specification (classification)
## 
## Computational engine: spark 
## 
## Model fit template:
## sparklyr::ml_decision_tree_classifier(x = missing_arg(), formula = missing_arg(), 
##     seed = sample.int(10^5, 1))

fit() does not affect the encoding of the predictor values (i.e., factors stay factors) for this model.

Parameter translations

The standardized parameter names in parsnip can be mapped to their original names in each engine that has main parameters. Each engine typically has a different default value (shown in parentheses) for each parameter.

parsnip	rpart	C5.0	spark
tree_depth	maxdepth (30)	NA	max_depth (5)
min_n	minsplit (20)	minCases (2)	min_instances_per_node (1)
cost_complexity	cp (0.01)	NA	NA
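
The mapping in the table can be checked for the rpart engine with translate(). This is a sketch only: reading the translated arguments from `$method$fit$args` relies on the internal structure of the translated object, which is an assumption and may differ between parsnip versions. The printed output of translate() is the more stable way to see the result.

```r
library(parsnip)

spec <- decision_tree(
  mode = "regression",
  tree_depth = 4,
  min_n = 15,
  cost_complexity = 0.001
) %>%
  set_engine("rpart")

translated <- translate(spec)

# Printing shows the fit template with the rpart argument names
# (maxdepth, minsplit, cp) in place of the parsnip names.
translated

# Assumed internal location of the translated arguments.
names(translated$method$fit$args)
```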

Details

The model can be fit with the fit() function using the following engines:

R: "rpart" (the default) or "C5.0" (classification only)
Spark: "spark"

Note that, for rpart models, both cost_complexity and tree_depth can be specified, but the package will give precedence to cost_complexity. Also, with rpart, tree_depth values greater than 30 will give nonsense results on 32-bit machines.
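
A minimal end-to-end sketch with the rpart engine follows, assuming the rpart package is installed. The built-in mtcars data set and the object names are chosen only for illustration.

```r
library(parsnip)

# Specify the model, choose the engine, then fit with a formula.
tree_fit <- decision_tree(mode = "regression", tree_depth = 3) %>%
  set_engine("rpart") %>%
  fit(mpg ~ ., data = mtcars)

# predict() returns a tibble with one row per row of `new_data`.
preds <- predict(tree_fit, new_data = head(mtcars))
preds
```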

Examples

# NOT RUN {
show_engines("decision_tree")

decision_tree(mode = "classification", tree_depth = 5)
# Parameters can be represented by a placeholder:
decision_tree(mode = "regression", cost_complexity = varying())
model <- decision_tree(cost_complexity = 10, min_n = 3)
model
update(model, cost_complexity = 1)
update(model, cost_complexity = 1, fresh = TRUE)
# }
