rand_forest: General Interface for Random Forest Models

Description

rand_forest() generates a specification of a model before fitting, so that the same model can be created using different packages in R or via Spark. The main arguments for the model are:

mtry: The number of predictors that will be randomly sampled at each split when creating the tree models.
trees: The number of trees contained in the ensemble.
min_n: The minimum number of data points in a node that are required for the node to be split further.

These arguments are converted to their engine-specific names at the time that the model is fit. Other options and arguments can be set using set_engine(). If left to their defaults here (NULL), the values are taken from the underlying model functions. If parameters need to be modified, update() can be used in lieu of recreating the object from scratch.
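As a minimal sketch of this workflow, the snippet below sets main arguments in rand_forest(), passes an engine-specific option through set_engine(), and then changes one parameter with update(). It assumes the parsnip package is installed; importance = "impurity" is a ranger option chosen only for illustration, and the parameter values are arbitrary.

```r
library(parsnip)

# Main arguments use the standardized parsnip names.
spec <- rand_forest(mode = "regression", trees = 500, min_n = 5)

# Engine-specific options go through set_engine(); `importance` is a
# ranger argument used here purely as an example.
spec <- set_engine(spec, "ranger", importance = "impurity")

# Change one main argument without rebuilding the specification.
updated <- update(spec, trees = 1000)

# translate() shows how the standardized names map to the engine's names.
translate(updated)
```

Printing the translated specification is a convenient way to check which engine arguments will actually be passed before any data is involved.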

Usage

rand_forest(mode = "unknown", mtry = NULL, trees = NULL, min_n = NULL)
# S3 method for rand_forest
update(
  object,
  parameters = NULL,
  mtry = NULL,
  trees = NULL,
  min_n = NULL,
  fresh = FALSE,
  ...
)

Arguments

mode

A single character string for the type of model. Possible values for this model are "unknown", "regression", or "classification".

mtry

An integer for the number of predictors that will be randomly sampled at each split when creating the tree models.

trees

An integer for the number of trees contained in the ensemble.

min_n

An integer for the minimum number of data points in a node that are required for the node to be split further.

object

A random forest model specification.

parameters

A 1-row tibble or named list with main parameters to update. If the individual arguments are used, these will supersede the values in parameters. Also, using engine arguments in this object will result in an error.

fresh

A logical for whether the arguments should be modified in place or replaced wholesale.

...

Not used for update().

Engine Details

Engines may have pre-set default arguments when executing the model fit call. For this type of model, the templates of the fit calls are shown below:

ranger

rand_forest() %>% 
  set_engine("ranger") %>% 
  set_mode("regression") %>% 
  translate()

## Random Forest Model Specification (regression)
## 
## Computational engine: ranger 
## 
## Model fit template:
## ranger::ranger(formula = missing_arg(), data = missing_arg(), 
##     case.weights = missing_arg(), num.threads = 1, verbose = FALSE, 
##     seed = sample.int(10^5, 1))

rand_forest() %>% 
  set_engine("ranger") %>% 
  set_mode("classification") %>% 
  translate()

## Random Forest Model Specification (classification)
## 
## Computational engine: ranger 
## 
## Model fit template:
## ranger::ranger(formula = missing_arg(), data = missing_arg(), 
##     case.weights = missing_arg(), num.threads = 1, verbose = FALSE, 
##     seed = sample.int(10^5, 1), probability = TRUE)

Note that ranger::ranger() does not require factor predictors to be converted to indicator variables. fit() does not affect the encoding of the predictor values (i.e., factors stay factors) for this model.

For ranger confidence intervals, the intervals are constructed using the form estimate +/- z * std_error. For classification probabilities, these values can fall outside of [0, 1] and will be coerced to be in this range.
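The interval form above can be illustrated in base R. This is only a sketch: the estimates and standard errors are made-up numbers, z is assumed to be the standard normal quantile for a 95% level, and the clipping with pmax()/pmin() stands in for whatever coercion parsnip performs internally.

```r
# Hypothetical class-probability estimates and their standard errors.
estimate  <- c(0.02, 0.50, 0.97)
std_error <- c(0.03, 0.05, 0.04)

# Assumed: z is the normal quantile for a two-sided 95% interval.
level <- 0.95
z <- qnorm(1 - (1 - level) / 2)

# Intervals of the form estimate +/- z * std_error.
lower <- estimate - z * std_error
upper <- estimate + z * std_error

# Probabilities must lie in [0, 1], so out-of-range bounds are clipped.
lower <- pmax(lower, 0)
upper <- pmin(upper, 1)

data.frame(estimate, lower, upper)
```

In this example the first lower bound and the last upper bound would otherwise fall outside [0, 1], so they are set to 0 and 1 respectively.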

randomForest

rand_forest() %>% 
  set_engine("randomForest") %>% 
  set_mode("regression") %>% 
  translate()

## Random Forest Model Specification (regression)
## 
## Computational engine: randomForest 
## 
## Model fit template:
## randomForest::randomForest(x = missing_arg(), y = missing_arg())

rand_forest() %>% 
  set_engine("randomForest") %>% 
  set_mode("classification") %>% 
  translate()

## Random Forest Model Specification (classification)
## 
## Computational engine: randomForest 
## 
## Model fit template:
## randomForest::randomForest(x = missing_arg(), y = missing_arg())

Note that randomForest::randomForest() does not require factor predictors to be converted to indicator variables. fit() does not affect the encoding of the predictor values (i.e., factors stay factors) for this model.

spark

rand_forest() %>% 
  set_engine("spark") %>% 
  set_mode("regression") %>% 
  translate()

## Random Forest Model Specification (regression)
## 
## Computational engine: spark 
## 
## Model fit template:
## sparklyr::ml_random_forest(x = missing_arg(), formula = missing_arg(), 
##     type = "regression", seed = sample.int(10^5, 1))

rand_forest() %>% 
  set_engine("spark") %>% 
  set_mode("classification") %>% 
  translate()

## Random Forest Model Specification (classification)
## 
## Computational engine: spark 
## 
## Model fit template:
## sparklyr::ml_random_forest(x = missing_arg(), formula = missing_arg(), 
##     type = "classification", seed = sample.int(10^5, 1))

fit() does not affect the encoding of the predictor values (i.e., factors stay factors) for this model.

Parameter translations

The standardized parameter names in parsnip can be mapped to their original names in each engine that has main parameters. Each engine typically has a different default value (shown in parentheses) for each parameter.

parsnip	ranger	randomForest	spark
mtry	mtry (see below)	mtry (see below)	feature_subset_strategy (see below)
trees	num.trees (500)	ntree (500)	num_trees (20)
min_n	min.node.size (see below)	nodesize (see below)	min_instances_per_node (1)

For randomForest and spark, the default mtry is the square root of the number of predictors for classification, and one-third of the predictors for regression.
For ranger, the default mtry is the square root of the number of predictors.
The default min_n for both ranger and randomForest is 1 for classification and 5 for regression.
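To make the default rules for the two R engines concrete, here is a small base R sketch. The helper names default_mtry() and default_min_n() are hypothetical and not part of parsnip; the rounding (floor, with a minimum of 1 for regression in randomForest) follows the documented behavior of the underlying packages, and the spark engine is left out because its rounding is not stated here.

```r
# Hypothetical helper: default mtry for p predictors.
default_mtry <- function(engine, mode, p) {
  if (engine == "randomForest" && mode == "regression") {
    max(floor(p / 3), 1)   # one-third of the predictors
  } else {
    floor(sqrt(p))         # square root of the number of predictors
  }
}

# Hypothetical helper: default min_n for ranger and randomForest.
default_min_n <- function(mode) {
  if (mode == "classification") 1 else 5
}

p <- 30
default_mtry("randomForest", "regression", p)
default_mtry("randomForest", "classification", p)
default_mtry("ranger", "regression", p)
default_min_n("regression")
```

With 30 predictors, the regression default differs between the two engines, which is one reason to set mtry explicitly when comparing them.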

Details

The model can be fit with the fit() function using the following engines:

R: "ranger" (the default) or "randomForest"
Spark: "spark"
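A minimal sketch of fitting with one of the R engines is shown below. It assumes that both parsnip and ranger are installed; the built-in mtcars data and the parameter values are chosen only for illustration.

```r
library(parsnip)

# Specify the model, then choose the engine explicitly.
spec <- rand_forest(mode = "regression", mtry = 3, trees = 200, min_n = 5)
spec <- set_engine(spec, "ranger")

# Fit with the formula interface on a built-in data set.
rf_fit <- fit(spec, mpg ~ ., data = mtcars)

# Predictions come back as a tibble with one row per row of new_data.
preds <- predict(rf_fit, new_data = mtcars[1:3, ])
preds
```

Switching to set_engine(spec, "randomForest") should leave the rest of the code unchanged, provided that package is installed.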

Examples

# NOT RUN {
rand_forest(mode = "classification", trees = 2000)
# Parameters can be represented by a placeholder:
rand_forest(mode = "regression", mtry = varying())
model <- rand_forest(mtry = 10, min_n = 3)
model
update(model, mtry = 1)
update(model, mtry = 1, fresh = TRUE)
# }