Perform regression or classification using random forests with a Spark DataFrame.
ml_random_forest(x, response, features, max.bins = 32L, max.depth = 5L,
  num.trees = 20L, type = c("auto", "regression", "classification"),
  ml.options = ml_options(), ...)
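A minimal sketch of a typical call, assuming a local Spark connection and the iris dataset copied into Spark (copy_to replaces the dots in iris's column names with underscores):

library(sparklyr)

sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)

# Species is categorical, so type = "auto" infers classification;
# num.trees is raised from its default of 20 to 50
fit <- ml_random_forest(
  iris_tbl,
  response = "Species",
  features = c("Petal_Length", "Petal_Width"),
  num.trees = 50L
)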
An object coercible to a Spark DataFrame (typically, a tbl_spark).
The name of the response vector (as a length-one character vector), or a formula giving a symbolic description of the model to be fitted. When response is a formula, it is used in preference to other parameters to set the response, features, and intercept parameters (if available). Currently, only simple linear combinations of existing parameters are supported; e.g. response ~ feature1 + feature2 + .... The intercept term can be omitted by using - 1 in the model fit.
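For instance, the same model can be specified through the formula interface (a sketch, reusing iris_tbl from the example above):

# formula interface: response on the left, features on the right
fit <- ml_random_forest(iris_tbl, Species ~ Petal_Length + Petal_Width)

# appending - 1 drops the intercept term (where an intercept is available)
fit_no_intercept <- ml_random_forest(
  iris_tbl,
  Petal_Width ~ Petal_Length + Sepal_Length - 1
)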
The names of the features (terms) to use for the model fit.
The maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. More bins give higher granularity.
Maximum depth of the tree (>= 0); that is, the maximum number of nodes separating any leaf from the root of the tree.
Number of trees to train (>= 1).
The type of model to fit. "regression" treats the response as a continuous variable, while "classification" treats the response as a categorical variable. When "auto" is used, the model type is inferred from the type of the response variable: if it is numeric, regression is used; otherwise, classification.
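For example, with a numeric response "auto" resolves to regression; the type can also be stated explicitly (a sketch, reusing iris_tbl from above):

# Petal_Width is numeric, so type = "auto" would infer regression;
# passing type = "regression" makes the intent explicit
fit_reg <- ml_random_forest(
  iris_tbl,
  response = "Petal_Width",
  features = c("Petal_Length", "Sepal_Length"),
  type = "regression"
)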
Optional arguments used to affect the model generated. See ml_options for more details.
Optional arguments; currently unused.
Other Spark ML routines: ml_als_factorization, ml_decision_tree, ml_generalized_linear_regression, ml_gradient_boosted_trees, ml_kmeans, ml_lda, ml_linear_regression, ml_logistic_regression, ml_multilayer_perceptron, ml_naive_bayes, ml_one_vs_rest, ml_pca, ml_survival_regression