Usage
h2o.randomForest(x, y, training_frame, model_id, validation_frame = NULL, ignore_const_cols = TRUE, checkpoint, mtries = -1, col_sample_rate_change_per_level = 1, sample_rate = 0.632, sample_rate_per_class, col_sample_rate_per_tree = 1, build_tree_one_node = FALSE, ntrees = 50, max_depth = 20, min_rows = 1, nbins = 20, nbins_top_level, nbins_cats = 1024, binomial_double_trees = FALSE, balance_classes = FALSE, class_sampling_factors, max_after_balance_size = 5, seed, offset_column = NULL, weights_column = NULL, nfolds = 0, fold_column = NULL, fold_assignment = c("AUTO", "Random", "Modulo", "Stratified"), keep_cross_validation_predictions = FALSE, keep_cross_validation_fold_assignment = FALSE, score_each_iteration = FALSE, score_tree_interval = 0, stopping_rounds = 0, stopping_metric = c("AUTO", "deviance", "logloss", "MSE", "AUC", "r2", "misclassification", "mean_per_class_error"), stopping_tolerance = 0.001, max_runtime_secs = 0, min_split_improvement, histogram_type = c("AUTO", "UniformAdaptive", "Random", "QuantilesGlobal", "RoundRobin"))
Arguments
x
A vector containing the names or indices of the predictor variables
to use in building the RF model.
y
The name or index of the response variable. If the data does not
contain a header, this is the column index number, starting at 1 and
increasing from left to right. The response must be either an integer
or a categorical variable.
training_frame
An H2OFrame object containing the
variables in the model.
model_id
(Optional) The unique id assigned to the resulting model. If
none is given, an id will automatically be generated.
validation_frame
(Optional) An H2OFrame object used for validation, containing the variables in the model. Default is NULL.
ignore_const_cols
A logical value indicating whether or not to ignore all the constant columns in the training frame.
checkpoint
Model checkpoint (provide the model_id) to resume training with.
mtries
Number of variables randomly sampled as candidates at each split.
If set to -1, defaults to sqrt(p) for classification and p/3 for regression,
where p is the number of predictors.
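The default described above can be sketched in base R. The helper name default_mtries is hypothetical, and the rounding (round down, with a minimum of 1) is an assumption; the description only gives sqrt(p) and p/3.

```r
# Hypothetical helper: approximate the number of candidate predictors
# per split when mtries = -1.
# Assumption: the value is rounded down and never drops below 1.
default_mtries <- function(p, classification = TRUE) {
  raw <- if (classification) sqrt(p) else p / 3
  max(1L, as.integer(floor(raw)))
}

default_mtries(16, classification = TRUE)   # 4
default_mtries(12, classification = FALSE)  # 4
```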
col_sample_rate_change_per_level
Relative change of the column sampling rate for every level (from 0.0 to 2.0).
sample_rate
Row sample rate per tree (from 0.0 to 1.0).
sample_rate_per_class
Row sample rate per tree per class (one per class, from 0.0 to 1.0).
col_sample_rate_per_tree
Column sample rate per tree (from 0.0 to 1.0).
build_tree_one_node
Run on one node only. This avoids network overhead, but
fewer CPUs are used. Suitable for small datasets.
ntrees
A nonnegative integer that determines the number of trees to
grow.
max_depth
Maximum depth to grow the tree.
min_rows
Minimum number of rows to assign to terminal nodes.
nbins
For numerical columns (real/int), build a histogram of (at least) this many bins, then split at the best point.
nbins_top_level
For numerical columns (real/int), build a histogram of (at most) this many bins at the root
level, then decrease by factor of two per level.
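How nbins_top_level and nbins might interact can be sketched in base R. The helper name bins_at_level is hypothetical, the value 1024 is chosen only for illustration, and treating nbins as a floor on the halved count is an assumption drawn from the "at least" and "at most" wording of the two arguments.

```r
# Hypothetical helper: number of histogram bins for numerical columns
# at a given tree level (0 = root).
# Assumption: the count halves per level and is bounded below by nbins.
bins_at_level <- function(level, nbins_top_level, nbins = 20) {
  halved <- floor(nbins_top_level / 2^level)
  max(nbins, halved)
}

sapply(0:7, bins_at_level, nbins_top_level = 1024)
# 1024 512 256 128 64 32 20 20
```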
nbins_cats
For categorical columns (factors), build a histogram of this many bins, then split at the best point.
Higher values can lead to more overfitting.
binomial_double_trees
For binary classification: build twice as many trees (one per class). This can lead to higher accuracy.
balance_classes
A logical value indicating whether or not to balance training
data class counts via over/under-sampling (for imbalanced data).
class_sampling_factors
Desired over/under-sampling ratios per class (in lexicographic
order). If not specified, sampling factors will be automatically computed to obtain class
balance during training. Requires balance_classes.
max_after_balance_size
Maximum relative size of the training data after balancing class counts (can be less
than 1.0). Ignored if balance_classes is FALSE, which is the default behavior.
seed
Seed for random numbers (affects sampling). Note: results are only
reproducible when running single-threaded.
offset_column
Specify the offset column.
weights_column
Specify the weights column.
nfolds
(Optional) Number of folds for cross-validation.
fold_column
(Optional) Column with cross-validation fold index assignment per observation.
fold_assignment
Cross-validation fold assignment scheme, used if fold_column is not
specified. Must be "AUTO", "Random", "Modulo", or "Stratified". The "Stratified" option
stratifies the folds based on the response variable, for classification problems.
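As a rough illustration of the "Modulo" scheme, the sketch below assigns folds by row position modulo nfolds. The helper name modulo_folds is hypothetical, and both the zero-based fold numbering and the use of the plain row position are assumptions rather than documented behavior.

```r
# Hypothetical helper: deterministic fold assignment by row position.
# Assumption: fold index = (row position - 1) mod nfolds, zero-based.
modulo_folds <- function(n_rows, nfolds) {
  (seq_len(n_rows) - 1L) %% nfolds
}

modulo_folds(7, 3)  # 0 1 2 0 1 2 0
```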
keep_cross_validation_predictions
Whether to keep the predictions of the cross-validation models.
keep_cross_validation_fold_assignment
Whether to keep the cross-validation fold assignment.
score_each_iteration
Attempts to score each tree.
score_tree_interval
Score the model after every so many trees. Disabled if set to 0.
stopping_rounds
Early stopping based on convergence of stopping_metric.
Stop if simple moving average of length k of the stopping_metric does not improve
(by stopping_tolerance) for k=stopping_rounds scoring events.
Can only trigger after at least 2k scoring events. Use 0 to disable.
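The stopping rule above can be sketched in base R for a metric where lower is better (for example logloss). The function name should_stop is hypothetical, and the exact comparison (best of the last k moving averages against the moving average that ended k scoring events earlier) is one plausible reading of the description, not necessarily what H2O does internally.

```r
# Hypothetical sketch of moving-average early stopping.
# history:   metric value per scoring event, oldest first; assumed
#            positive and lower-is-better.
# k:         stopping_rounds
# tolerance: stopping_tolerance (relative)
should_stop <- function(history, k, tolerance = 0.001) {
  n <- length(history)
  # Disabled when k = 0; cannot trigger before 2k scoring events.
  if (k == 0 || n < 2 * k) return(FALSE)

  # Moving averages of length k ending at each of the last k + 1 events.
  ends <- (n - k):n
  ma <- vapply(ends, function(e) mean(history[(e - k + 1):e]), numeric(1))

  reference <- ma[1]    # average that ended k events ago
  best <- min(ma[-1])   # best of the k most recent averages

  # Stop if the relative improvement is smaller than the tolerance.
  best > reference * (1 - tolerance)
}

should_stop(c(1, 0.5, 0.25, 0.125), k = 2)  # FALSE: still improving
should_stop(c(0.5, 0.5, 0.5, 0.5), k = 2)   # TRUE: no improvement
```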
stopping_metric
Metric to use for convergence checking. Only used when stopping_rounds > 0.
Can be one of "AUTO", "deviance", "logloss", "MSE", "AUC", "r2", "misclassification", or "mean_per_class_error".
stopping_tolerance
Relative tolerance for metric-based stopping criterion (if relative
improvement is not at least this much, stop).
max_runtime_secs
Maximum allowed runtime in seconds for model training. Use 0 to disable.
min_split_improvement
Minimum relative improvement in squared error reduction for a split to happen.
histogram_type
What type of histogram to use for finding optimal split points.
Can be one of "AUTO", "UniformAdaptive", "Random", "QuantilesGlobal" or "RoundRobin". Note that H2O supports
extremely randomized trees with the "Random" option.
...
(Currently Unimplemented)
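A minimal call might look like the sketch below. It assumes the h2o package is installed and that a local H2O cluster can be started; the built-in iris data set and the object names iris_hex and model are illustrative choices only.

```r
library(h2o)

# Start (or connect to) a local H2O cluster.
h2o.init()

# Upload a small illustrative data set as an H2OFrame.
iris_hex <- as.h2o(iris)

# Train a random forest: all columns except the response are predictors.
model <- h2o.randomForest(
  x = setdiff(names(iris), "Species"),
  y = "Species",
  training_frame = iris_hex,
  ntrees = 50,
  max_depth = 20,
  seed = 1234
)

# Inspect the training metrics of the fitted model.
h2o.performance(model)
```

Arguments left out of the call keep the defaults shown under Usage.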