Usage
h2o.gbm(x, y, training_frame, model_id, checkpoint, ignore_const_cols = TRUE, distribution = c("AUTO", "gaussian", "bernoulli", "multinomial", "poisson", "gamma", "tweedie", "laplace", "quantile"), quantile_alpha = 0.5, tweedie_power = 1.5, ntrees = 50, max_depth = 5, min_rows = 10, learn_rate = 0.1, learn_rate_annealing = 1, sample_rate = 1, sample_rate_per_class, col_sample_rate = 1, col_sample_rate_change_per_level = 1, col_sample_rate_per_tree = 1, nbins = 20, nbins_top_level, nbins_cats = 1024, validation_frame = NULL, balance_classes = FALSE, class_sampling_factors, max_after_balance_size = 1, seed, build_tree_one_node = FALSE, nfolds = 0, fold_column = NULL, fold_assignment = c("AUTO", "Random", "Modulo", "Stratified"), keep_cross_validation_predictions = FALSE, keep_cross_validation_fold_assignment = FALSE, score_each_iteration = FALSE, score_tree_interval = 0, stopping_rounds = 0, stopping_metric = c("AUTO", "deviance", "logloss", "MSE", "AUC", "r2", "misclassification", "mean_per_class_error"), stopping_tolerance = 0.001, max_runtime_secs = 0, offset_column = NULL, weights_column = NULL, min_split_improvement, histogram_type = c("AUTO", "UniformAdaptive", "Random", "QuantilesGlobal", "RoundRobin"), max_abs_leafnode_pred)
Arguments
x
A vector containing the names or indices of the predictor variables to use in building the GBM model.
y
The name or index of the response variable. If the data does not contain a header, this is the column index
number starting at 0, and increasing from left to right. (The response must be either an integer or a
categorical variable).
training_frame
An H2OFrame object containing the variables in the model.
model_id
(Optional) The unique id assigned to the resulting model. If
none is given, an id will automatically be generated.
checkpoint
"Model checkpoint (provide the model_id) to resume training with."
ignore_const_cols
A logical value indicating whether or not to ignore all the constant columns in the training frame.
distribution
A character
string. The distribution function of the response.
Must be "AUTO", "bernoulli", "multinomial", "poisson", "gamma", "tweedie", "laplace", "quantile" or "gaussian"
quantile_alpha
Quantile (only for Quantile regression, must be between 0 and 1)
tweedie_power
Tweedie power (only for Tweedie distribution, must be between 1 and 2)
ntrees
A nonnegative integer that determines the number of trees to grow.
max_depth
Maximum depth to grow the tree.
min_rows
Minimum number of rows to assign to teminal nodes.
learn_rate
Learning rate (from 0.0
to 1.0
)
learn_rate_annealing
Scale the learning rate by this factor after each tree (e.g., 0.99 or 0.999)
sample_rate
Row sample rate per tree (from 0.0
to 1.0
)
sample_rate_per_class
Row sample rate per tree per class (one per class, from 0.0
to 1.0
)
col_sample_rate
Column sample rate per split (from 0.0
to 1.0
)
col_sample_rate_change_per_level
Relative change of the column sampling rate for every level (from 0.0 to 2.0)
col_sample_rate_per_tree
Column sample rate per tree (from 0.0
to 1.0
)
nbins
For numerical columns (real/int), build a histogram of (at least) this many bins, then split at the best point.
nbins_top_level
For numerical columns (real/int), build a histogram of (at most) this many bins at the root
level, then decrease by factor of two per level.
nbins_cats
For categorical columns (factors), build a histogram of this many bins, then split at the best point.
Higher values can lead to more overfitting.
validation_frame
An H2OFrame object indicating the validation dataset used to contruct the
confusion matrix. Defaults to NULL. If left as NULL, this defaults to the training data when nfolds = 0
.
balance_classes
logical, indicates whether or not to balance training data class
counts via over/under-sampling (for imbalanced data).
class_sampling_factors
Desired over/under-sampling ratios per class (in lexicographic
order). If not specified, sampling factors will be automatically computed to obtain class
balance during training. Requires balance_classes.
max_after_balance_size
Maximum relative size of the training data after balancing class counts (can be less
than 1.0). Ignored if balance_classes is FALSE, which is the default behavior.
seed
Seed for random numbers (affects sampling).
build_tree_one_node
Run on one node only; no network overhead but
fewer cpus used. Suitable for small datasets.
nfolds
(Optional) Number of folds for cross-validation.
fold_column
(Optional) Column with cross-validation fold index assignment per observation
fold_assignment
Cross-validation fold assignment scheme, if fold_column is not
specified, must be "AUTO", "Random", "Modulo", or "Stratified". The Stratified option will
stratify the folds based on the response variable, for classification problems.
keep_cross_validation_predictions
Whether to keep the predictions of the cross-validation models
keep_cross_validation_fold_assignment
Whether to keep the cross-validation fold assignment.
score_each_iteration
Attempts to score each tree.
score_tree_interval
Score the model after every so many trees. Disabled if set to 0.
stopping_rounds
Early stopping based on convergence of stopping_metric.
Stop if simple moving average of length k of the stopping_metric does not improve
(by stopping_tolerance) for k=stopping_rounds scoring events.
Can only trigger after at least 2k scoring events. Use 0 to disable.
stopping_metric
Metric to use for convergence checking, only for _stopping_rounds > 0
Can be one of "AUTO", "deviance", "logloss", "MSE", "AUC", "r2", "misclassification", or "mean_per_class_error".
stopping_tolerance
Relative tolerance for metric-based stopping criterion (if relative
improvement is not at least this much, stop)
max_runtime_secs
Maximum allowed runtime in seconds for model training. Use 0 to disable.
offset_column
Specify the offset column.
weights_column
Specify the weights column.
min_split_improvement
Minimum relative improvement in squared error reduction for a split to happen.
histogram_type
What type of histogram to use for finding optimal split points
Can be one of "AUTO", "UniformAdaptive", "Random", "QuantilesGlobal" or "RoundRobin".
max_abs_leafnode_pred
Maximum absolute value of a leaf node prediction.