h2o.gbm: Gradient Boosted Machines

Description

Builds gradient boosted classification trees, and gradient boosted regression trees on a parsed data set.

Usage

h2o.gbm(x, y, training_frame, model_id, checkpoint, ignore_const_cols = TRUE, distribution = c("AUTO", "gaussian", "bernoulli", "multinomial", "poisson", "gamma", "tweedie", "laplace", "quantile"), quantile_alpha = 0.5, tweedie_power = 1.5, ntrees = 50, max_depth = 5, min_rows = 10, learn_rate = 0.1, learn_rate_annealing = 1, sample_rate = 1, sample_rate_per_class, col_sample_rate = 1, col_sample_rate_change_per_level = 1, col_sample_rate_per_tree = 1, nbins = 20, nbins_top_level, nbins_cats = 1024, validation_frame = NULL, balance_classes = FALSE, class_sampling_factors, max_after_balance_size = 1, seed, build_tree_one_node = FALSE, nfolds = 0, fold_column = NULL, fold_assignment = c("AUTO", "Random", "Modulo", "Stratified"), keep_cross_validation_predictions = FALSE, keep_cross_validation_fold_assignment = FALSE, score_each_iteration = FALSE, score_tree_interval = 0, stopping_rounds = 0, stopping_metric = c("AUTO", "deviance", "logloss", "MSE", "AUC", "r2", "misclassification", "mean_per_class_error"), stopping_tolerance = 0.001, max_runtime_secs = 0, offset_column = NULL, weights_column = NULL, min_split_improvement, histogram_type = c("AUTO", "UniformAdaptive", "Random", "QuantilesGlobal", "RoundRobin"), max_abs_leafnode_pred)

Arguments

A vector containing the names or indices of the predictor variables to use in building the GBM model.

The name or index of the response variable. If the data does not contain a header, this is the column index number starting at 0, and increasing from left to right. (The response must be either an integer or a categorical variable).

training_frame

An H2OFrame object containing the variables in the model.

model_id

(Optional) The unique id assigned to the resulting model. If none is given, an id will automatically be generated.

checkpoint

"Model checkpoint (provide the model_id) to resume training with."

ignore_const_cols

A logical value indicating whether or not to ignore all the constant columns in the training frame.

distribution

A character string. The distribution function of the response. Must be "AUTO", "bernoulli", "multinomial", "poisson", "gamma", "tweedie", "laplace", "quantile" or "gaussian"

quantile_alpha

Quantile (only for Quantile regression, must be between 0 and 1)

tweedie_power

Tweedie power (only for Tweedie distribution, must be between 1 and 2)

ntrees

A nonnegative integer that determines the number of trees to grow.

max_depth

Maximum depth to grow the tree.

min_rows

Minimum number of rows to assign to teminal nodes.

learn_rate

Learning rate (from 0.0 to 1.0)

learn_rate_annealing

Scale the learning rate by this factor after each tree (e.g., 0.99 or 0.999)

sample_rate

Row sample rate per tree (from 0.0 to 1.0)

sample_rate_per_class

Row sample rate per tree per class (one per class, from 0.0 to 1.0)

col_sample_rate

Column sample rate per split (from 0.0 to 1.0)

col_sample_rate_change_per_level

Relative change of the column sampling rate for every level (from 0.0 to 2.0)

col_sample_rate_per_tree

Column sample rate per tree (from 0.0 to 1.0)

nbins

For numerical columns (real/int), build a histogram of (at least) this many bins, then split at the best point.

nbins_top_level

For numerical columns (real/int), build a histogram of (at most) this many bins at the root level, then decrease by factor of two per level.

nbins_cats

For categorical columns (factors), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting.

validation_frame

An H2OFrame object indicating the validation dataset used to contruct the confusion matrix. Defaults to NULL. If left as NULL, this defaults to the training data when nfolds = 0.

balance_classes

logical, indicates whether or not to balance training data class counts via over/under-sampling (for imbalanced data).

class_sampling_factors

Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.

max_after_balance_size

Maximum relative size of the training data after balancing class counts (can be less than 1.0). Ignored if balance_classes is FALSE, which is the default behavior.

seed

Seed for random numbers (affects sampling).

build_tree_one_node

Run on one node only; no network overhead but fewer cpus used. Suitable for small datasets.

nfolds

(Optional) Number of folds for cross-validation.

fold_column

(Optional) Column with cross-validation fold index assignment per observation

fold_assignment

Cross-validation fold assignment scheme, if fold_column is not specified, must be "AUTO", "Random", "Modulo", or "Stratified". The Stratified option will stratify the folds based on the response variable, for classification problems.

keep_cross_validation_predictions

Whether to keep the predictions of the cross-validation models

keep_cross_validation_fold_assignment

Whether to keep the cross-validation fold assignment.

score_each_iteration

Attempts to score each tree.

score_tree_interval

Score the model after every so many trees. Disabled if set to 0.

stopping_rounds

Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve (by stopping_tolerance) for k=stopping_rounds scoring events. Can only trigger after at least 2k scoring events. Use 0 to disable.

stopping_metric

Metric to use for convergence checking, only for _stopping_rounds > 0 Can be one of "AUTO", "deviance", "logloss", "MSE", "AUC", "r2", "misclassification", or "mean_per_class_error".

stopping_tolerance

Relative tolerance for metric-based stopping criterion (if relative improvement is not at least this much, stop)

max_runtime_secs

Maximum allowed runtime in seconds for model training. Use 0 to disable.

offset_column

Specify the offset column.

weights_column

Specify the weights column.

min_split_improvement

Minimum relative improvement in squared error reduction for a split to happen.

histogram_type

What type of histogram to use for finding optimal split points Can be one of "AUTO", "UniformAdaptive", "Random", "QuantilesGlobal" or "RoundRobin".

max_abs_leafnode_pred

Maximum absolute value of a leaf node prediction.

Details

The default distribution function will guess the model type based on the response column type. In order to run properly, the response column must be an numeric for "gaussian" or an enum for "bernoulli" or "multinomial".

Examples

Run this code


library(h2o)
h2o.init()

# Run regression GBM on australia.hex data
ausPath <- system.file("extdata", "australia.csv", package="h2o")
australia.hex <- h2o.uploadFile(path = ausPath)
independent <- c("premax", "salmax","minairtemp", "maxairtemp", "maxsst",
                 "maxsoilmoist", "Max_czcs")
dependent <- "runoffnew"
h2o.gbm(y = dependent, x = independent, training_frame = australia.hex,
        ntrees = 3, max_depth = 3, min_rows = 2)

Run the code above in your browser using DataLab

Description

Usage

Arguments

Details

See Also

Examples