h2o.gbm: H2O: Gradient Boosted Machines


Builds gradient boosted classification trees and gradient boosted regression trees on a parsed data set.


h2o.gbm(x, y, distribution = "multinomial", data, key = "", n.trees = 10, 
  interaction.depth = 5, n.minobsinnode = 10, shrinkage = 0.1, n.bins = 20,
  importance = FALSE, nfolds = 0, validation, balance.classes = FALSE, 
  max.after.balance.size = 5)


x: A vector containing the names or indices of the predictor variables to use in building the GBM model.
y: The name or index of the response variable. If the data does not contain a header, this is the column index number, starting at 0 and increasing from left to right. The response must be either an integer or a categorical variable.
distribution: The type of GBM model to be produced: "multinomial" (default) for classification, "gaussian" for regression.
data: An H2OParsedData object containing the variables in the model.
key: (Optional) The unique hex key assigned to the resulting model. If none is given, a key will automatically be generated.
n.trees: (Optional) Number of trees to grow. Must be a nonnegative integer.
interaction.depth: (Optional) Maximum depth to which each tree is grown.
n.minobsinnode: (Optional) Minimum number of rows to assign to terminal nodes.
shrinkage: (Optional) A learning-rate parameter defining step size reduction.
n.bins: (Optional) Number of bins to use in building the histogram.
importance: (Optional) A logical value indicating whether variable importance should be calculated. This will increase the amount of time for the algorithm to complete.
nfolds: (Optional) Number of folds for cross-validation. If nfolds >= 2, then validation must remain empty.
validation: (Optional) An H2OParsedData object indicating the validation dataset used to construct the confusion matrix. If left blank, this defaults to the training data when nfolds = 0.
balance.classes: (Optional) Balance training data class counts via over/under-sampling (for imbalanced data).
max.after.balance.size: Maximum relative size of the training data after balancing class counts (can be less than 1.0).
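
To make the roles of n.trees, shrinkage and n.minobsinnode more concrete, the following is a minimal base-R sketch of boosting on squared-error loss. It is an illustration under simplifying assumptions (one predictor, single-split trees), not H2O's implementation; fit_stump, predict_stump and boost are hypothetical helper names that do not exist in the h2o package.

```r
# Illustrative sketch only: boosting single-split regression trees ("stumps")
# on squared-error loss in base R. This is NOT how H2O implements GBM.

# Find the split of x that minimises the squared error of the residuals r,
# keeping at least min_obs rows on each side (compare n.minobsinnode).
fit_stump <- function(x, r, min_obs) {
  best <- list(sse = sum((r - mean(r))^2), split = NA,
               left = mean(r), right = mean(r))
  for (s in sort(unique(x))) {
    left <- x <= s
    if (sum(left) < min_obs || sum(!left) < min_obs) next
    sse <- sum((r[left] - mean(r[left]))^2) +
           sum((r[!left] - mean(r[!left]))^2)
    if (sse < best$sse) {
      best <- list(sse = sse, split = s,
                   left = mean(r[left]), right = mean(r[!left]))
    }
  }
  best
}

# Predict with a fitted stump; falls back to a constant if no split was found.
predict_stump <- function(stump, x) {
  if (is.na(stump$split)) return(rep(stump$left, length(x)))
  ifelse(x <= stump$split, stump$left, stump$right)
}

# Grow n_trees stumps, each fitted to the current residuals and added to the
# running prediction after scaling by shrinkage. Returns the training MSE
# after each tree.
boost <- function(x, y, n_trees, shrinkage, min_obs) {
  pred <- rep(mean(y), length(y))  # start from a constant model
  mse <- numeric(n_trees)
  for (m in seq_len(n_trees)) {
    stump <- fit_stump(x, y - pred, min_obs)
    pred <- pred + shrinkage * predict_stump(stump, x)
    mse[m] <- mean((y - pred)^2)
  }
  mse
}

# Synthetic data, made up for this illustration.
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)

mse_slow <- boost(x, y, n_trees = 50, shrinkage = 0.1, min_obs = 10)
mse_fast <- boost(x, y, n_trees = 50, shrinkage = 0.5, min_obs = 10)
print(round(c(slow = mse_slow[50], fast = mse_fast[50]), 3))
```

In this sketch a smaller shrinkage takes smaller steps per tree, so more trees are needed to reach the same training error. This is the usual reason for tuning n.trees and shrinkage together.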


An object of class H2OGBMModel with slots key, data, valid (the validation dataset) and model, where the last is a list of the following components:
  • type: The type of the tree.
  • n.trees: Number of trees grown.
  • oob_err: Out-of-bag error rate.
  • forest: A matrix giving the minimum, mean, and maximum of the tree depth and number of leaves.
  • confusion: Confusion matrix of the prediction when a classification model is specified.


# -- CRAN examples begin --
localH2O = h2o.init()

# Run regression GBM on australia.hex data 
ausPath = system.file("extdata", "australia.csv", package="h2o")
australia.hex = h2o.importFile(localH2O, path = ausPath)
independent <- c("premax", "salmax","minairtemp", "maxairtemp", "maxsst", 
  "maxsoilmoist", "Max_czcs")
dependent <- "runoffnew"
h2o.gbm(y = dependent, x = independent, data = australia.hex, n.trees = 3, interaction.depth = 3, 
  n.minobsinnode = 2, shrinkage = 0.2, distribution= "gaussian")
# -- CRAN examples end --

# Run multinomial classification GBM on australia data 
h2o.gbm(y = dependent, x = independent, data = australia.hex, n.trees = 3, interaction.depth = 3, 
  n.minobsinnode = 2, shrinkage = 0.01, distribution= "multinomial")

# GBM variable importance
# Also see:
#   https://github.com/0xdata/h2o/blob/master/R/tests/testdir_demos/runit_demo_VI_all_algos.R
data.hex = h2o.importFile(
  localH2O,
  path = "https://raw.github.com/0xdata/h2o/master/smalldata/bank-additional-full.csv",
  key = "data.hex")
myX = 1:20
myY = "y"  # response column; assumed to be named "y" in this dataset
my.gbm <- h2o.gbm(x = myX, y = myY, distribution = "bernoulli", data = data.hex, n.trees = 100,
                  interaction.depth = 2, shrinkage = 0.01, importance = TRUE)
gbm.VI = my.gbm@model$varimp
barplot(t(gbm.VI[1]),las=2,main="VI from GBM")

