h2o.randomForest: H2O: Random Forest

Description

Builds a random forest model on a data set, for classification by default or for regression when classification = FALSE.

Usage

h2o.randomForest(x, y, data, key = "", classification = TRUE, ntree = 50, 
  depth = 20, mtries = -1, sample.rate = 2/3, nbins = 20, seed = -1,
  importance = FALSE, nfolds = 0, validation, nodesize = 1, 
  balance.classes = FALSE, max.after.balance.size = 5, doGrpSplit = TRUE,
  verbose = FALSE, oobee = TRUE, stat.type = "ENTROPY", type = "fast")

Arguments

x

A vector containing the names or indices of the predictor variables to use in building the random forest model.

y

The name or index of the response variable. If the data does not contain a header, this is the column index, counted from left to right. The response must be either an integer or a categorical variable.

data

An H2OParsedData object containing the variables in the model.

key

(Optional) The unique hex key assigned to the resulting model. If none is given, a key will automatically be generated.

classification

(Optional) A logical value indicating whether a classification model should be built (as opposed to regression).

ntree

(Optional) Number of trees to grow. (Must be a nonnegative integer).

depth

(Optional) Maximum depth to which each tree is grown.

mtries

(Optional) Number of variables randomly sampled as candidates at each split. If set to -1, defaults to sqrt(p) for classification and p/3 for regression, where p is the number of predictors.

sample.rate

(Optional) Sampling rate for constructing data from which individual trees are grown.

nbins

(Optional) Build a histogram of this many bins, then split at best point.

seed

(Optional) Seed for building the random forest. If seed = -1, one will automatically be generated by H2O.

importance

(Optional) A logical value indicating whether to calculate variable importance. Set to FALSE to speed up computations.

nfolds

(Optional) Number of folds for cross-validation. If nfolds >= 2, then validation must remain empty.

validation

(Optional) An H2OParsedData object containing the validation dataset used to construct the confusion matrix. If left blank, this defaults to the training data when nfolds = 0.

nodesize

(Optional) Number of nodes to use for computation.

balance.classes

(Optional) A logical value indicating whether to balance training data class counts via over/under-sampling (for imbalanced data).

max.after.balance.size

(Optional) Maximum relative size of the training data after balancing class counts (can be less than 1.0).

doGrpSplit

(Optional) A logical value indicating whether to check non-contiguous group splits for categorical predictors.

verbose

(Optional) A logical value indicating whether verbose results should be returned.

stat.type

(Optional) Type of statistic to use, equal to either "ENTROPY" or "GINI".

oobee

(Optional) A logical value indicating whether to calculate the out of bag error estimate.

type

(Optional) Default is "fast" mode, which builds trees in parallel and distributed, but requires all of the data to fit on a single node. Alternate mode is "BigData" mode, which builds a random forest layer-by-layer across your cluster and scales to

Value

An object of class H2ODRFModel with slots key, data, and model, where the last is a list of the following components:

ntree

Number of trees grown.

mse

Mean-squared error for each tree.

forest

A matrix giving the minimum, mean, and maximum of the tree depth and number of leaves.

confusion

Confusion matrix of the prediction.
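
The components listed above can be read from the model slot of the returned object. The following is a minimal sketch, not a definitive recipe: it assumes a running local H2O instance and the iris file bundled with the package, and the name iris.rf is chosen here only for illustration.

```r
library(h2o)
localH2O = h2o.init()
irisPath = system.file("extdata", "iris.csv", package = "h2o")
iris.hex = h2o.importFile(localH2O, path = irisPath, key = "iris.hex")

# iris.rf is an illustrative name for the returned H2ODRFModel object
iris.rf = h2o.randomForest(y = 5, x = c(2,3,4), data = iris.hex, ntree = 50)

# The model slot is a list; these are the components described above
print(iris.rf@model$ntree)      # number of trees grown
print(iris.rf@model$mse)        # mean-squared error for each tree
print(iris.rf@model$forest)     # depth and leaf summary
print(iris.rf@model$confusion)  # confusion matrix
```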

Examples

# -- CRAN examples begin --
# Run an RF model on iris data
library(h2o)
localH2O = h2o.init()
irisPath = system.file("extdata", "iris.csv", package = "h2o")
iris.hex = h2o.importFile(localH2O, path = irisPath, key = "iris.hex")
h2o.randomForest(y = 5, x = c(2,3,4), data = iris.hex, ntree = 50, depth = 100)
# -- CRAN examples end --

# RF variable importance
# Also see:
#   https://github.com/0xdata/h2o/blob/master/R/tests/testdir_demos/runit_demo_VI_all_algos.R
data.hex = h2o.importFile(
  localH2O,
  path = "https://raw.github.com/0xdata/h2o/master/smalldata/bank-additional-full.csv",
  key = "data.hex")
myX = 1:20
myY = "y"
my.rf = h2o.randomForest(x = myX, y = myY, data = data.hex, classification = TRUE,
                         ntree = 100, importance = TRUE)
rf.VI = my.rf@model$varimp
print(rf.VI)
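
The mtries default and the two stat.type options described under Arguments can be illustrated in base R without an H2O cluster. This is only a sketch of the arithmetic: the function names are hypothetical, rounding down with floor() is an assumption (the documentation gives only sqrt(p) and p/3), and the base-2 logarithm in the entropy is likewise an assumption rather than documented H2O behavior.

```r
# Default number of candidate variables when mtries = -1:
# sqrt(p) for classification, p/3 for regression.
# floor() is an assumed rounding rule, not documented behavior.
default_mtries = function(p, classification = TRUE) {
  if (classification) floor(sqrt(p)) else floor(p / 3)
}

# Impurity of a node computed from its class counts.
# Empty classes are dropped so that 0 * log(0) does not produce NaN.
entropy = function(counts) {
  prob = counts[counts > 0] / sum(counts)
  -sum(prob * log2(prob))
}

gini = function(counts) {
  prob = counts / sum(counts)
  1 - sum(prob^2)
}

print(default_mtries(20))                          # 20 predictors, classification
print(default_mtries(20, classification = FALSE))  # 20 predictors, regression
print(entropy(c(50, 50)))                          # evenly mixed two-class node
print(gini(c(50, 50)))
```

Both impurity measures are zero for a node containing a single class and largest for an evenly mixed node, which is why either can be used to score candidate splits.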
