Usage
h2o.randomForest(x, y, data, key = "", classification = TRUE, ntree = 50,
depth = 20, mtries = -1, sample.rate = 2/3, nbins = 20, seed = -1,
importance = FALSE, nfolds = 0, validation, nodesize = 1,
balance.classes = FALSE, max.after.balance.size = 5, doGrpSplit = TRUE,
verbose = FALSE, oobee = TRUE, stat.type = "ENTROPY", type = "fast")
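The signature above uses mtries = -1 as a sentinel rather than a literal count. As a minimal sketch of how that sentinel could resolve to the values given in the mtries description (sqrt(p) for classification, p/3 for regression), consider the helper below. The name resolve_mtries is hypothetical and not part of the h2o package, and rounding down and clamping to at least 1 are assumptions, since the description does not say how non-integer values are handled.

```r
# Hypothetical helper, NOT part of the h2o package: illustrates how the
# mtries = -1 sentinel maps to the documented defaults.
resolve_mtries <- function(mtries, p, classification = TRUE) {
  stopifnot(p >= 1)
  # Any value other than -1 is taken as an explicit request.
  if (mtries != -1) {
    return(as.integer(mtries))
  }
  # Assumption: round down and never return fewer than one variable.
  # The documentation only states sqrt(p) and p/3.
  value <- if (classification) floor(sqrt(p)) else floor(p / 3)
  as.integer(max(1, value))
}

# p = 16 predictors, classification: candidates per split under the sketch
print(resolve_mtries(-1, p = 16, classification = TRUE))
# p = 12 predictors, regression
print(resolve_mtries(-1, p = 12, classification = FALSE))
# An explicit value is passed through unchanged
print(resolve_mtries(5, p = 16))
```

H2O performs this resolution internally; the sketch is only meant to make the effect of leaving mtries at its default concrete.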
Arguments
x
A vector containing the names or indices of the predictor variables to use in building the random forest model.
y
The name or index of the response variable. If the data does not contain a header, this is the column index,
with columns numbered in increasing order from left to right. The response must be either an integer or a categorical variable.
data
An H2OParsedData object containing the variables in the model.
key
(Optional) The unique hex key assigned to the resulting model. If none is given, a key will automatically be generated.
classification
(Optional) A logical value indicating whether a classification model should be built (as opposed to regression).
ntree
(Optional) Number of trees to grow. Must be a nonnegative integer.
depth
(Optional) Maximum depth to grow the tree.
mtries
(Optional) Number of variables randomly sampled as candidates at each split.
If set to -1, defaults to sqrt(p) for classification and p/3 for regression, where p is the number of predictors.
sample.rate
(Optional) Sampling rate for constructing data from which individual trees are grown.
nbins
(Optional) Build a histogram of this many bins, then split at best point.
seed
(Optional) Seed for building the random forest. If seed = -1, one will automatically be generated by H2O.
importance
(Optional) A logical value indicating whether to calculate variable importance. Set to FALSE to speed up computations.
nfolds
(Optional) Number of folds for cross-validation. If nfolds >= 2, then validation must remain empty.
validation
(Optional) An H2OParsedData object indicating the validation dataset used to construct the
confusion matrix. If left blank, this defaults to the training data when nfolds = 0.
nodesize
(Optional) Number of nodes to use for computation.
balance.classes
(Optional) Balance training data class counts via over/under-sampling (for imbalanced data).
max.after.balance.size
Maximum relative size of the training data after balancing class counts (can be less than 1.0).
doGrpSplit
Check non-contiguous group splits for categorical predictors.
verbose
(Optional) A logical value indicating whether verbose results should be returned.
stat.type
(Optional) Type of statistic to use, equal to either "ENTROPY" or "GINI".
oobee
(Optional) A logical value indicating whether to calculate the out-of-bag error estimate.
type
(Optional) Default is "fast" mode, which builds trees in a parallel, distributed fashion
but requires all of the data to fit on a single node.
Alternate mode is "BigData" mode, which builds a random forest layer-by-layer across your cluster and
scales to