Builds an Uplift Random Forest model on an H2OFrame.
h2o.upliftRandomForest(
  x,
  y,
  training_frame,
  treatment_column,
  model_id = NULL,
  validation_frame = NULL,
  score_each_iteration = FALSE,
  score_tree_interval = 0,
  ignore_const_cols = TRUE,
  ntrees = 50,
  max_depth = 20,
  min_rows = 1,
  nbins = 20,
  nbins_top_level = 1024,
  nbins_cats = 1024,
  max_runtime_secs = 0,
  seed = -1,
  mtries = -2,
  sample_rate = 0.632,
  sample_rate_per_class = NULL,
  col_sample_rate_change_per_level = 1,
  col_sample_rate_per_tree = 1,
  histogram_type = c("AUTO", "UniformAdaptive", "Random", "QuantilesGlobal",
    "RoundRobin", "UniformRobust"),
  categorical_encoding = c("AUTO", "Enum", "OneHotInternal", "OneHotExplicit",
    "Binary", "Eigen", "LabelEncoder", "SortByResponse", "EnumLimited"),
  distribution = c("AUTO", "bernoulli", "multinomial", "gaussian", "poisson", "gamma",
    "tweedie", "laplace", "quantile", "huber"),
  check_constant_response = TRUE,
  uplift_metric = c("AUTO", "KL", "Euclidean", "ChiSquared"),
  auuc_type = c("AUTO", "qini", "lift", "gain"),
  auuc_nbins = -1,
  verbose = FALSE
)
Creates an H2OModel object of the right type.
x: (Optional) A vector containing the names or indices of the predictor variables to use in building the model. If x is missing, then all columns except y are used.
y: The name or column index of the response variable in the data. The response must be either a numeric or a categorical/factor variable. If the response is numeric, a regression model is trained; otherwise a classification model is trained.
training_frame: Id of the training data frame.
treatment_column: The column used to compute the uplift gain when selecting the best split for a tree. The column must divide the dataset into treatment (value 1) and control (value 0) groups. Defaults to treatment.
model_id: Destination id for this model; auto-generated if not specified.
validation_frame: Id of the validation data frame.
score_each_iteration: Logical. Whether to score during each iteration of model training. Defaults to FALSE.
score_tree_interval: Score the model after every so many trees. Disabled if set to 0. Defaults to 0.
ignore_const_cols: Logical. Ignore constant columns. Defaults to TRUE.
ntrees: Number of trees. Defaults to 50.
max_depth: Maximum tree depth (0 for unlimited). Defaults to 20.
min_rows: Fewest allowed (weighted) observations in a leaf. Defaults to 1.
nbins: For numerical columns (real/int), build a histogram of (at least) this many bins, then split at the best point. Defaults to 20.
nbins_top_level: For numerical columns (real/int), build a histogram of (at most) this many bins at the root level, then decrease by a factor of two per level. Defaults to 1024.
nbins_cats: For categorical columns (factors), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting. Defaults to 1024.
max_runtime_secs: Maximum allowed runtime in seconds for model training. Use 0 to disable. Defaults to 0.
seed: Seed for random numbers (affects the stochastic parts of the algorithm, which might or might not be enabled by default). Defaults to -1 (time-based random number).
mtries: Number of variables randomly sampled as candidates at each split. If set to -1, defaults to sqrt(p) for classification and p/3 for regression (where p is the number of predictors). Defaults to -2.
sample_rate: Row sample rate per tree (from 0.0 to 1.0). Defaults to 0.632.
sample_rate_per_class: A list of row sample rates per class (relative fraction for each class, from 0.0 to 1.0), for each tree.
col_sample_rate_change_per_level: Relative change of the column sampling rate for every level (must be > 0.0 and <= 2.0). Defaults to 1.
col_sample_rate_per_tree: Column sample rate per tree (from 0.0 to 1.0). Defaults to 1.
histogram_type: What type of histogram to use for finding optimal split points. Must be one of: "AUTO", "UniformAdaptive", "Random", "QuantilesGlobal", "RoundRobin", "UniformRobust". Defaults to AUTO.
categorical_encoding: Encoding scheme for categorical features. Must be one of: "AUTO", "Enum", "OneHotInternal", "OneHotExplicit", "Binary", "Eigen", "LabelEncoder", "SortByResponse", "EnumLimited". Defaults to AUTO.
distribution: Distribution function. Must be one of: "AUTO", "bernoulli", "multinomial", "gaussian", "poisson", "gamma", "tweedie", "laplace", "quantile", "huber". Defaults to AUTO.
check_constant_response: Logical. Check if the response column is constant. If enabled, an exception is thrown if the response column is a constant value. If disabled, the model will train regardless of whether the response column is constant. Defaults to TRUE.
uplift_metric: Divergence metric used to find the best split when building an uplift tree. Must be one of: "AUTO", "KL", "Euclidean", "ChiSquared". Defaults to AUTO.
auuc_type: Metric used to calculate the Area Under Uplift Curve. Must be one of: "AUTO", "qini", "lift", "gain". Defaults to AUTO.
auuc_nbins: Number of bins used to calculate the Area Under Uplift Curve. Defaults to -1.
verbose: Logical. Print scoring history to the console (metrics per tree). Defaults to FALSE.
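The mtries, nbins and nbins_top_level descriptions above can be read roughly as in the following base-R sketch. It computes the number of candidate columns implied by mtries = -1 and the histogram size at each tree depth. The helper names default_mtries and bins_at_depth are hypothetical. The rounding rule and the use of nbins as a lower bound are assumptions made for illustration, not confirmed H2O internals.

```r
# Hypothetical helper: candidate columns per split when mtries = -1,
# following the documented rule (sqrt(p) for classification, p/3 for regression).
# Flooring and the minimum of 1 are assumptions made for this sketch.
default_mtries <- function(p, classification = TRUE) {
  raw <- if (classification) sqrt(p) else p / 3
  max(1L, as.integer(floor(raw)))
}

# Hypothetical helper: histogram bins for numeric columns at a given tree depth.
# Starts at nbins_top_level at the root (depth 0), halves at each level,
# and is assumed never to drop below nbins.
bins_at_depth <- function(depth, nbins = 20, nbins_top_level = 1024) {
  max(nbins, nbins_top_level %/% 2^depth)
}

default_mtries(16)                          # 4
default_mtries(12, classification = FALSE)  # 4
sapply(0:7, bins_at_depth)                  # 1024 512 256 128 64 32 20 20
```

Under these assumptions, the defaults nbins = 20 and nbins_top_level = 1024 reach the nbins floor at depth 6.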
See predict.H2OModel for prediction.
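A minimal sketch of a call, assuming the h2o package and a Java runtime are available. The synthetic data, the column names (x1, x2, treatment, conversion), the toy treatment effect, and the helper names make_uplift_data and fit_uplift are all made up for illustration. The parameter values are arbitrary choices, not recommendations.

```r
# Build a small synthetic data set in base R (all values are made up).
make_uplift_data <- function(n = 1000, seed = 42) {
  set.seed(seed)
  x1 <- rnorm(n)
  x2 <- runif(n)
  treatment <- rbinom(n, 1, 0.5)
  # Assumed toy effect: treatment raises the response rate only when x1 > 0.
  p <- 0.2 + 0.3 * treatment * (x1 > 0)
  conversion <- rbinom(n, 1, p)
  # Treatment and response are stored as factors so they import as categoricals.
  data.frame(x1 = x1, x2 = x2,
             treatment = factor(treatment),
             conversion = factor(conversion))
}

# Fit an uplift model on a data frame produced by make_uplift_data().
# Starts (or connects to) a local H2O cluster, so it needs h2o and Java.
fit_uplift <- function(df) {
  h2o::h2o.init()
  train <- h2o::as.h2o(df)
  h2o::h2o.upliftRandomForest(
    x = c("x1", "x2"),
    y = "conversion",
    training_frame = train,
    treatment_column = "treatment",
    ntrees = 10,
    max_depth = 5,
    uplift_metric = "KL",
    auuc_type = "qini",
    seed = 1234
  )
}
```

With a cluster available, model <- fit_uplift(make_uplift_data()) trains the model, and h2o.predict(model, newdata) scores an H2OFrame of new rows.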