Builds an eXtreme Gradient Boosting model using the native XGBoost backend.
h2o.xgboost(x, y, training_frame, model_id = NULL, validation_frame = NULL,
nfolds = 0, keep_cross_validation_predictions = FALSE,
keep_cross_validation_fold_assignment = FALSE,
score_each_iteration = FALSE, fold_assignment = c("AUTO", "Random",
"Modulo", "Stratified"), fold_column = NULL, ignore_const_cols = TRUE,
offset_column = NULL, weights_column = NULL, stopping_rounds = 0,
stopping_metric = c("AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE",
"RMSLE", "AUC", "lift_top_group", "misclassification",
"mean_per_class_error"), stopping_tolerance = 0.001, max_runtime_secs = 0,
seed = -1, distribution = c("AUTO", "bernoulli", "multinomial",
"gaussian", "poisson", "gamma", "tweedie", "laplace", "quantile", "huber"),
tweedie_power = 1.5, ntrees = 50, max_depth = 5, min_rows = 10,
min_child_weight = 0, learn_rate = 0.1, eta = 0, sample_rate = 1,
subsample = 0, col_sample_rate = 1, colsample_bylevel = 0,
col_sample_rate_per_tree = 1, colsample_bytree = 0,
max_abs_leafnode_pred = 3.4028235e+38, max_delta_step = 0,
score_tree_interval = 0, min_split_improvement = 0, max_bin = 255,
num_leaves = 255, min_sum_hessian_in_leaf = 100, min_data_in_leaf = 0,
tree_method = c("auto", "exact", "approx", "hist"),
grow_policy = c("depthwise", "lossguide"), booster = c("gbtree",
"gblinear", "dart"), gamma = 0, reg_lambda = 1, reg_alpha = 0,
dmatrix_type = c("auto", "dense", "sparse"), backend = c("auto", "gpu",
"cpu"), gpu_id = 0)
A vector containing the names or indices of the predictor variables to use in building the model. If x is missing, then all columns except y are used.
The name of the response variable in the model. If the data does not contain a header, this is the column index, counted from the first column and increasing from left to right. (The response must be either an integer or a categorical variable.)
Id of the training data frame (Not required, to allow initial validation of model parameters).
Destination id for this model; auto-generated if not specified.
Id of the validation data frame.
Number of folds for N-fold cross-validation (0 to disable or >= 2). Defaults to 0.
Logical. Whether to keep the predictions of the cross-validation models. Defaults to FALSE.
Logical. Whether to keep the cross-validation fold assignment. Defaults to FALSE.
Logical. Whether to score during each iteration of model training. Defaults to FALSE.
Cross-validation fold assignment scheme, if fold_column is not specified. The 'Stratified' option will stratify the folds based on the response variable, for classification problems. Must be one of: "AUTO", "Random", "Modulo", "Stratified". Defaults to AUTO.
Column with cross-validation fold index assignment per observation.
Logical. Ignore constant columns. Defaults to TRUE.
Offset column. This will be added to the combination of columns before applying the link function.
Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed.
Early stopping based on convergence of stopping_metric. Stop if the simple moving average of length k of the stopping_metric does not improve for k := stopping_rounds scoring events (0 to disable). Defaults to 0.
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression). Must be one of: "AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE", "RMSLE", "AUC", "lift_top_group", "misclassification", "mean_per_class_error". Defaults to AUTO.
Relative tolerance for the metric-based stopping criterion (stop if the relative improvement is not at least this much). Defaults to 0.001.
Maximum allowed runtime in seconds for model training. Use 0 to disable. Defaults to 0.
Seed for random numbers (affects the stochastic parts of the algorithm, which might or might not be enabled by default). Defaults to -1 (time-based random number).
Distribution function. Must be one of: "AUTO", "bernoulli", "multinomial", "gaussian", "poisson", "gamma", "tweedie", "laplace", "quantile", "huber". Defaults to AUTO.
Tweedie power for Tweedie regression, must be between 1 and 2. Defaults to 1.5.
(same as n_estimators) Number of trees. Defaults to 50.
Maximum tree depth. Defaults to 5.
(same as min_child_weight) Fewest allowed (weighted) observations in a leaf. Defaults to 10.
(same as min_rows) Fewest allowed (weighted) observations in a leaf. Defaults to 0.
(same as eta) Learning rate (from 0.0 to 1.0). Defaults to 0.1.
(same as learn_rate) Learning rate (from 0.0 to 1.0). Defaults to 0.
(same as subsample) Row sample rate per tree (from 0.0 to 1.0). Defaults to 1.
(same as sample_rate) Row sample rate per tree (from 0.0 to 1.0). Defaults to 0.
(same as colsample_bylevel) Column sample rate (from 0.0 to 1.0). Defaults to 1.
(same as col_sample_rate) Column sample rate (from 0.0 to 1.0). Defaults to 0.
(same as colsample_bytree) Column sample rate per tree (from 0.0 to 1.0). Defaults to 1.
(same as col_sample_rate_per_tree) Column sample rate per tree (from 0.0 to 1.0). Defaults to 0.
(same as max_delta_step) Maximum absolute value of a leaf node prediction. Defaults to 3.4028235e+38.
(same as max_abs_leafnode_pred) Maximum absolute value of a leaf node prediction. Defaults to 0.0.
Score the model after every so many trees. Disabled if set to 0. Defaults to 0.
(same as gamma) Minimum relative improvement in squared error reduction for a split to happen. Defaults to 0.0.
For tree_method=hist only: maximum number of bins. Defaults to 255.
For tree_method=hist only: maximum number of leaves. Defaults to 255.
For tree_method=hist only: the minimum sum of hessian in a leaf to keep splitting. Defaults to 100.0.
For tree_method=hist only: the minimum data in a leaf to keep splitting. Defaults to 0.0.
Tree method. Must be one of: "auto", "exact", "approx", "hist". Defaults to auto.
Grow policy: depthwise is standard GBM, lossguide is LightGBM. Must be one of: "depthwise", "lossguide". Defaults to depthwise.
Booster type. Must be one of: "gbtree", "gblinear", "dart". Defaults to gbtree.
(same as min_split_improvement) Minimum relative improvement in squared error reduction for a split to happen. Defaults to 0.0.
L2 regularization. Defaults to 1.0.
L1 regularization. Defaults to 0.0.
Type of DMatrix. For sparse, NAs and 0 are treated equally. Must be one of: "auto", "dense", "sparse". Defaults to auto.
Backend. By default (auto), a GPU is used if available. Must be one of: "auto", "gpu", "cpu". Defaults to auto.
Which GPU to use. Defaults to 0.
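The arguments above can be combined as in the following minimal sketch, which is an illustration and not part of the original reference. It assumes the h2o package is installed and a local H2O cluster can be started. The built-in iris data, the 80/20 split, and every parameter value are arbitrary choices, not recommendations. The call is guarded with h2o.xgboost.available(), since XGBoost may not be available on every platform.

```r
library(h2o)

# Start (or connect to) a local H2O cluster.
h2o.init()

# XGBoost may be unavailable on some platforms, so check before training.
if (h2o.xgboost.available()) {
  # iris ships with R; Species is a factor, so this is a classification problem.
  iris_hf <- as.h2o(iris)

  # Hold out roughly 20% of the rows as a validation frame (assumed split).
  splits <- h2o.splitFrame(iris_hf, ratios = 0.8, seed = 42)
  train <- splits[[1]]
  valid <- splits[[2]]

  model <- h2o.xgboost(
    x = setdiff(colnames(iris_hf), "Species"),  # all columns except the response
    y = "Species",
    training_frame = train,
    validation_frame = valid,
    ntrees = 100,
    max_depth = 5,
    learn_rate = 0.1,
    score_tree_interval = 5,      # score the model every 5 trees
    stopping_rounds = 3,          # early stopping over 3 scoring events
    stopping_metric = "logloss",
    stopping_tolerance = 0.001,
    seed = 42                     # fixed seed for repeatable runs
  )

  # Evaluate on the held-out frame and print the validation logloss.
  perf <- h2o.performance(model, newdata = valid)
  print(h2o.logloss(perf))
}
```

Setting score_tree_interval together with stopping_rounds keeps scoring events at a fixed spacing, so the early stopping check runs every few trees and training may finish with fewer than ntrees trees.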