Builds an eXtreme Gradient Boosting model using the native XGBoost backend.
h2o.xgboost(
x,
y,
training_frame,
model_id = NULL,
validation_frame = NULL,
nfolds = 0,
keep_cross_validation_models = TRUE,
keep_cross_validation_predictions = FALSE,
keep_cross_validation_fold_assignment = FALSE,
score_each_iteration = FALSE,
fold_assignment = c("AUTO", "Random", "Modulo", "Stratified"),
fold_column = NULL,
ignore_const_cols = TRUE,
offset_column = NULL,
weights_column = NULL,
stopping_rounds = 0,
stopping_metric = c("AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE", "RMSLE",
"AUC", "AUCPR", "lift_top_group", "misclassification", "mean_per_class_error",
"custom", "custom_increasing"),
stopping_tolerance = 0.001,
max_runtime_secs = 0,
seed = -1,
distribution = c("AUTO", "bernoulli", "multinomial", "gaussian", "poisson", "gamma",
"tweedie", "laplace", "quantile", "huber"),
tweedie_power = 1.5,
categorical_encoding = c("AUTO", "Enum", "OneHotInternal", "OneHotExplicit",
"Binary", "Eigen", "LabelEncoder", "SortByResponse", "EnumLimited"),
quiet_mode = TRUE,
checkpoint = NULL,
export_checkpoints_dir = NULL,
ntrees = 50,
max_depth = 6,
min_rows = 1,
min_child_weight = 1,
learn_rate = 0.3,
eta = 0.3,
sample_rate = 1,
subsample = 1,
col_sample_rate = 1,
colsample_bylevel = 1,
col_sample_rate_per_tree = 1,
colsample_bytree = 1,
colsample_bynode = 1,
max_abs_leafnode_pred = 0,
max_delta_step = 0,
monotone_constraints = NULL,
interaction_constraints = NULL,
score_tree_interval = 0,
min_split_improvement = 0,
gamma = 0,
nthread = -1,
save_matrix_directory = NULL,
build_tree_one_node = FALSE,
calibrate_model = FALSE,
calibration_frame = NULL,
calibration_method = c("AUTO", "PlattScaling", "IsotonicRegression"),
max_bins = 256,
max_leaves = 0,
sample_type = c("uniform", "weighted"),
normalize_type = c("tree", "forest"),
rate_drop = 0,
one_drop = FALSE,
skip_drop = 0,
tree_method = c("auto", "exact", "approx", "hist"),
grow_policy = c("depthwise", "lossguide"),
booster = c("gbtree", "gblinear", "dart"),
reg_lambda = 1,
reg_alpha = 0,
dmatrix_type = c("auto", "dense", "sparse"),
backend = c("auto", "gpu", "cpu"),
gpu_id = NULL,
gainslift_bins = -1,
auc_type = c("AUTO", "NONE", "MACRO_OVR", "WEIGHTED_OVR", "MACRO_OVO",
"WEIGHTED_OVO"),
scale_pos_weight = 1,
verbose = FALSE
)
(Optional) A vector containing the names or indices of the predictor variables to use in building the model. If x is missing, then all columns except y are used.
The name or column index of the response variable in the data. The response must be either a numeric or a categorical/factor variable. If the response is numeric, then a regression model will be trained, otherwise it will train a classification model.
Id of the training data frame.
Destination id for this model; auto-generated if not specified.
Id of the validation data frame.
Number of folds for K-fold cross-validation (0 to disable or >= 2). Defaults to 0.
Logical. Whether to keep the cross-validation models. Defaults to TRUE.
Logical. Whether to keep the predictions of the cross-validation models. Defaults to FALSE.
Logical. Whether to keep the cross-validation fold assignment. Defaults to FALSE.
Logical. Whether to score during each iteration of model training. Defaults to FALSE.
Cross-validation fold assignment scheme, if fold_column is not specified. The 'Stratified' option will stratify the folds based on the response variable, for classification problems. Must be one of: "AUTO", "Random", "Modulo", "Stratified". Defaults to AUTO.
Column with cross-validation fold index assignment per observation.
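As a minimal sketch of the cross-validation arguments above (assuming the predictors, response and train objects defined in the example at the end of this page):

# 5-fold cross-validation with stratified fold assignment
cv_xgb <- h2o.xgboost(x = predictors, y = response,
                      training_frame = train,
                      nfolds = 5,
                      fold_assignment = "Stratified",
                      keep_cross_validation_predictions = TRUE,
                      seed = 1234)
h2o.performance(cv_xgb, xval = TRUE)  # cross-validated metrics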
Logical. Ignore constant columns. Defaults to TRUE.
Offset column. This will be added to the combination of columns before applying the link function.
Column with observation weights. Giving an observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: weights are per-row observation weights and do not increase the size of the data frame. The weight is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss-function pre-factor. If you set weight = 0 for a row, the returned prediction for that row is zero, which is incorrect; to get an accurate prediction, remove all rows with weight == 0 before predicting.
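For illustration, a hedged sketch of per-row weights (the obs_weight column and the fare-based rule are hypothetical; frames as in the example at the end of this page):

# Give rows with fare > 50 twice the weight of the others (illustrative rule only)
train$obs_weight <- ifelse(train$fare > 50, 2, 1)
weighted_xgb <- h2o.xgboost(x = predictors, y = response,
                            training_frame = train,
                            weights_column = "obs_weight",
                            seed = 1234)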
Early stopping based on convergence of stopping_metric. Stop if the simple moving average of length k of the stopping_metric does not improve for k := stopping_rounds scoring events (0 to disable). Defaults to 0.
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client. Must be one of: "AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE", "RMSLE", "AUC", "AUCPR", "lift_top_group", "misclassification", "mean_per_class_error", "custom", "custom_increasing". Defaults to AUTO.
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much). Defaults to 0.001.
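For example, a sketch of metric-based early stopping (values are illustrative, not recommendations; frames as in the example at the end of this page):

# Stop when the moving average of validation logloss fails to improve
# by at least 0.1% over 3 consecutive scoring events
es_xgb <- h2o.xgboost(x = predictors, y = response,
                      training_frame = train, validation_frame = valid,
                      ntrees = 500,
                      score_tree_interval = 10,
                      stopping_rounds = 3,
                      stopping_metric = "logloss",
                      stopping_tolerance = 0.001,
                      seed = 1234)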
Maximum allowed runtime in seconds for model training. Use 0 to disable. Defaults to 0.
Seed for random numbers (affects certain parts of the algo that are stochastic and those might or might not be enabled by default). Defaults to -1 (time-based random number).
Distribution function. Must be one of: "AUTO", "bernoulli", "multinomial", "gaussian", "poisson", "gamma", "tweedie", "laplace", "quantile", "huber". Defaults to AUTO.
Tweedie power for Tweedie regression, must be between 1 and 2. Defaults to 1.5.
Encoding scheme for categorical features. Must be one of: "AUTO", "Enum", "OneHotInternal", "OneHotExplicit", "Binary", "Eigen", "LabelEncoder", "SortByResponse", "EnumLimited". Defaults to AUTO.
Logical. Enable quiet mode. Defaults to TRUE.
Model checkpoint to resume training with.
Automatically export generated models to this directory.
(same as n_estimators) Number of trees. Defaults to 50.
Maximum tree depth (0 for unlimited). Defaults to 6.
(same as min_child_weight) Fewest allowed (weighted) observations in a leaf. Defaults to 1.
(same as min_rows) Fewest allowed (weighted) observations in a leaf. Defaults to 1.
(same as eta) Learning rate (from 0.0 to 1.0). Defaults to 0.3.
(same as learn_rate) Learning rate (from 0.0 to 1.0). Defaults to 0.3.
(same as subsample) Row sample rate per tree (from 0.0 to 1.0). Defaults to 1.
(same as sample_rate) Row sample rate per tree (from 0.0 to 1.0). Defaults to 1.
(same as colsample_bylevel) Column sample rate (from 0.0 to 1.0). Defaults to 1.
(same as col_sample_rate) Column sample rate (from 0.0 to 1.0). Defaults to 1.
(same as colsample_bytree) Column sample rate per tree (from 0.0 to 1.0). Defaults to 1.
(same as col_sample_rate_per_tree) Column sample rate per tree (from 0.0 to 1.0). Defaults to 1.
Column sample rate per tree node (from 0.0 to 1.0). Defaults to 1.
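A sketch combining the core tree-size and sampling arguments above (the values are illustrative only; frames as in the example at the end of this page):

tuned_xgb <- h2o.xgboost(x = predictors, y = response,
                         training_frame = train, validation_frame = valid,
                         ntrees = 200,
                         max_depth = 4,
                         learn_rate = 0.1,
                         sample_rate = 0.8,              # row subsampling per tree
                         col_sample_rate_per_tree = 0.8, # column subsampling per tree
                         seed = 1234)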
(same as max_delta_step) Maximum absolute value of a leaf node prediction. Defaults to 0.0.
(same as max_abs_leafnode_pred) Maximum absolute value of a leaf node prediction. Defaults to 0.0.
A mapping representing monotonic constraints. Use +1 to enforce an increasing constraint and -1 to specify a decreasing constraint.
A set of allowed column interactions.
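One plausible way to pass constraints (the column names are hypothetical and the exact interaction_constraints format shown is an assumption; frames as in the example at the end of this page):

# Require predictions to be non-decreasing in age and non-increasing in fare,
# and only allow splits to interact within the listed groups (illustrative)
constrained_xgb <- h2o.xgboost(x = predictors, y = response,
                               training_frame = train,
                               monotone_constraints = list(age = 1, fare = -1),
                               interaction_constraints = list(c("age", "fare"),
                                                              c("pclass", "sex")),
                               seed = 1234)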
Score the model after every so many trees. Disabled if set to 0. Defaults to 0.
(same as gamma) Minimum relative improvement in squared error reduction for a split to happen. Defaults to 0.0.
(same as min_split_improvement) Minimum relative improvement in squared error reduction for a split to happen. Defaults to 0.0.
Number of parallel threads that can be used to run XGBoost. Cannot exceed H2O cluster limits (-nthreads parameter). Defaults to the maximum available (-1).
Directory in which to save matrices passed to the XGBoost library. Useful for debugging.
Logical. Run on one node only; no network overhead but fewer CPUs used. Suitable for small datasets. Defaults to FALSE.
Logical. Use Platt Scaling (default) or Isotonic Regression to calculate calibrated class probabilities. Calibration can provide more accurate estimates of class probabilities. Defaults to FALSE.
Data for model calibration.
Calibration method to use. Must be one of: "AUTO", "PlattScaling", "IsotonicRegression". Defaults to AUTO.
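A minimal calibration sketch (holding out part of the training data as a calibration frame; the split is illustrative):

parts <- h2o.splitFrame(train, ratios = 0.8, seed = 1234)
calibrated_xgb <- h2o.xgboost(x = predictors, y = response,
                              training_frame = parts[[1]],
                              calibrate_model = TRUE,
                              calibration_frame = parts[[2]],
                              calibration_method = "IsotonicRegression",
                              seed = 1234)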
For tree_method=hist only: maximum number of bins. Defaults to 256.
For tree_method=hist only: maximum number of leaves. Defaults to 0.
For booster=dart only: sample_type. Must be one of: "uniform", "weighted". Defaults to uniform.
For booster=dart only: normalize_type. Must be one of: "tree", "forest". Defaults to tree.
For booster=dart only: rate_drop (0..1). Defaults to 0.0.
Logical. For booster=dart only: one_drop. Defaults to FALSE.
For booster=dart only: skip_drop (0..1). Defaults to 0.0.
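A short DART sketch complementing the full example at the end of this page (values are illustrative):

dart_xgb <- h2o.xgboost(x = predictors, y = response,
                        training_frame = train, validation_frame = valid,
                        booster = "dart",
                        rate_drop = 0.1,  # drop ~10% of trees per boosting iteration
                        skip_drop = 0.5,  # skip the dropout procedure half of the time
                        one_drop = TRUE,  # always drop at least one tree when not skipped
                        seed = 1234)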
Tree method. Must be one of: "auto", "exact", "approx", "hist". Defaults to auto.
Grow policy: depthwise is standard GBM, lossguide is LightGBM. Must be one of: "depthwise", "lossguide". Defaults to depthwise.
Booster type. Must be one of: "gbtree", "gblinear", "dart". Defaults to gbtree.
L2 regularization. Defaults to 1.0.
L1 regularization. Defaults to 0.0.
Type of DMatrix. For sparse, NAs and 0 are treated equally. Must be one of: "auto", "dense", "sparse". Defaults to auto.
Backend. By default (auto), a GPU is used if available. Must be one of: "auto", "gpu", "cpu". Defaults to auto.
Which GPU(s) to use.
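For example, to force CPU training or pin a specific GPU (GPU availability, and gpu_id = 0 denoting the first device, are assumptions about your cluster):

cpu_xgb <- h2o.xgboost(x = predictors, y = response, training_frame = train,
                       backend = "cpu", seed = 1234)
gpu_xgb <- h2o.xgboost(x = predictors, y = response, training_frame = train,
                       backend = "gpu", gpu_id = 0, seed = 1234)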
Gains/Lift table number of bins. 0 means disabled. Default value -1 means automatic binning. Defaults to -1.
Set default multinomial AUC type. Must be one of: "AUTO", "NONE", "MACRO_OVR", "WEIGHTED_OVR", "MACRO_OVO", "WEIGHTED_OVO". Defaults to AUTO.
Controls the effect of observations with positive labels in relation to the observations with negative labels on gradient calculation. Useful for imbalanced problems. Defaults to 1.0.
Logical. Print scoring history to the console (metrics per tree). Defaults to FALSE.
if (FALSE) {
library(h2o)
h2o.init()
# Import the titanic dataset
f <- "https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv"
titanic <- h2o.importFile(f)
# Set predictors and response; set response as a factor
titanic['survived'] <- as.factor(titanic['survived'])
predictors <- setdiff(colnames(titanic), colnames(titanic)[2:3])
response <- "survived"
# Split the dataset into train and valid
splits <- h2o.splitFrame(data = titanic, ratios = .8, seed = 1234)
train <- splits[[1]]
valid <- splits[[2]]
# Train the XGB model
titanic_xgb <- h2o.xgboost(x = predictors, y = response,
training_frame = train, validation_frame = valid,
booster = "dart", normalize_type = "tree",
seed = 1234)
}