h2o_automl: Automated H2O's AutoML

Description

This function lets the user create a robust and fast model using H2O's AutoML function. The result is a list with the best model, its parameters, datasets, performance metrics, variable importance, and plots. Read more about the h2o_automl() pipeline here.

Usage

h2o_automl(
  df,
  y = "tag",
  ignore = NULL,
  train_test = NA,
  split = 0.7,
  weight = NULL,
  target = "auto",
  balance = FALSE,
  impute = FALSE,
  no_outliers = TRUE,
  unique_train = TRUE,
  center = FALSE,
  scale = FALSE,
  thresh = 10,
  seed = 0,
  nfolds = 5,
  max_models = 3,
  max_time = 10 * 60,
  start_clean = FALSE,
  exclude_algos = c("StackedEnsemble", "DeepLearning"),
  include_algos = NULL,
  plots = TRUE,
  alarm = TRUE,
  quiet = FALSE,
  print = TRUE,
  save = FALSE,
  subdir = NA,
  project = "AutoML Results",
  verbosity = NULL,
  ...
)
# S3 method for h2o_automl
plot(x, ...)
# S3 method for h2o_automl
print(x, importance = TRUE, ...)

Value

List. Trained model, predicted scores and datasets used, performance metrics, parameters, importance data.frame, seed, and plots when plots=TRUE.

Arguments

df: Dataframe. Dataframe containing all your data, including the dependent variable labeled as 'tag'. If you want to define which variable should be used instead, use the y parameter.
y: Variable or Character. Name of the dependent variable or response.
ignore: Character vector. Columns the model should ignore.
train_test: Character. If needed, name of the column in df holding 'train' and 'test' values used to split the data.
split: Numeric. Value between 0 and 1 used to split the data into train/test datasets; the value is the share assigned to the training set. Set it to 1 to train with all available data and test with the same data (cross-validation will still be used when training). If train_test is set, the value will be overwritten with its real split rate.
weight: Column with observation weights. Giving an observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed.
target: Value. Which is your target positive value? If set to 'auto', the target with the largest mean(score) will be selected. Change the value to overwrite. Only used for binary categorical models.
balance: Boolean. Auto-balance train dataset with under-sampling?
impute: Boolean. Fill NA values with MICE?
no_outliers: Boolean/Numeric. Remove y's outliers from the dataset? Will remove those values that are farther than n standard deviations from the dependent variable's mean (Z-score). Set to TRUE for default (3) or numeric to set a different multiplier.
unique_train: Boolean. Keep only unique row observations for training data?
center, scale: Boolean. Using the base function scale, do you wish to center and/or scale all numerical values?
thresh: Integer. Threshold for choosing between classification and regression models: the number of unique values found in 'tag' (more than: regression; less than: classification).
seed: Integer. Set a seed for reproducibility. AutoML can only guarantee reproducibility if max_models is used because max_time is resource limited.
nfolds: Number of folds for k-fold cross-validation. Must be >= 2; defaults to 5. Use 0 to disable cross-validation; this will also disable Stacked Ensemble (thus decreasing the overall model performance).
max_models, max_time: Numeric. Maximum number of models and maximum number of seconds you wish the function to iterate for. Note that max_models guarantees reproducibility and max_time does not (because it depends entirely on your machine's computational characteristics).
start_clean: Boolean. Erase everything in the current h2o instance before we start to train models? You may want to keep other models or not. To group results into a custom common AutoML project, you may use the project_name argument.
exclude_algos, include_algos: Vector of character strings. Algorithms to skip or include during the model-building phase. Set NULL to ignore. When both are defined, only include_algos will be valid.
plots: Boolean. Create plots objects?
alarm: Boolean. Ping (sound) when done. Requires beepr.
quiet: Boolean. Quiet all messages, warnings, recommendations?
print: Boolean. Print summary when process ends?
save: Boolean. Do you wish to save/export results into your working directory?
subdir: Character. In which directory do you wish to save the results? Working directory as default.
project: Character. Your project's name
verbosity: Verbosity of the backend messages printed during training; optional. Must be one of NULL (live log disabled), "debug", "info", "warn", "error". Defaults to NULL in this function.
...: Additional parameters passed to h2o::h2o.automl
x: h2o_automl object
importance: Boolean. Print important variables?
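
The no_outliers and thresh arguments above describe simple rules that can be sketched in base R. This is an illustration only, not the package's own code: the helper names drop_y_outliers() and guess_model_type() and the toy data frame are made up here, and treating a unique-value count exactly equal to thresh as classification is an assumption, since the description only covers "more than" and "less than".

```r
# Illustrative sketch only; these helpers are hypothetical, not lares functions.

# no_outliers: drop rows whose response lies more than n standard deviations
# from its mean (Z-score). TRUE maps to the documented default multiplier, 3.
drop_y_outliers <- function(df, y, no_outliers = TRUE) {
  if (isFALSE(no_outliers) || !is.numeric(df[[y]])) return(df)
  n <- if (isTRUE(no_outliers)) 3 else no_outliers
  z <- (df[[y]] - mean(df[[y]], na.rm = TRUE)) / sd(df[[y]], na.rm = TRUE)
  # Rows with a missing response are kept; only clear outliers are dropped.
  df[is.na(z) | abs(z) <= n, , drop = FALSE]
}

# thresh: pick the model type from the number of unique response values.
# Assumption: a count exactly equal to thresh is treated as classification.
guess_model_type <- function(values, thresh = 10) {
  if (length(unique(values)) > thresh) "regression" else "classification"
}

# Toy data: twenty identical responses and one extreme value.
toy <- data.frame(tag = c(rep(10, 20), 1000), x = 1:21)
clean <- drop_y_outliers(toy, "tag")
nrow(clean)

type_few <- guess_model_type(toy$tag)              # 2 unique values
type_many <- guess_model_type(seq(0, 1, by = 0.01)) # 101 unique values
```

Passing a number instead of TRUE, for example no_outliers = 2, tightens the cut-off in the same way the argument description suggests.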

List of algorithms

DRF: Distributed Random Forest, including Random Forest (RF) and Extremely-Randomized Trees (XRT)
GLM: Generalized Linear Model
XGBoost: eXtreme Gradient Boosting
GBM: Gradient Boosting Machine
DeepLearning: Fully-connected multi-layer artificial neural network
StackedEnsemble: Stacked Ensemble
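
The precedence between exclude_algos and include_algos (when both are defined, only include_algos is used) can be pictured with a small base-R sketch. The helper resolve_algos() and the known_algos vector are hypothetical names introduced for illustration; the actual filtering is done by the package and by H2O, and may differ in detail.

```r
# Hypothetical helper for illustration; not part of lares or h2o.
known_algos <- c("DRF", "GLM", "XGBoost", "GBM", "DeepLearning", "StackedEnsemble")

resolve_algos <- function(exclude_algos = c("StackedEnsemble", "DeepLearning"),
                          include_algos = NULL) {
  if (!is.null(include_algos)) {
    # include_algos takes precedence whenever it is defined.
    return(intersect(known_algos, include_algos))
  }
  # Otherwise keep everything that is not explicitly excluded.
  setdiff(known_algos, exclude_algos)
}

default_set <- resolve_algos()                                # documented defaults
only_two <- resolve_algos(include_algos = c("GBM", "GLM"))    # include wins
everything <- resolve_algos(exclude_algos = NULL)             # nothing skipped
```

With the documented defaults this leaves DRF, GLM, XGBoost, and GBM; setting exclude_algos = NULL, as the regression example does, lets every listed algorithm run.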

Methods

print: Use the print method to print model stats and summary
plot: Use the plot method to plot results using mplot_full()

Examples

if (FALSE) {
# CRAN
library(lares)
data(dft) # Titanic dataset
dft <- subset(dft, select = -c(Ticket, PassengerId, Cabin))

# Classification: Binomial - 2 Classes
r <- h2o_automl(dft, y = Survived, max_models = 1, impute = FALSE, target = "TRUE", alarm = FALSE)

# Let's see all the stuff we have inside:
lapply(r, names)

# Classification: Multi-Categorical - 3 Classes
r <- h2o_automl(dft, Pclass, ignore = c("Fare", "Cabin"), max_time = 30, plots = FALSE)

# Regression: Continuous Values
r <- h2o_automl(dft, y = "Fare", ignore = c("Pclass"), exclude_algos = NULL, quiet = TRUE)
print(r)

# WITH PRE-DEFINED TRAIN/TEST DATAFRAMES
splits <- msplit(dft, size = 0.8)
splits$train$split <- "train"
splits$test$split <- "test"
df <- rbind(splits$train, splits$test)
r <- h2o_automl(df, "Survived", max_models = 1, train_test = "split")
}
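
The last example builds the train_test column with msplit(). Where only the labelled column is needed, something along these lines in base R should give the same kind of input. This is a sketch under assumptions: the toy data frame, its columns, and the seed value are made up for illustration, and the sampling is not claimed to match what msplit() does internally.

```r
# Base-R sketch of a 'train'/'test' label column; toy data is made up.
set.seed(0)  # arbitrary seed, only so the sketch is repeatable
toy <- data.frame(id = 1:100, tag = rep(c(TRUE, FALSE), 50))

# Sample 80% of the row indices for training; the rest become test rows.
n_train <- round(0.8 * nrow(toy))
train_rows <- sample(nrow(toy), n_train)

toy$split <- "test"
toy$split[train_rows] <- "train"

table(toy$split)
```

The resulting data frame could then be passed with train_test = "split", in the same way as df in the example above.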