- df
Dataframe. Dataframe containing all your data, including
the dependent variable labeled as 'tag'. If you want to define
which variable should be used instead, use the y parameter.
- y
Variable or Character. Name of the dependent variable or response.
- ignore
Character vector. Columns the model must ignore.
- train_test
Character. If needed, the name of the column in df that holds
'test' and 'train' values to use for the split.
- split
Numeric. Value between 0 and 1 used to split the data into train/test
datasets; the value is the share assigned to the training set. Set to 1 to
train with all available data and test with the same data (cross-validation
will still be used when training). If train_test is set, this value is
overwritten with the actual split rate.
- weight
Column with observation weights. Giving some observation a
weight of zero is equivalent to excluding it from the dataset; giving an
observation a relative weight of 2 is equivalent to repeating that
row twice. Negative weights are not allowed.
- target
Value. Which is your target positive value? If set to 'auto', the
target with the largest mean(score) will be selected. Change the value to
overwrite. Only used for binary categorical models.
- balance
Boolean. Auto-balance train dataset with under-sampling?
- impute
Boolean. Fill NA values with MICE?
- no_outliers
Boolean/Numeric. Remove y's outliers from the dataset?
Removes the values that are farther than n standard deviations from
the dependent variable's mean (Z-score). Set to TRUE for the default (3)
or to a numeric value to use a different multiplier.
- unique_train
Boolean. Keep only unique row observations for training data?
- center, scale
Boolean. Using the base function scale, do you wish
to center and/or scale all numerical values?
- thresh
Integer. Threshold for choosing between classification and regression
models: the number of unique values in 'tag' is compared against this
number (more than: regression; less than: classification).
- seed
Integer. Set a seed for reproducibility. AutoML can only
guarantee reproducibility if max_models is used because max_time is
resource limited.
- nfolds
Number of folds for k-fold cross-validation. Must be >= 2; defaults to 5. Use 0 to disable cross-validation;
this will also disable Stacked Ensemble (thus decreasing the overall model performance).
- max_models, max_time
Numeric. Maximum number of models and seconds
you wish for the function to iterate. Note that max_models guarantees
reproducibility and max_time does not (because it depends entirely on your
machine's computational characteristics).
- start_clean
Boolean. Erase everything in the current h2o
instance before we start to train models? You may or may not want to keep
other models. To group results into a custom common AutoML project, you may
use the project_name argument.
- exclude_algos, include_algos
Vector of character strings. Algorithms
to skip or include during the model-building phase. Set NULL to ignore.
When both are defined, only include_algos will be valid.
- plots
Boolean. Create plot objects?
- alarm
Boolean. Ping (sound) when done. Requires beepr.
- quiet
Boolean. Quiet all messages, warnings, recommendations?
- print
Boolean. Print summary when process ends?
- save
Boolean. Do you wish to save/export results into your
working directory?
- subdir
Character. In which directory do you wish to save
the results? Working directory as default.
- project
Character. Your project's name
- verbosity
Verbosity of the backend messages printed during training; Optional.
Must be one of NULL (live log disabled), "debug", "info", "warn", "error". Defaults to "warn".
- ...
Additional parameters passed to h2o::h2o.automl
- x
h2o_automl object
- importance
Boolean. Print important variables?
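
The thresh, no_outliers and split arguments describe simple preprocessing rules that can be illustrated in base R. The following is only a sketch of those rules as worded above, not the package's actual implementation. The helper names (guess_model_type, drop_outliers, split_train_test), the default values thresh = 10 and split = 0.7, and the handling of a count exactly equal to thresh are assumptions made for illustration.

```r
# Hypothetical helpers illustrating the documented rules (not the package's code).

# thresh: more unique response values than thresh suggests regression.
# Treating a count equal to thresh as classification is an assumption.
guess_model_type <- function(y, thresh = 10) {
  if (length(unique(y)) > thresh) "regression" else "classification"
}

# no_outliers: drop rows whose response lies more than n standard
# deviations from its mean (Z-score); n = 3 is the documented default.
drop_outliers <- function(df, y = "tag", n = 3) {
  z <- (df[[y]] - mean(df[[y]])) / sd(df[[y]])
  df[abs(z) <= n, , drop = FALSE]
}

# split: share of rows sampled for training; split = 1 trains and tests
# on the same data, as described for the split argument.
split_train_test <- function(df, split = 0.7, seed = 0) {
  stopifnot(split > 0, split <= 1)
  set.seed(seed)
  in_train <- seq_len(nrow(df)) %in% sample(nrow(df), floor(split * nrow(df)))
  list(
    train = df[in_train, , drop = FALSE],
    test = if (split == 1) df else df[!in_train, , drop = FALSE]
  )
}

# Toy data: 99 well-behaved responses plus one extreme value.
df <- data.frame(tag = c(seq(-1, 1, length.out = 99), 50), x = seq_len(100))

model_type <- guess_model_type(df$tag)         # many unique values: "regression"
clean <- drop_outliers(df)                     # the row with tag = 50 is removed
parts <- split_train_test(clean, split = 0.7)  # train and test data frames
```

Keeping each rule in its own small function makes it easy to check how a given dataset would be affected before passing it to the modelling function.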