model_preprocess

Dataframe. Dataframe containing all your data, including
the dependent variable, labeled as <code>'tag'</code> by default. To use
a different column as the target, set the <code>y</code> parameter.

df

Character. Column name for the dependent variable.

y

Character vector. Columns the model should be forced to ignore.

ignore

Character. If needed, the name of the column in <code>df</code> that holds
'test' and 'train' values used to split the data.

train_test

Numeric. Value between 0 and 1 used to split the data into train/test
datasets; the value is the share assigned to the training set. Set it to 1
to train with all available data and test with the same data
(cross-validation will still be used when training). If <code>train_test</code>
is set, this value will be overwritten with the actual split rate.

split
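As a rough illustration of what a <code>split</code> value of 0.7 means, the base R sketch below samples 70% of the rows for training and keeps the rest for testing. The toy data frame and variable names are made up for the example, and this is not the package's internal code.

```r
# Hypothetical toy data; 'tag' mirrors the default target column name
set.seed(0)
df <- data.frame(x = rnorm(100), tag = rnorm(100))

split <- 0.7
# Sample row indices for the training set
train_idx <- sample(seq_len(nrow(df)), size = round(split * nrow(df)))
train <- df[train_idx, ]
test <- df[-train_idx, ]

# Realised split rate (share of rows that ended up in the training set)
split_rate <- nrow(train) / nrow(df)
```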

Column with observation weights. Giving an observation a
weight of zero is equivalent to excluding it from the dataset; giving an
observation a relative weight of 2 is equivalent to repeating that
row twice. Negative weights are not allowed.

weight

Value. Which value of the target is the positive class? If
set to <code>'auto'</code>, the target with the largest <code>mean(score)</code> will be
selected. Change the value to override this choice. Only used for binary
classification models.

target

Boolean. Auto-balance train dataset with under-sampling?

balance
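Under-sampling can be sketched in base R as follows: every class is reduced to the size of the smallest one. The data and object names are hypothetical, and the package may implement balancing differently.

```r
# Hypothetical imbalanced data: 80 rows of class "a", 20 of class "b"
set.seed(0)
df <- data.frame(x = rnorm(100), tag = rep(c("a", "b"), times = c(80, 20)))

# Size of the smallest class
n_min <- min(table(df$tag))

# Draw n_min rows at random from each class, then stack the samples
groups <- split(df, df$tag)
balanced <- do.call(
  rbind,
  lapply(groups, function(g) g[sample(nrow(g), n_min), ])
)

table(balanced$tag)
```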

Boolean. Fill <code>NA</code> values with MICE?

impute

Boolean/Numeric. Remove <code>y</code>'s outliers from the dataset?
Removes values that are farther than n standard deviations from
the dependent variable's mean (Z-score). Set to <code>TRUE</code> for the default (3)
or to a number to use a different multiplier.

no_outliers
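A minimal base R sketch of a Z-score filter with a multiplier of 3 is shown here. The toy data and names are assumptions for illustration only, not the package's own implementation.

```r
# Hypothetical data: 99 standard-normal values plus one obvious outlier
set.seed(0)
df <- data.frame(tag = c(rnorm(99), 50))

n <- 3  # number of standard deviations allowed around the mean
z <- (df$tag - mean(df$tag)) / sd(df$tag)

# Keep only rows whose target lies within n standard deviations
clean <- df[abs(z) <= n, , drop = FALSE]
nrow(clean)
```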

Boolean. Keep only unique row observations for training data?

unique_train

Boolean. Using the base function scale, do you wish
to center and/or scale all numerical values?

center

scale
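For reference, this is how the base <code>scale</code> function centers and scales numeric columns while leaving other columns untouched. The data frame is a made-up example, and how the package applies <code>scale</code> internally is not shown here.

```r
# Hypothetical data with two numeric columns and one character column
df <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(10, 20, 30, 40),
  g = c("x", "y", "x", "y")
)

# Identify numeric columns and standardise only those
num_cols <- vapply(df, is.numeric, logical(1))
scaled <- df
scaled[num_cols] <- as.data.frame(scale(df[num_cols], center = TRUE, scale = TRUE))

# After centering and scaling, each numeric column has mean 0 and sd 1
colMeans(scaled[num_cols])
```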

Integer. Threshold for choosing between classification and regression
models: the number of unique values in <code>'tag'</code> is compared against it
(more than the threshold: regression; fewer: classification).

thresh
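The rule can be pictured with a small helper like the one below. The function name <code>pick_model_type</code> is hypothetical, and treating a count exactly equal to the threshold as classification is an assumption, since the description does not cover that case.

```r
# Hypothetical helper: choose a model type from the number of unique target values
pick_model_type <- function(tag, thresh = 10) {
  n_unique <- length(unique(tag))
  # Assumption: a count equal to thresh falls on the classification side
  if (n_unique > thresh) "regression" else "classification"
}

pick_model_type(c("yes", "no", "yes"))  # 2 unique values
pick_model_type(seq(0.5, 20, by = 0.5)) # 40 unique values
```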

Integer. Set a seed for reproducibility. AutoML can only
guarantee reproducibility if <code>max_models</code> is used, because <code>max_time</code> is
resource-limited.

seed

Boolean. Quiet all messages, warnings, recommendations?

quiet

Pre-process your data before training a model. This is the step that
precedes training in the <code>h2o_automl()</code> function's pipeline. It is
available on its own so that the pre-processed data can be used with any
other framework, library, or custom algorithm.

Auxiliary package for better/faster analytics, visualization, data mining, and machine learning
tasks. With a wide variety of family functions, like Machine Learning, Data Wrangling,
Exploratory, and Scrapper, it helps the analyst or data scientist to get quick and robust
results, without the need of repetitive coding or extensive programming skills.

model_preprocess: Automate Data Preprocess for Modeling
