Pre-process your data before training a model. This is the step that precedes modelling in the h2o_automl() function's pipeline. It can also be used on its own when you want to use any other framework, library, or custom algorithm.
model_preprocess(
df,
y = "tag",
ignore = NULL,
train_test = NA,
split = 0.7,
weight = NULL,
target = "auto",
balance = FALSE,
impute = FALSE,
no_outliers = TRUE,
unique_train = TRUE,
center = FALSE,
scale = FALSE,
thresh = 10,
seed = 0,
quiet = FALSE
)
df: Dataframe. Dataframe containing all your data, including the dependent variable labeled as 'tag'. If you want to define which variable should be used instead, use the y parameter.
y: Character. Column name for the dependent variable.
ignore: Character vector. Columns the model should be forced to ignore.
train_test: Character. If needed, name of the column in df with 'test' and 'train' values used to split the data.
split: Numeric. Value between 0 and 1 used to split the data into train/test datasets. The value is the share assigned to the training set. Set it to 1 to train with all available data and test with the same data (cross-validation will still be used when training). If train_test is set, the value will be overwritten with the real split rate.
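As a rough illustration of what a split value of 0.7 means, the base R sketch below samples 70% of the row indices for training and leaves the rest for testing. It is not the function's internal code; the toy data frame and the names split_rate, train and test are assumptions made for this example.

```r
# Toy data with 10 rows (hypothetical, for illustration only)
toy <- data.frame(x = 1:10, tag = rep(c("a", "b"), 5))
split_rate <- 0.7  # share of rows assigned to the training set

set.seed(0)  # fixed seed so the same rows are drawn on every run
n_train <- round(split_rate * nrow(toy))
train_index <- sample(seq_len(nrow(toy)), size = n_train)

train <- toy[train_index, ]   # rows used for training
test <- toy[-train_index, ]   # remaining rows used for testing
```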
weight: Column with observation weights. Giving an observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed.
target: Value. Which is your target positive value? If set to 'auto', the target with the largest mean(score) will be selected. Change the value to overwrite. Only used for binary categorical models.
balance: Boolean. Auto-balance the train dataset with under-sampling?
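Under-sampling can be sketched in base R as follows: every class is reduced to the size of the smallest class by random sampling. This is only an illustration of the general technique under assumed toy data, not the package's own implementation.

```r
# Toy imbalanced data: 8 "yes" rows and 2 "no" rows (hypothetical)
toy <- data.frame(x = 1:10, tag = c(rep("yes", 8), rep("no", 2)))

set.seed(0)  # fixed seed for a reproducible draw
n_min <- min(table(toy$tag))  # size of the smallest class

# Row positions grouped by class
rows_by_class <- split(seq_len(nrow(toy)), toy$tag)

# Draw n_min rows from each class (sample.int avoids the
# single-element pitfall of sample())
keep <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), n_min)]
}))

balanced <- toy[sort(keep), ]
```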
impute: Boolean. Fill NA values with MICE?
no_outliers: Boolean/Numeric. Remove y's outliers from the dataset? Will remove those values that are farther than n standard deviations from the dependent variable's mean (Z-score). Set to TRUE for the default (3) or to a numeric value to set a different multiplier.
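The Z-score rule described for no_outliers can be sketched in base R like this. The vector y and the name n_sd are assumptions made for the example; the sketch shows the rule itself, not the function's internal code.

```r
# Toy target: nineteen typical values plus one extreme value (hypothetical)
y <- c(rep(c(9, 10, 11), length.out = 19), 100)
n_sd <- 3  # multiplier; mirrors the default of 3 described above

z <- (y - mean(y)) / sd(y)  # Z-score of each observation
keep <- abs(z) <= n_sd      # TRUE for values within n_sd standard deviations
y_clean <- y[keep]          # the extreme value is dropped
```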
unique_train: Boolean. Keep only unique row observations for training data?
center, scale: Boolean. Using the base function scale, do you wish to center and/or scale all numerical values?
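For reference, this is what centering and scaling with the base function scale() looks like on a small numeric data frame. The toy columns are assumptions made for the example.

```r
# Toy numeric data (hypothetical)
toy <- data.frame(age = c(20, 30, 40), fare = c(10, 20, 60))

# center = TRUE subtracts each column's mean;
# scale = TRUE divides each column by its standard deviation
scaled <- as.data.frame(scale(toy, center = TRUE, scale = TRUE))
```

After this transformation each column has mean 0 and standard deviation 1.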
thresh: Integer. Threshold for selecting binary or regression models: this number is the threshold of unique values we should have in 'tag' (more than: regression; less than: classification).
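The threshold rule can be sketched as a small helper. The function guess_model_type() and its return labels are hypothetical and exist only for this example. The text above does not say what happens when the number of unique values equals the threshold exactly, so treating that case as classification is an assumption.

```r
# Hypothetical helper: choose a model type from the number of unique
# values in the target, following the rule described above
guess_model_type <- function(y, thresh = 10) {
  n_unique <- length(unique(y))
  # Assumption: a count equal to thresh is treated as classification
  if (n_unique > thresh) "Regression" else "Classification"
}

guess_model_type(c(0, 1, 0, 1))  # 2 unique values
guess_model_type(seq(1, 100))    # 100 unique values
```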
seed: Integer. Set a seed for reproducibility. AutoML can only guarantee reproducibility if max_models is used because max_time is resource limited.
quiet: Boolean. Quiet all messages, warnings, recommendations?
List. Contains the original data.frame df, an index identifying which observations will be part of the train dataset (train_index), and which model type should be used (model_type).
Other Machine Learning: ROC(), conf_mat(), export_results(), gain_lift(), h2o_automl(), h2o_predict_API(), h2o_predict_MOJO(), h2o_predict_binary(), h2o_predict_model(), h2o_selectmodel(), impute(), iter_seeds(), lasso_vars(), model_metrics(), msplit()
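# NOT RUN {
library(lares)
data(dft) # Titanic dataset
# Binary target, balancing the training data with under-sampling
model_preprocess(dft, "Survived", balance = TRUE)
# Numeric target, 50/50 train/test split, scaled numerical values
model_preprocess(dft, "Fare", split = 0.5, scale = TRUE)
# Multi-class target, ignoring two columns
model_preprocess(dft, "Pclass", ignore = c("Fare", "Cabin"))
# Same target, without messages
model_preprocess(dft, "Pclass", quiet = TRUE)
# }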