Trains an Isolation Forest model
h2o.isolationForest(
training_frame,
x,
model_id = NULL,
score_each_iteration = FALSE,
score_tree_interval = 0,
ignore_const_cols = TRUE,
ntrees = 50,
max_depth = 8,
min_rows = 1,
max_runtime_secs = 0,
seed = -1,
build_tree_one_node = FALSE,
mtries = -1,
sample_size = 256,
sample_rate = -1,
col_sample_rate_change_per_level = 1,
col_sample_rate_per_tree = 1,
categorical_encoding = c("AUTO", "Enum", "OneHotInternal", "OneHotExplicit",
"Binary", "Eigen", "LabelEncoder", "SortByResponse", "EnumLimited"),
stopping_rounds = 0,
stopping_metric = c("AUTO", "anomaly_score"),
stopping_tolerance = 0.01,
export_checkpoints_dir = NULL,
contamination = -1,
validation_frame = NULL,
validation_response_column = NULL
)
Id of the training data frame.
A vector containing the character
names of the predictors in the model.
Destination id for this model; auto-generated if not specified.
Logical
. Whether to score during each iteration of model training. Defaults to FALSE.
Score the model after every so many trees. Disabled if set to 0. Defaults to 0.
Logical
. Ignore constant columns. Defaults to TRUE.
Number of trees. Defaults to 50.
Maximum tree depth (0 for unlimited). Defaults to 8.
Fewest allowed (weighted) observations in a leaf. Defaults to 1.
Maximum allowed runtime in seconds for model training. Use 0 to disable. Defaults to 0.
Seed for random numbers (affects certain parts of the algo that are stochastic and those might or might not be enabled by default). Defaults to -1 (time-based random number).
Logical
. Run on one node only; no network overhead but fewer cpus used. Suitable for small datasets.
Defaults to FALSE.
Number of variables randomly sampled as candidates at each split. If set to -1, defaults (number of predictors)/3. Defaults to -1.
Number of randomly sampled observations used to train each Isolation Forest tree. Only one of parameters sample_size and sample_rate should be defined. If sample_rate is defined, sample_size will be ignored. Defaults to 256.
Rate of randomly sampled observations used to train each Isolation Forest tree. Needs to be in range from 0.0 to 1.0. If set to -1, sample_rate is disabled and sample_size will be used instead. Defaults to -1.
Relative change of the column sampling rate for every level (must be > 0.0 and <= 2.0) Defaults to 1.
Column sample rate per tree (from 0.0 to 1.0) Defaults to 1.
Encoding scheme for categorical features Must be one of: "AUTO", "Enum", "OneHotInternal", "OneHotExplicit", "Binary", "Eigen", "LabelEncoder", "SortByResponse", "EnumLimited". Defaults to AUTO.
Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable) Defaults to 0.
Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anonomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client. Must be one of: "AUTO", "anomaly_score". Defaults to AUTO.
Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much) Defaults to 0.01.
Automatically export generated models to this directory.
Contamination ratio - the proportion of anomalies in the input dataset. If undefined (-1) the predict function will not mark observations as anomalies and only anomaly score will be returned. Defaults to -1 (undefined). Defaults to -1.
Id of the validation data frame.
(experimental) Name of the response column in the validation frame. Response column should be binary and indicate not anomaly/anomaly.
if (FALSE) {
library(h2o)
h2o.init()
# Import the cars dataset
f <- "https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv"
cars <- h2o.importFile(f)
# Set the predictors
predictors <- c("displacement", "power", "weight", "acceleration", "year")
# Train the IF model
cars_if <- h2o.isolationForest(x = predictors, training_frame = cars,
seed = 1234, stopping_metric = "anomaly_score",
stopping_rounds = 3, stopping_tolerance = 0.1)
}
Run the code above in your browser using DataLab