solitude (version 1.1.3)

isolationForest: Fit an Isolation Forest

Description

The 'solitude' class implements the isolation forest method introduced in the paper Isolation-Based Anomaly Detection (Liu, Ting and Zhou <doi:10.1145/2133360.2133363>). The extremely randomized trees (extratrees) required to build the isolation forest are grown using the ranger function from the ranger package.

Arguments

Design

$new() initiates a new 'solitude' object. The possible arguments are:

  • sample_size: (positive integer, default = 256) Number of observations in the dataset to be used to build each tree in the forest

  • num_trees: (positive integer, default = 100) Number of trees to be built in the forest

  • replace: (boolean, default = FALSE) Whether the sample of observations should be chosen with replacement when sample_size is less than the number of observations in the dataset

  • seed: (positive integer, default = 101) Random seed for the forest

  • nproc: (NULL or a positive integer, default: NULL, means use all resources) Number of parallel threads to be used by ranger

  • respect_unordered_factors: (string, default: "partition") See respect.unordered.factors argument in ranger

  • max_depth: (positive number, default: ceiling(log2(sample_size))) See max.depth argument in ranger
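As a quick check of the max_depth default, the expression can be evaluated in base R. This is only a sketch: the helper name default_max_depth is hypothetical, and the sample size of 1000 is an illustrative value, not a package default.

```r
# Hypothetical helper that mirrors the documented default expression
default_max_depth <- function(sample_size) ceiling(log2(sample_size))

default_max_depth(256)   # 8, the depth limit for the default sample_size
default_max_depth(1000)  # 10, since log2(1000) is just under 10
```

A larger sample_size therefore raises the default depth limit only logarithmically.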

$fit() fits an isolation forest for the given data frame or sparse matrix, computes the depths of the terminal nodes of each tree, and stores the anomaly scores and average depth values in the $scores object as a data.table

$predict() returns anomaly scores for new data as a data.table
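The anomaly scores are derived from average depths. The sketch below, in base R, follows the score normalization described in the cited Liu, Ting and Zhou paper. The function names c_factor and anomaly_score are hypothetical, and whether solitude computes its scores in exactly this way is an assumption.

```r
# Normalizing constant c(n) from the paper: the average path length of an
# unsuccessful binary search tree lookup over n points
c_factor <- function(n) {
  if (n <= 1) return(0)
  if (n == 2) return(1)
  harmonic <- log(n - 1) + 0.5772156649  # approximation of H(n - 1)
  2 * harmonic - 2 * (n - 1) / n
}

# Score in (0, 1): shallow average depths (easily isolated points) score
# close to 1, and an average depth equal to c(n) scores exactly 0.5
anomaly_score <- function(average_depth, sample_size) {
  2^(-average_depth / c_factor(sample_size))
}

# Illustrative average depths for a forest built with sample_size = 256
anomaly_score(c(3, 6, 10), sample_size = 256)
```

Under this formula, observations that are isolated after only a few splits receive higher scores, which is why the example further down sorts by descending anomaly_score.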

Methods

Public methods

Method new()

Usage

isolationForest$new(
  sample_size = 256,
  num_trees = 100,
  replace = FALSE,
  seed = 101,
  nproc = NULL,
  respect_unordered_factors = NULL,
  max_depth = ceiling(log2(sample_size))
)

Method fit()

Usage

isolationForest$fit(dataset)

Method predict()

Usage

isolationForest$predict(data)

Method clone()

The objects of this class are cloneable with this method.

Usage

isolationForest$clone(deep = FALSE)

Arguments

deep

Whether to make a deep clone.

Details

  • Parallelization: ranger is parallelized and by default uses all available resources, which is the behavior when nproc is set to NULL. The process of obtaining the depths of terminal nodes (which is executed when $fit() is called) may be parallelized separately by setting up a future backend.
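A minimal sketch of setting up a future backend before fitting is shown here. It assumes the future package is installed; the choice of two multisession workers, the built-in iris data, and sample_size = 64 are illustrative assumptions rather than recommendations from the package.

```r
library("solitude")

# Assumption: two background R sessions; adjust to the machine at hand
future::plan(future::multisession, workers = 2)

# iris has 150 rows, so sample_size is kept below that
iso <- isolationForest$new(sample_size = 64)
iso$fit(iris[, 1:4])

# Return to sequential processing once fitting is done
future::plan(future::sequential)

head(iso$scores)
```

Whether a parallel backend pays off depends on the data size; for small datasets the overhead of starting workers may outweigh the gain.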

Examples

# NOT RUN {
library("solitude")
library("tidyverse")
library("mlbench")
# rsample and uwot are called via :: below and must also be installed

data(PimaIndiansDiabetes)
PimaIndiansDiabetes = as_tibble(PimaIndiansDiabetes)
PimaIndiansDiabetes

splitter   = PimaIndiansDiabetes %>%
  select(-diabetes) %>%
  rsample::initial_split(prop = 0.5)
pima_train = rsample::training(splitter)
pima_test  = rsample::testing(splitter)

iso = isolationForest$new()
iso$fit(pima_train)

scores_train = pima_train %>%
  iso$predict() %>%
  arrange(desc(anomaly_score))

scores_train

# uwot::umap() returns a matrix, so name its columns (not its elements)
umap_train = pima_train %>%
  scale() %>%
  uwot::umap() %>%
  magrittr::set_colnames(c("V1", "V2")) %>%
  as_tibble() %>%
  rowid_to_column() %>%
  left_join(scores_train, by = c("rowid" = "id"))

umap_train

umap_train %>%
  ggplot(aes(V1, V2)) +
  geom_point(aes(size = anomaly_score))

scores_test = pima_test %>%
  iso$predict() %>%
  arrange(desc(anomaly_score))

scores_test
# }