Isolation Forest is an algorithm originally developed for outlier detection that consists in splitting sub-samples of the data according to some attribute/feature/column at random. The idea is that, the rarer the observation, the more likely it is that a random uniform split on some feature would put outliers alone in one branch, and the fewer splits it will take to isolate an outlier observation like this. The concept is extended to splitting hyperplanes in the extended model (i.e. splitting by more than one column at a time), and to guided (not entirely random) splits in the SCiForest and FCF models that aim at isolating outliers faster and/or finding clustered outliers.
This version adds heuristics to handle missing data and categorical variables. Can be used to aproximate pairwise distances by checking the depth after which two observations become separated, and to approximate densities by fitting trees beyond balanced-tree limit. Offers options to vary between randomized and deterministic splits too.
Important: The default parameters in this software do not correspond to the suggested parameters in any of the references (see section "Matching models from references"). In particular, the following default values are likely to cause huge differences when compared to the defaults in other software: `ndim`, `sample_size`, `ntrees`. The defaults here are nevertheless more likely to result in better models. In order to mimic the Python library "scikit-learn" for example, one would need to pass `ndim=1`, `sample_size=256`, `ntrees=100`, `missing_action="fail"`, `nthreads=1`.
Note that the default parameters will not scale to large datasets. In particular, if the amount of data is large, it's suggested to set a smaller sample size for each tree (parameter `sample_size`), and to fit fewer of them (parameter `ntrees`). As well, the default option for `missing_action` might slow things down significantly (see below for details). These defaults can also result in very big model sizes in memory and as serialized files (e.g. models that weight over 10GB) when the number of rows in the data is large. Using fewer trees, smaller sample sizes, and shallower trees can help to reduce model sizes if that becomes a problem.
The model offers many tunable parameters (see reference [11] for a comparison). The most likely candidate to tune is `prob_pick_pooled_gain`, for which higher values tend to result in a better ability to flag outliers in multimodal datasets, at the expense of poorer generalizability to inputs with values outside the variables' ranges to which the model was fit (see plots generated from the examples for a better idea of the difference). The next candidate to tune is `sample_size` - the default is to use all rows, but in some datasets introducing sub-sampling can help, especially for the single-variable model. In smaller datasets, one might also want to experiment with `weigh_by_kurtosis` and perhaps lower `ndim`.If using `prob_pick_pooled_gain`, models are likely to benefit from deeper trees (controlled by `max_depth`), but using large samples and/or deeper trees can result in significantly slower model fitting and predictions - in such cases, using `min_gain` (with a value like 0.25) with `max_depth=NULL` can offer a better speed/performance trade-off than changing `max_depth`.
If the data has categorical variables and these are more important important for determining outlierness compared to numerical columns, one might want to experiment with `ndim=1`, `categ_split_type="single_categ"`, and `scoring_metric="density"`.
For small datasets, one might also want to experiment with `ndim=1`, `scoring_metric="adj_depth"` and `penalize_range=TRUE`.
isolation.forest(
data,
sample_size = min(nrow(data), 10000L),
ntrees = 500,
ndim = min(3, ncol(data)),
ntry = 1,
categ_cols = NULL,
max_depth = ceiling(log2(sample_size)),
ncols_per_tree = ncol(data),
prob_pick_pooled_gain = 0,
prob_pick_avg_gain = 0,
prob_pick_full_gain = 0,
prob_pick_dens = 0,
prob_pick_col_by_range = 0,
prob_pick_col_by_var = 0,
prob_pick_col_by_kurt = 0,
min_gain = 0,
missing_action = ifelse(ndim > 1, "impute", "divide"),
new_categ_action = ifelse(ndim > 1, "impute", "weighted"),
categ_split_type = ifelse(ndim > 1, "subset", "single_categ"),
all_perm = FALSE,
coef_by_prop = FALSE,
recode_categ = FALSE,
weights_as_sample_prob = TRUE,
sample_with_replacement = FALSE,
penalize_range = FALSE,
standardize_data = TRUE,
scoring_metric = "depth",
fast_bratio = TRUE,
weigh_by_kurtosis = FALSE,
coefs = "uniform",
assume_full_distr = TRUE,
build_imputer = FALSE,
output_imputations = FALSE,
min_imp_obs = 3,
depth_imp = "higher",
weigh_imp_rows = "inverse",
output_score = FALSE,
output_dist = FALSE,
square_dist = FALSE,
sample_weights = NULL,
column_weights = NULL,
seed = 1,
use_long_double = FALSE,
nthreads = parallel::detectCores()
)
If passing `output_score` = `FALSE`, `output_dist` = `FALSE`, and `output_imputations` = `FALSE` (the defaults), will output an `isolation_forest` object from which `predict` method can then be called on new data.
If passing `TRUE` to any of the former options, will output a list with entries:
`model`: the `isolation_forest` object from which new predictions can be made.
`scores`: a vector with the outlier score for each inpuit observation (if passing `output_score` = `TRUE`).
`dist`: the distances (either a `dist` object or a square matrix), if passing `output_dist` = `TRUE`.
`imputed`: the input data with missing values imputed according to the model (if passing `output_imputations` = `TRUE`).
Data to which to fit the model. Supported inputs type are:
A `data.frame`, also accepted as `data.table` or `tibble`.
A `matrix` object from base R.
A sparse matrix in CSC format, either from package `Matrix` (class `dgCMatrix`) or from package `SparseM` (class `matrix.csc`).
If passing a `data.frame`, will assume that columns are:
Numerical, if they are of types `numeric`, `integer`, `Date`, `POSIXct`.
Categorical, if they are of type `character`, `factor`, `bool`. Note that, if factors are ordered, the order will be ignored here.
Other input and column types are not supported.
Sample size of the data sub-samples with which each binary tree will be built. Recommended value in references [1], [2], [3], [4] is 256, while the default value in the author's code in reference [5] is `nrow(data)`.
If passing `NULL`, will take the full number of rows in the data (no sub-sampling).
If passing a number between zero and one, will assume it means taking a sample size that represents that proportion of the rows in the data.
Note that sub-sampling is incompatible with `output_score`, `output_dist`, and `output_imputations`, and if any of those options is requested, `sample_size` will be overriden.
Hint: seeing a distribution of scores which is on average too far below 0.5 could mean that the model needs more trees and/or bigger samples to reach convergence (unless using non-random splits, in which case the distribution is likely to be centered around a much lower number), or that the distributions in the data are too skewed for random uniform splits.
Number of binary trees to build for the model. Recommended value in reference [1] is 100, while the default value in the author's code in reference [5] is 10. In general, the number of trees required for good results is higher when (a) there are many columns, (b) there are categorical variables, (c) categorical variables have many categories, (d) `ndim` is high, (e) `prob_pick_pooled_gain` is used, (f) `scoring_metric="density"` or `scoring_metric="boxed_density"` are used.
Hint: seeing a distribution of scores which is on average too far below 0.5 could mean that the model needs more trees and/or bigger samples to reach convergence (unless using non-random splits, in which case the distribution is likely to be centered around a much lower number), or that the distributions in the data are too skewed for random uniform splits.
Number of columns to combine to produce a split. If passing 1, will produce the single-variable model described in references [1] and [2], while if passing values greater than 1, will produce the extended model described in references [3] and [4]. Recommended value in reference [4] is 2, while [3] recommends a low value such as 2 or 3. Models with values higher than 1 are referred hereafter as the extended model (as in reference [3]).
If passing `NULL`, will assume it means using the full number of columns in the data.
Note that, when using `ndim>1` plus `standardize_data=TRUE`, the variables are standardized at each step as suggested in [4], which makes the models slightly different than in [3].
In general, when the data has categorical variables, models with `ndim=1` plus `categ_split_type="single_categ"` tend to produce better results, while models `ndim>1` tend to produce better results for numerical-only data, especially in the presence of missing values.
When using any of `prob_pick_pooled_gain`, `prob_pick_avg_gain`, `prob_pick_full_gain`, `prob_pick_dens`, how many variables (with `ndim=1`) or linear combinations (with `ndim>1`) to try for determining the best one according to gain.
Recommended value in reference [4] is 10 (with `prob_pick_avg_gain`, for outlier detection), while the recommended value in reference [11] is 1 (with `prob_pick_pooled_gain`, for outlier detection), and the recommended value in reference [9] is 10 to 20 (with `prob_pick_pooled_gain`, for missing value imputations).
Columns that hold categorical features, when the data is passed as a matrix (either dense or sparse). Can be passed as an integer vector (numeration starting at 1) denoting the indices of the columns that are categorical, or as a character vector denoting the names of the columns that are categorical, assuming that `data` has column names.
Categorical columns should contain only integer values with a continuous numeration starting at zero (not at one as is typical in R packages), and with negative values and NA/NaN taken as missing. The maximum categorical value should not exceed `.Machine$integer.max` (typically \(2^{31}-1\)).
This is ignored when the input is passed as a `data.frame` as then it will consider columns as categorical depending on their type/class (see the documentation for `data` for details).
Maximum depth of the binary trees to grow. By default, will limit it to the corresponding depth of a balanced binary tree with number of terminal nodes corresponding to the sub-sample size (the reason being that, if trying to detect outliers, an outlier will only be so if it turns out to be isolated with shorter average depth than usual, which corresponds to a balanced tree depth). When a terminal node has more than 1 observation, the remaining isolation depth for them is estimated assuming the data and splits are both uniformly random (separation depth follows a similar process with expected value calculated as in reference [6]). Default setting for references [1], [2], [3], [4] is the same as the default here, but it's recommended to pass higher values if using the model for purposes other than outlier detection.
If passing `NULL` or zero, will not limit the depth of the trees (that is, will grow them until each observation is isolated or until no further split is possible).
Note that models that use `prob_pick_pooled_gain` or `prob_pick_avg_gain` are likely to benefit from deeper trees (larger `max_depth`), but deeper trees can result in much slower model fitting and predictions.
If using pooled gain, one might want to substitute `max_depth` with `min_gain`.
Number of columns to use (have as potential candidates for splitting at each iteration) in each tree, somewhat similar to the 'mtry' parameter of random forests. In general, this is only relevant when using non-random splits and/or weighted column choices.
If passing a number between zero and one, will assume it means taking a sample size that represents that proportion of the columns in the data. Note that, if passing exactly 1, will assume it means taking 100% of the columns, rather than taking a single column.
If passing `NULL`, will use the full number of columns in the data.
his parameter indicates the probability of choosing the threshold on which to split a variable (with `ndim=1`) or a linear combination of variables (when using `ndim>1`) as the threshold that maximizes a pooled standard deviation gain criterion (see references [9] and [11]) on the same variable or linear combination, similarly to regression trees such as CART.
If using `ntry>1`, will try several variables or linear combinations thereof and choose the one in which the largest standardized gain can be achieved.
For categorical variables with `ndim=1`, will use shannon entropy instead (like in [7]).
Compared to a simple averaged gain, this tends to result in more evenly-divided splits and more clustered groups when they are smaller. Recommended to pass higher values when used for imputation of missing values. When used for outlier detection, datasets with multimodal distributions usually see better performance under this type of splits.
Note that, since this makes the trees more even and thus it takes more steps to produce isolated nodes, the resulting object will be heavier. When splits are not made according to any of `prob_pick_avg_gain`, `prob_pick_pooled_gain`, `prob_pick_full_gain`, `prob_pick_dens`, both the column and the split point are decided at random. Note that, if passing value 1 (100%) with no sub-sampling and using the single-variable model, every single tree will have the exact same splits.
Be aware that `penalize_range` can also have a large impact when using `prob_pick_pooled_gain`.
Under this option, models are likely to produce better results when increasing `max_depth`. Alternatively, one can also control the depth through `min_gain` (for which one might want to set `max_depth=NULL`).
Important detail: if using any of `prob_pick_avg_gain` or `prob_pick_pooled_gain`, `prob_pick_full_gain`, `prob_pick_dens`, the distribution of outlier scores is unlikely to be centered around 0.5.
This parameter indicates the probability of choosing the threshold on which to split a variable (with `ndim=1`) or a linear combination of variables (when using `ndim>1`) as the threshold that maximizes an averaged standard deviation gain criterion (see references [4] and [11]) on the same variable or linear combination.
If using `ntry>1`, will try several variables or linear combinations thereof and choose the one in which the largest standardized gain can be achieved.
For categorical variables with `ndim=1`, will take the expected standard deviation that would be gotten if the column were converted to numerical by assigning to each category a random number `~ Unif(0, 1)` and calculate gain with those assumed standard deviations.
Compared to a pooled gain, this tends to result in more cases in which a single observation or very few of them are put into one branch. Typically, datasets with outliers defined by extreme values in some column more or less independently of the rest, usually see better performance under this type of split. Recommended to use sub-samples (parameter `sample_size`) when passing this parameter. Note that, since this will create isolated nodes faster, the resulting object will be lighter (use less memory).
When splits are not made according to any of `prob_pick_avg_gain`, `prob_pick_pooled_gain`, `prob_pick_full_gain`, `prob_pick_dens`, both the column and the split point are decided at random. Default setting for [1], [2], [3] is zero, and default for [4] is 1. This is the randomization parameter that can be passed to the author's original code in [5], but note that the code in [5] suffers from a mathematical error in the calculation of running standard deviations, so the results from it might not match with this library's.
Be aware that, if passing a value of 1 (100%) with no sub-sampling and using the single-variable model, every single tree will have the exact same splits.
Under this option, models are likely to produce better results when increasing `max_depth`.
Important detail: if using either `prob_pick_avg_gain`, `prob_pick_pooled_gain`, `prob_pick_full_gain`, `prob_pick_dens`, the distribution of outlier scores is unlikely to be centered around 0.5.
This parameter indicates the probability of choosing the threshold on which to split a variable (with `ndim=1`) or a linear combination of variables (when using `ndim>1`) as the threshold that minimizes the pooled sums of variances of all columns (or a subset of them if using `ncols_per_tree`).
In general, this is much slower to evaluate than the other gain types, and does not tend to lead to better results. When using this option, one might want to use a different scoring metric (particulatly `"density"`, `"boxed_density2"` or `"boxed_ratio"`). Note that the calculations are all done through the (exact) sorted-indices approach, while is much slower than the (approximate) histogram approach used by other decision tree software.
Be aware that the data is not standardized in any way for the variance calculations, thus the scales of features will make a large difference under this option, which might not make it suitable for all types of data.
his option is not compatible with categorical data, and `min_gain` does not apply to it.
When splits are not made according to any of `prob_pick_avg_gain`, `prob_pick_pooled_gain`, `prob_pick_full_gain`, `prob_pick_dens`, both the column and the split point are decided at random. Default setting for references [1], [2], [3], [4] is zero.
This parameter indicates the probability of choosing the threshold on which to split a variable (with `ndim=1`) or a linear combination of variables (when using `ndim>1`) as the threshold that maximizes the pooled densities of the branch distributions.
The `min_gain` option does not apply to this type of splits.
When splits are not made according to any of `prob_pick_avg_gain`, `prob_pick_pooled_gain`, `prob_pick_full_gain`, `prob_pick_dens`, both the column and the split point are decided at random. Default setting for [1], [2], [3], [4] is zero.
When using `ndim=1`, this denotes the probability of choosing the column to split with a probability proportional to the range spanned by each column within a node as proposed in reference [12].
When using `ndim>1`, this denotes the probability of choosing columns to create a hyperplane with a probability proportional to the range spanned by each column within a node.
This option is not compatible with categorical data. If passing column weights, the effect will be multiplicative.
Be aware that the data is not standardized in any way for the range calculations, thus the scales of features will make a large difference under this option, which might not make it suitable for all types of data.
If there are infinite values, all columns having infinite values will be treated as having the same weight, and will be chosen before every other column with non-infinite values.
Note that the proposed RRCF model from [12] uses a different scoring metric for producing anomaly scores, while this library uses isolation depth regardless of how columns are chosen, thus results are likely to be different from those of other software implementations. Nevertheless, as explored in [11], isolation depth as a scoring metric typically provides better results than the "co-displacement" metric from [12] under these split types.
When using `ndim=1`, this denotes the probability of choosing the column to split with a probability proportional to the variance of each column within a node.
When using `ndim>1`, this denotes the probability of choosing columns to create a hyperplane with a probability proportional to the variance of each column within a node.
For categorical data, it will calculate the expected variance if the column were converted to numerical by assigning to each category a random number `~ Unif(0, 1)`, which depending on the number of categories and their distribution, produces numbers typically a bit smaller than standardized numerical variables.
Note that when using sparse matrices, the calculation of variance will rely on a procedure that uses sums of squares, which has less numerical precision than the calculation used for dense inputs, and as such, the results might differ slightly.
Be aware that this calculated variance is not standardized in any way, so the scales of features will make a large difference under this option.
If passing column weights, the effect will be multiplicative.
If passing a `missing_action` different than "fail", infinite values will be ignored for the variance calculation. Otherwise, all columns with infinite values will have the same probability and will be chosen before columns with non-infinite values.
When using `ndim=1`, this denotes the probability of choosing the column to split with a probability proportional to the kurtosis of each column within a node (unlike the option `weigh_by_kurtosis` which calculates this metric only at the root).
When using `ndim>1`, this denotes the probability of choosing columns to create a hyperplane with a probability proportional to the kurtosis of each column within a node.
For categorical data, it will calculate the expected kurtosis if the column were converted to numerical by assigning to each category a random number `~ Unif(0, 1)`.
Note that when using sparse matrices, the calculation of kurtosis will rely on a procedure that uses sums of squares and higher-power numbers, which has less numerical precision than the calculation used for dense inputs, and as such, the results might differ slightly.
If passing column weights, the effect will be multiplicative. This option is not compatible with `weigh_by_kurtosis`.
If passing a `missing_action` different than "fail", infinite values will be ignored for the kurtosis calculation. Otherwise, all columns with infinite values will have the same probability and will be chosen before columns with non-infinite values.
If using `missing_action="impute"`, the calculation of kurtosis will not use imputed values in order not to favor columns with missing values (which would increase kurtosis by all having the same central value).
Be aware that kurtosis can be a rather slow metric to calculate.
Minimum gain that a split threshold needs to produce in order to proceed with a split. Only used when the splits are decided by a variance gain criterion (`prob_pick_pooled_gain` or `prob_pick_avg_gain`, but not `prob_pick_full_gain` nor `prob_pick_dens`). If the highest possible gain in the evaluated splits at a node is below this threshold, that node becomes a terminal node.
This can be used as a more sophisticated depth control when using pooled gain (note that `max_depth` still applies on top of this heuristic).
How to handle missing data at both fitting and prediction time. Options are
`"divide"` (for the single-variable model only, recommended), which will follow both branches and combine the result with the weight given by the fraction of the data that went to each branch when fitting the model.
`"impute"`, which will assign observations to the branch with the most observations in the single-variable model, or fill in missing values with the median of each column of the sample from which the split was made in the extended model (recommended for it) (but note that the calculation of medians does not take into account sample weights when using `weights_as_sample_prob=FALSE`). When using `ndim=1`, gain calculations will use median-imputed values for missing data under this option.
`"fail"`, which will assume there are no missing values and will trigger undefined behavior if it encounters any.
In the extended model, infinite values will be treated as missing. Passing `"fail"` will produce faster fitting and prediction times along with decreased model object sizes.
Models from references [1], [2], [3], [4] correspond to `"fail"` here.
Typically, models with `ndim>1` are less affected by missing data that models with `ndim=1`.
What to do after splitting a categorical feature when new data that reaches that split has categories that the sub-sample from which the split was done did not have. Options are
`"weighted"` (for the single-variable model only, recommended), which will follow both branches and combine the result with weight given by the fraction of the data that went to each branch when fitting the model.
`"impute"` (for the extended model only, recommended) which will assign them the median value for that column that was added to the linear combination of features (but note that this median calculation does not use sample weights when using `weights_as_sample_prob=FALSE`).
`"smallest"`, which in the single-variable case will assign all observations with unseen categories in the split to the branch that had fewer observations when fitting the model, and in the extended case will assign them the coefficient of the least common category.
`"random"`, which will assing a branch (coefficient in the extended model) at random for each category beforehand, even if no observations had that category when fitting the model. Note that this can produce biased results when deciding splits by a gain criterion.
Important: under this option, if the model is fitted to a `data.frame`, when calling `predict` on new data which contains new factor levels (unseen in the data to which the model was fitted), they will be added to the model's state on-the-fly. This means that, if calling `predict` on data which has new categories, there might be inconsistencies in the results if predictions are done in parallel or if passing the same data in batches or with different row orders.
Ignored when passing `categ_split_type` = `"single_categ"`.
Whether to split categorical features by assigning sub-sets of them to each branch (by passing `"subset"` there), or by assigning a single category to a branch and the rest to the other branch (by passing `"single_categ"` here). For the extended model, whether to give each category a coefficient (`"subset"`), or only one while the rest get zero (`"single_categ"`).
When doing categorical variable splits by pooled gain with `ndim=1` (single-variable model), whether to consider all possible permutations of variables to assign to each branch or not. If `FALSE`, will sort the categories by their frequency and make a grouping in this sorted order. Note that the number of combinations evaluated (if `TRUE`) is the factorial of the number of present categories in a given column (minus 2). For averaged gain, the best split is always to put the second most-frequent category in a separate branch, so not evaluating all permutations (passing `FALSE`) will make it possible to select other splits that respect the sorted frequency order. Ignored when not using categorical variables or not doing splits by pooled gain or using `ndim>1`.
In the extended model, whether to sort the randomly-generated coefficients for categories according to their relative frequency in the tree node. This might provide better results when using categorical variables with too many categories, but is not recommended, and not reflective of real "categorical-ness". Ignored for the single-variable model (`ndim=1`) and/or when not using categorical variables.
Whether to re-encode categorical variables even in case they are already passed as factors. This is recommended as it will eliminate potentially redundant categorical levels if they have no observations, but if the categorical variables are already of type `factor` with only the levels that are present, it can be skipped for slightly faster fitting times. You'll likely want to pass `FALSE` here if merging several models into one through isotree.append.trees.
If passing sample (row) weights when fitting the model, whether to consider those weights as row sampling weights (i.e. the higher the weights, the more likely the observation will end up included in each tree sub-sample), or as distribution density weights (i.e. putting a weight of two is the same as if the row appeared twice, thus higher weight makes it less of an outlier, but does not give it a higher chance of being sampled if the data uses sub-sampling).
Whether to sample rows with replacement or not (not recommended). Note that distance calculations, if desired, don't work when there are duplicate rows.
This option is not compatible with `output_score`, `output_dist`, `output_imputations`.
Whether to penalize (add -1 to the terminal depth) observations at prediction time that have a value of the chosen split variable (linear combination in extended model) that falls outside of a pre-determined reasonable range in the data being split (given by `2 * range` in data and centered around the split point), as proposed in reference [4] and implemented in the authors' original code in reference [5]. Not used in single-variable model when splitting by categorical variables.
This option is not supported when using density-based outlier scoring metrics.
It's recommended to turn this off for faster predictions on sparse CSC matrices.
Note that this can make a very large difference in the results when using `prob_pick_pooled_gain`.
Be aware that this option can make the distribution of outlier scores a bit different (i.e. not centered around 0.5).
Whether to standardize the features at each node before creating alinear combination of them as suggested in [4]. This is ignored when using `ndim=1`.
Metric to use for determining outlier scores (see reference [13]). Options are:
"depth": Will use isolation depth as proposed in reference [1]. This is typically the safest choice and plays well with all model types offered by this library.
"density": Will set scores for each terminal node as the ratio between the fraction of points in the sub-sample that end up in that node and the fraction of the volume in the feature space which defines the node according to the splits that lead to it. If using `ndim=1`, for categorical variables, this is defined in terms of number of categories that go towards each side of the split divided by number of categories in the observations that reached that node.
The standardized outlier score from density for a given observation is calculated as the negative of the logarithm of the geometric mean from the per-tree densities, which unlike the standardized score produced from depth, is unbounded, but just like the standardized score from depth, has a natural threshold for definining outlierness, which in this case is zero is instead of 0.5. The non-standardized outlier score is calculated as the geometric mean, while the per-tree scores are calculated as the density values.
This might lead to better predictions when using `ndim=1`, particularly in the presence of categorical variables. Note however that using density requires more trees for convergence of scores (i.e. good results) compared to isolation-based metrics.
This option is incompatible with `penalize_range`.
"adj_depth": Will use an adjusted isolation depth that takes into account the number of points that go to each side of a given split vs. the fraction of the range of that feature that each side of the split occupies, by a metric as follows:
\(d = \frac{2}{(1 + \frac{1}{2 p}}\)
Where \(p\) is defined as:
\(p = \frac{n_s}{n_t} / \frac{r_s}{r_t}\)
With \(n_t\) being the number of points that reach a given node, \(n_s\) the number of points that are sent to a given side of the split/branch at that node, \(r_t\) being the range (maximum minus minimum) of the splitting feature or linear combination among the points that reached the node, and \(r_s\) being the range of the same feature or linear combination among the points that are sent to this same side of the split/branch. This makes each split add a number between zero and two to the isolation depth, with this number's probabilistic distribution being centered around 1 and thus the expected isolation depth remaing the same as in the original `"depth"` metric, but having more variability around the extremes.
Scores (standardized, non-standardized, per-tree) are aggregated in the same way as for `"depth"`.
This might lead to better predictions when using `ndim=1`, particularly in the prescence of categorical variables and for smaller datasets, and for smaller datasets, might make sense to combine it with `penalize_range=TRUE`.
"adj_density": Will use the same metric from `"adj_depth"`, but applied multiplicatively instead of additively. The expected value for this adjusted density is not strictly the same as for isolation, but using the expected isolation depth as standardizing criterion tends to produce similar standardized score distributions (centered around 0.5).
Scores (standardized, non-standardized, per-tree) are aggregated in the same way as for `"depth"`.
This option is incompatible with `penalize_range`.
"boxed_ratio": Will set the scores for each terminal node as the ratio between the volume of the boxed feature space for the node as defined by the smallest and largest values from the split conditions for each column (bounded by the variable ranges in the sample) and the variable ranges in the tree sample. If using `ndim=1`, for categorical variables this is defined in terms of number of categories. If using `ndim=>1`, this is defined in terms of the maximum achievable value for the splitting linear combination determined from the minimum and maximum values for each variable among the points in the sample, and as such, it has a rather different meaning compared to the score obtained with `ndim=1` - boxed ratio scores with `ndim>1` typically provide very poor quality results and this metric is thus not recommended to use in the extended model. With `ndim>1`, it also has a tendency of producing too small values which round to zero.
The standardized outlier score from boxed ratio for a given observation is calculated simply as the the average from the per-tree boxed ratios. This metric has a lower bound of zero and a theorical upper bound of one, but in practice the scores tend to be very small numbers close to zero, and its distribution across different datasets is rather unpredictable. In order to keep rankings comparable with the rest of the metrics, the non-standardized outlier scores are calculated as the negative of the average instead. The per-tree scores are calculated as the ratios.
This metric can be calculated in a fast-but-not-so-precise way, and in a low-but-precise way, which is controlled by parameter `fast_bratio`. Usually, both should give the same results, but in some fatasets, the fast way can lead to numerical inaccuracies due to roundoffs very close to zero.
This metric might lead to better predictions in datasets with many rows when using `ndim=1` and a relatively small `sample_size`. Note that more trees are required for convergence of scores when using this metric. In some datasets, this metric might result in very bad predictions, to the point that taking its inverse produces a much better ranking of outliers.
This option is incompatible with `penalize_range`.
"boxed_density2": Will set the score as the ratio between the fraction of points within the sample that end up in a given terminal node and the boxed ratio metric.
Aggregation of scores (standardized, non-standardized, per-tree) is done in the same way as for density, and it also has a natural threshold at zero for determining outliers and inliers.
This metric is typically usable with `ndim>1`, but tends to produce much bigger values compared to `ndim=1`.
Albeit unintuitively, in many datasets, one can usually get better results with metric `"boxed_density"` instead.
The calculation of this metric is also controlled by `fast_bratio`.
This option is incompatible with `penalize_range`.
"boxed_density": Will set the score as the ratio between the fraction of points within the sample that end up in a given terminal node and the ratio between the boxed volume of the feature space in the sample and the boxed volume of a node given by the split conditions (inverse as in `"boxed_density2"`). This metric does not have any theoretical or intuitive justification behind its existence, and it is perhaps ilogical to use it as a scoring metric, but tends to produce good results in some datasets.
The standardized outlier scores are defined as the negative of the geometric mean of this metric, while the non-standardized scores are the geometric mean, and the per-tree scores are simply the 'density' values.
The calculation of this metric is also controlled by `fast_bratio`.
This option is incompatible with `penalize_range`.
When using "boxed" metrics for scoring, whether to calculate them in a fast way through cumulative sum of logarithms of ratios after each split, or in a slower way as sum of logarithms of a single ratio per column for each terminal node.
Usually, both methods should give the same results, but in some datasets, particularly when variables have too small or too large ranges, the first method can be prone to numerical inaccuracies due to roundoff close to zero.
Note that this does not affect calculations for models with `ndim>1`, since given the split types, the calculation for them is different.
Whether to weigh each column according to the kurtosis obtained in the sub-sample that is selected for each tree as briefly proposed in reference [1]. Note that this is only done at the beginning of each tree sample. For categorical columns, will calculate expected kurtosis if the column were converted to numerical by assigning to each category a random number `~ Unif(0, 1)`.
Note that when using sparse matrices, the calculation of kurtosis will rely on a procedure that uses sums of squares and higher-power numbers, which has less numerical precision than the calculation used for dense inputs, and as such, the results might differ slightly.
Using this option makes the model more likely to pick the columns that have anomalous values when viewed as a 1-d distribution, and can bring a large improvement in some datasets.
This is intended as a cheap feature selector, while the parameter `prob_pick_col_by_kurt` provides the option to do this at each node in the tree for a different overall type of model.
If passing column weights or using weighted column choices proportional to some other metric (`prob_pick_col_by_range`, `prob_pick_col_by_var`), the effect will be multiplicative.
If passing `missing_action="fail"` and the data has infinite values, columns with rows having infinite values will get a weight of zero. If passing a different value for missing action, infinite values will be ignored in the kurtosis calculation.
If using `missing_action="impute"`, the calculation of kurtosis will not use imputed values in order not to favor columns with missing values (which would increase kurtosis by all having the same central value).
For the extended model, whether to sample random coefficients according to a normal distribution `~ N(0, 1)` (as proposed in reference [4]) or according to a uniform distribution `~ Unif(-1, +1)` as proposed in reference [3]. Ignored for the single-variable model. Note that, for categorical variables, the coefficients will be sampled ~ N (0,1) regardless - in order for both types of variables to have transformations in similar ranges (which will tend to boost the importance of categorical variables), pass `"uniform"` here.
When calculating pairwise distances (see reference [8]), whether to assume that the fitted model represents a full population distribution (will use a standardizing criterion assuming infinite sample as in reference [6], and the results of the similarity between two points at prediction time will not depend on the prescence of any third point that is similar to them, but will differ more compared to the pairwise distances between points from which the model was fit). If passing `FALSE`, will calculate pairwise distances as if the new observations at prediction time were added to the sample to which each tree was fit, which will make the distances between two points potentially vary according to other newly introduced points. This will not be assumed when the distances are calculated as the model is being fit (see documentation for parameter `output_dist`).
This was added for experimentation purposes only and it's not recommended to pass `FALSE`. Note that when calculating distances using a tree indexer (after calling isotree.build.indexer), there might be slight discrepancies between the numbers produced with or without the indexer due to what are considered "additional" observations in this calculation.
Whether to construct missing-value imputers so that later this same model could be used to impute missing values of new (or the same) observations. Be aware that this will significantly increase the memory requirements and serialized object sizes. Note that this is not related to 'missing_action' as missing values inside the model are treated differently and follow their own imputation or division strategy.
Whether to output imputed missing values for `data`. Passing `TRUE` here will force `build_imputer` to `TRUE`. Note that, for sparse matrix inputs, even though the output will be sparse, it will generate a dense representation of each row with missing values.
This is not supported when using sub-sampling, and if sub-sampling is specified, will override it using the full number of rows.
Minimum number of observations with which an imputation value can be produced. Ignored if passing `build_imputer` = `FALSE`.
How to weight observations according to their depth when used for imputing missing values. Passing `"higher"` will weigh observations higher the further down the tree (away from the root node) the terminal node is, while `"lower"` will do the opposite, and `"same"` will not modify the weights according to node depth in the tree. Implemented for testing purposes and not recommended to change from the default. Ignored when passing `build_imputer` = `FALSE`.
How to weight node sizes when used for imputing missing values. Passing `"inverse"` will weigh a node inversely proportional to the number of observations that end up there, while `"prop"` will weight them heavier the more observations there are, and `"flat"` will weigh all nodes the same in this regard regardless of how many observations end up there. Implemented for testing purposes and not recommended to change from the default. Ignored when passing `build_imputer` = `FALSE`.
Whether to output outlierness scores for the input data, which will be calculated as the model is being fit and it's thus faster. Cannot be done when using sub-samples of the data for each tree (in such case will later need to call the `predict` function on the same data). If using `penalize_range`, the results from this might differet a bit from those of `predict` called after.
This is not supported when using sub-sampling, and if sub-sampling is specified, will override it using the full number of rows.
Whether to output pairwise distances for the input data, which will be calculated as the model is being fit and it's thus faster. Cannot be done when using sub-samples of the data for each tree (in such case will later need to call the `predict` function on the same data). If using `penalize_range`, the results from this might differ a bit from those of `predict` called after.
This is not supported when using sub-sampling, and if sub-sampling is specified, will override it using the full number of rows.
Note that it might be much faster to calculate distances through a fitted model object with isotree.build.indexer instead or calculating them while fitting like this.
If passing `output_dist` = `TRUE`, whether to return a full square matrix or just the upper-triangular part, in which the entry for pair (i,j) with 1 <= i < j <= n is located at position p(i, j) = ((i - 1) * (n - i/2) + j - i).
Sample observation weights for each row of `data`, with higher weights indicating either higher sampling probability (i.e. the observation has a larger effect on the fitted model, if using sub-samples), or distribution density (i.e. if the weight is two, it has the same effect of including the same data point twice), according to parameter `weights_as_sample_prob`. Not supported when calculating pairwise distances while the model is being fit (done by passing `output_dist` = `TRUE`).
If `data` is a `data.frame` and the variable passed here matches to the name of a column in `data` (with or without enclosing `sample_weights` in quotes), it will assume the weights are to be taken as that column name.
Sampling weights for each column in `data`. Ignored when picking columns by deterministic criterion. If passing `NULL`, each column will have a uniform weight. If used along with kurtosis weights, the effect is multiplicative.
Note that, if passing a data.frame with both numeric and categorical columns, the column names must not be repeated, otherwise the column weights passed here will not end up matching. If passing a `data.frame` to `data`, will assume the column order is the same as in there, regardless of whether the entries passed to `column_weights` are named or not.
Seed that will be used for random number generation.
Whether to use 'long double' (extended precision) type for more precise calculations about standard deviations, means, ratios, weights, gain, and other potential aggregates. This makes such calculations accurate to a larger number of decimals (provided that the compiler used has wider long doubles than doubles) and it is highly recommended to use when the input data has a number of rows or columns exceeding \(2^{53}\) (an unlikely scenario), and also highly recommended to use when the input data has problematic scales (e.g. numbers that differ from each other by something like \(10^{-100}\) or columns that include values like \(10^{100}\), \(10^{-10}\), and \(10^{-100}\) and still need to be sensitive to a difference of \(10^{-10}\)), but will make the calculations slower, the more so in platforms in which 'long double' is a software-emulated type (e.g. Power8 platforms). Note that some platforms (most notably windows with the msvc compiler) do not make any difference between 'double' and 'long double'.
If 'long double' is not going to be used, the library can be compiled without support for it (making the library size smaller) by defining an environment variable `NO_LONG_DOUBLE` before installing this package (e.g. through `Sys.setenv("NO_LONG_DOUBLE" = "1")` before running the `install.packages` command). If R itself was compiled without 'long double' support, this library will follow suit and disable long double too.
This option is not available on Windows, due to lack of support in some compilers (e.g. msvc) and lack of thread-safety in the calculations in others (e.g. mingw).
Number of parallel threads to use. If passing a negative number, will use the maximum number of available threads in the system. Note that, the more threads, the more memory will be allocated, even if the thread does not end up being used. Be aware that most of the operations are bound by memory bandwidth, which means that adding more threads will not result in a linear speed-up. For some types of data (e.g. large sparse matrices with small sample sizes), adding more threads might result in only a very modest speed up (e.g. 1.5x faster with 4x more threads), even if all threads look fully utilized.
Shorthands for parameter combinations that match some of the references:
'iForest' (reference [1]): `ndim=1`, `sample_size=256`, `max_depth=8`, `ntrees=100`, `missing_action="fail"`.
'EIF' (reference [3]): `ndim=2`, `sample_size=256`, `max_depth=8`, `ntrees=100`, `missing_action="fail"`, `coefs="uniform"`, `standardize_data=False` (plus standardizing the data before passing it).
'SCiForest' (reference [4]): `ndim=2`, `sample_size=256`, `max_depth=8`, `ntrees=100`, `missing_action="fail"`, `coefs="normal"`, `ntry=10`, `prob_pick_avg_gain=1`, `penalize_range=True`. Might provide much better results with `max_depth=NULL` despite the reference's recommendation.
'FCF' (reference [11]): `ndim=2`, `sample_size=256`, `max_depth=NULL`, `ntrees=200`, `missing_action="fail"`, `coefs="normal"`, `ntry=1`, `prob_pick_pooled_gain=1`. Might provide similar or better results with `ndim=1` and/or sample size as low as 32. For the FCF model aimed at imputing missing values, might give better results with `ntry=10` or higher and much larger sample sizes.
'RRCF' (reference [12]): `ndim=1`, `prob_pick_col_by_range=1`, `sample_size=256` or more, `max_depth=NULL`, `ntrees=100` or more, `missing_action="fail"`. Note however that reference [12] proposed a different method for calculation of anomaly scores, while this library uses isolation depth just like for 'iForest', so results might differ significantly from those of other libraries. Nevertheless, experiments in reference [11] suggest that isolation depth might be a better scoring metric for this model.
If the model is built with `nthreads>1`, the prediction function predict.isolation_forest will use OpenMP for parallelization. In a linux setup, one usually has GNU's "gomp" as OpenMP as backend, which will hang when used in a forked process - for example, if one tries to call this prediction function from `RestRserve`, which uses process forking for parallelization, it will cause the whole application to freeze; and if using kubernetes on top of a different backend such as plumber, might cause it to run slower than needed or to hang too. A potential fix in these cases is to set the number of threads to 1 in the object (e.g. `model$nthreads <- 1L`), or to use a different version of this library compiled without OpenMP (requires manually altering the `Makevars` file), or to use a non-GNU OpenMP backend. This should not be an issue when using this library normally in e.g. an RStudio session.
In order to make model objects serializable (i.e. usable with `save`, `saveRDS`, and similar), these model objects keep serialized raw bytes from which their underlying heap-allocated C++ object (which does not survive serializations) can be reconstructed. For model serving, one would usually want to drop these serialized bytes after having loaded a model through `readRDS` or similar (note that reconstructing the C++ object will first require calling isotree.restore.handle, which is done automatically when calling `predict` and similar), as they can increase memory usage by a large amount. These redundant raw bytes can be dropped as follows: `model$cpp_obj$serialized <- NULL` (and an additional `model$cpp_obj$imp_ser <- NULL` when using `build_imputer=TRUE` and `model$cpp_obj$ind_ser <- NULL` when building a node indexer). After that, one might want to force garbage collection through `gc()`.
Usually, for serving purposes, one wants a setup as minimalistic as possible (e.g. smaller docker images). This library can be made smaller and faster to compile by disabling some features - particularly, the library will by default build with support for calculation of aggregated metrics (such as standard deviations) in 'long double' precision (an extended precision type), which is a functionality that's unlikely to get used (default is not to use this type as it is slower, and calculations done in the `predict` function do not use it for anything). Support for 'long double' can be disable at compile time by setting up an environment variable `NO_LONG_DOUBLE` before installing the package (e.g. by issuing command `Sys.setenv("NO_LONG_DOUBLE" = "1")` before `install.packages`).
If requesting outlier scores or depths or separation/distance while fitting the model and using multiple threads, there can be small differences in the predicted scores/depth/separation/distance between runs due to roundoff error.
Liu, Fei Tony, Kai Ming Ting, and Zhi-Hua Zhou. "Isolation forest." 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008.
Liu, Fei Tony, Kai Ming Ting, and Zhi-Hua Zhou. "Isolation-based anomaly detection." ACM Transactions on Knowledge Discovery from Data (TKDD) 6.1 (2012): 3.
Hariri, Sahand, Matias Carrasco Kind, and Robert J. Brunner. "Extended Isolation Forest." arXiv preprint arXiv:1811.02141 (2018).
Liu, Fei Tony, Kai Ming Ting, and Zhi-Hua Zhou. "On detecting clustered anomalies using SCiForest." Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, Berlin, Heidelberg, 2010.
Quinlan, J. Ross. "C4. 5: programs for machine learning." Elsevier, 2014.
Cortes, David. "Distance approximation using Isolation Forests." arXiv preprint arXiv:1910.12362 (2019).
Cortes, David. "Imputing missing values with unsupervised random trees." arXiv preprint arXiv:1911.06646 (2019).
Cortes, David. "Revisiting randomized choices in isolation forests." arXiv preprint arXiv:2110.13402 (2021).
Guha, Sudipto, et al. "Robust random cut forest based anomaly detection on streams." International conference on machine learning. PMLR, 2016.
Cortes, David. "Isolation forests: looking beyond tree depth." arXiv preprint arXiv:2111.11639 (2021).
Ting, Kai Ming, Yue Zhu, and Zhi-Hua Zhou. "Isolation kernel and its effect on SVM." Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018.
predict.isolation_forest, isotree.add.tree isotree.restore.handle
### Example 1: detect an obvious outlier
### (Random data from a standard normal distribution)
library(isotree)
set.seed(1)
m <- 100
n <- 2
X <- matrix(rnorm(m * n), nrow = m)
### Will now add obvious outlier point (3, 3) to the data
X <- rbind(X, c(3, 3))
### Fit a small isolation forest model
iso <- isolation.forest(X, ntrees = 10, nthreads = 1)
### Check which row has the highest outlier score
pred <- predict(iso, X)
cat("Point with highest outlier score: ",
X[which.max(pred), ], "\n")
### Example 2: plotting outlier regions
### This example shows predicted outlier score in a small
### grid, with a model fit to a bi-modal distribution. As can
### be seen, the extended model is able to detect high
### outlierness outside of both regions, without having false
### ghost regions of low-outlierness where there isn't any data
library(isotree)
oldpar <- par(mfrow = c(2, 2), mar = c(2.5,2.2,2,2.5))
### Randomly-generated data from different distributions
set.seed(1)
group1 <- data.frame(x = rnorm(1000, -1, .4),
y = rnorm(1000, -1, .2))
group2 <- data.frame(x = rnorm(1000, +1, .2),
y = rnorm(1000, +1, .4))
X = rbind(group1, group2)
### Add an obvious outlier which is within the 1d ranges
### (As an interesting test, remove the outlier and see what happens,
### or check how its score changes when using sub-sampling or
### changing the scoring metric for 'ndim=1')
X = rbind(X, c(-1, 1))
### Produce heatmaps
pts = seq(-3, 3, .1)
space_d <- expand.grid(x = pts, y = pts)
plot.space <- function(Z, ttl) {
image(pts, pts, matrix(Z, nrow = length(pts)),
col = rev(heat.colors(50)),
main = ttl, cex.main = 1.4,
xlim = c(-3, 3), ylim = c(-3, 3),
xlab = "", ylab = "")
par(new = TRUE)
plot(X, type = "p", xlim = c(-3, 3), ylim = c(-3, 3),
col = "#0000801A",
axes = FALSE, main = "",
xlab = "", ylab = "")
}
### Now try out different variations of the model
### Single-variable model
iso_simple = isolation.forest(
X, ndim=1,
ntrees=100,
nthreads=1,
penalize_range=FALSE,
prob_pick_pooled_gain=0,
prob_pick_avg_gain=0)
Z1 <- predict(iso_simple, space_d)
plot.space(Z1, "Isolation Forest")
### Extended model
iso_ext = isolation.forest(
X, ndim=2,
ntrees=100,
nthreads=1,
penalize_range=FALSE,
prob_pick_pooled_gain=0,
prob_pick_avg_gain=0)
Z2 <- predict(iso_ext, space_d)
plot.space(Z2, "Extended Isolation Forest")
### SCiForest
iso_sci = isolation.forest(
X, ndim=2, ntry=1,
coefs="normal",
ntrees=100,
nthreads=1,
penalize_range=TRUE,
prob_pick_pooled_gain=0,
prob_pick_avg_gain=1)
Z3 <- predict(iso_sci, space_d)
plot.space(Z3, "SCiForest")
### Fair-cut forest
iso_fcf = isolation.forest(
X, ndim=2,
ntrees=100,
nthreads=1,
penalize_range=FALSE,
prob_pick_pooled_gain=1,
prob_pick_avg_gain=0)
Z4 <- predict(iso_fcf, space_d)
plot.space(Z4, "Fair-Cut Forest")
par(oldpar)
### (As another interesting variation, try setting
### 'penalize_range=TRUE' for the last model)
### Example 3: calculating pairwise distances,
### with a short validation against euclidean dist.
library(isotree)
### Generate random data with 3 dimensions
set.seed(1)
m <- 100
n <- 3
X <- matrix(rnorm(m * n), nrow=m, ncol=n)
### Fit isolation forest model
iso <- isolation.forest(X, ntrees=100, nthreads=1)
### Calculate distances with the model
### (this can be accelerated with 'isotree.build.indexer')
D_iso <- predict(iso, X, type = "dist")
### Check that it correlates with euclidean distance
D_euc <- dist(X, method = "euclidean")
cat(sprintf("Correlation with euclidean distance: %f\n",
cor(D_euc, D_iso)))
### (Note that euclidean distance will never take
### any correlations between variables into account,
### which the isolation forest model can do)
# \donttest{
### Example 4: imputing missing values
### (requires package MASS)
library(isotree)
### Generate random data, set some values as NA
if (require("MASS")) {
set.seed(1)
S <- crossprod(matrix(rnorm(5 * 5), nrow = 5))
mu <- rnorm(5)
X <- MASS::mvrnorm(1000, mu, S)
X_na <- X
values_NA <- matrix(runif(1000 * 5) < .15, nrow = 1000)
X_na[values_NA] = NA
### Impute missing values with model
iso <- isolation.forest(X_na,
build_imputer = TRUE,
prob_pick_pooled_gain = 1,
ntry = 10,
nthreads = 1)
X_imputed <- predict(iso, X_na, type = "impute")
cat(sprintf("MSE for imputed values w/model: %f\n",
mean((X[values_NA] - X_imputed[values_NA])^2)))
### Compare against simple mean imputation
X_means <- apply(X, 2, mean)
X_imp_mean <- X_na
for (cl in 1:5)
X_imp_mean[values_NA[,cl], cl] <- X_means[cl]
cat(sprintf("MSE for imputed values w/means: %f\n",
mean((X[values_NA] - X_imp_mean[values_NA])^2)))
}
# }
# \donttest{
#### A more interesting example
#### (requires package outliertree)
### Compare outliers returned by these different methods,
### and see why some of the outliers returned by the
### isolation forest could be flagged as outliers
if (require("outliertree")) {
hypothyroid <- outliertree::hypothyroid
iso <- isolation.forest(hypothyroid, nthreads=1)
pred_iso <- predict(iso, hypothyroid)
otree <- outliertree::outlier.tree(
hypothyroid,
z_outlier = 6,
pct_outliers = 0.02,
outliers_print = 20,
nthreads = 1)
### Now compare against the top
### outliers from isolation forest
head(hypothyroid[order(-pred_iso), ], 20)
}
# }
Run the code above in your browser using DataLab