Learn R Programming

isotree (version 0.5.15)

isotree.to.sql: Generate SQL statements from Isolation Forest model

Description

Generate SQL statements - either separately per tree (the default), for a single tree if needed (if passing `tree`), or for all trees concatenated together (if passing `table_from`). Can also be made to output terminal node numbers (numeration starting at one).

Some important considerations:

  • Making predictions through SQL is much less efficient than from the model itself, as each terminal node will have to check all of the conditions that lead to it instead of passing observations down a tree.

  • If constructed with the default arguments, the model will not perform any sub-sampling, which can lead to very big trees. If it was fit to a large dataset, the generated SQL might consist of gigabytes of text, and might lay well beyond the character limit of commands accepted by SQL vendors.

  • The generated SQL statements will not include range penalizations, thus predictions might differ from calls to `predict` when using `penalize_range=TRUE`.

  • The generated SQL statements will only include handling of missing values when using `missing_action="impute"`. When using the single-variable model with categorical variables + subset splits, the rule buckets might be incomplete due to not including categories that were not present in a given node - this last point can be avoided by using `new_categ_action="smallest"`, `new_categ_action="random"`, or `missing_action="impute"` (in the latter case will treat them as missing, but the `predict` function might treat them differently).

  • The resulting statements will include all the tree conditions as-is, with no simplification. Thus, there might be lots of redundant conditions in a given terminal node (e.g. "X > 2" and "X > 1", the second of which is redundant).

  • If using `scoring_metric="density"` or `scoring_metric="boxed_ratio"` plus `output_tree_num=FALSE`, the outputs will correspond to the logarithm of the density rather than the density.

Usage

isotree.to.sql(
  model,
  enclose = "doublequotes",
  output_tree_num = FALSE,
  tree = NULL,
  table_from = NULL,
  select_as = "outlier_score",
  column_names = NULL,
  column_names_categ = NULL
)

Value

  • If passing neither `tree` nor `table_from`, will return a list of `character` objects, containing at each entry the SQL statement for the corresponding tree.

  • If passing `tree`, will return a single `character` object with the SQL statement representing that tree.

  • If passing `table_from`, will return a single `character` object with the full SQL select statement for the outlier score, selecting the columns from the table name passed under `table_from`.

Arguments

model

An Isolation Forest object as returned by isolation.forest.

enclose

With which symbols to enclose the column names in the select statement so as to make them SQL compatible in case they include characters like dots. Options are:

  • `"doublequotes"`, which will enclose them as `"column_name"` - this will work for e.g. PostgreSQL.

  • `"squarebraces"`, which will enclose them as `[column_name]` - this will work for e.g. SQL Server.

  • `"none"`, which will output the column names as-is (e.g. `column_name`)

output_tree_num

Whether to make the statements return the terminal node number instead of the isolation depth. The numeration will start at one.

tree

Tree for which to generate SQL statements. If passed, will generate the statements only for that single tree. If passing `NULL`, will generate statements for all trees in the model.

table_from

If passing this, will generate a single select statement for the outlier score from all trees, selecting the data from the table name passed here. In this case, will always output the outlier score, regardless of what is passed under `output_tree_num`.

select_as

Alias to give to the generated outlier score in the select statement. Ignored when not passing `table_from`.

column_names

Column names to use for the numeric columns. If not passed and the model was fit to a `data.frame`, will use the column names from that `data.frame`, which can be found under `model$metadata$cols_num`. If not passing it and the model was fit to data in a format other than `data.frame`, the columns will be named `column_N` in the resulting SQL statement. Note that the names will be taken verbatim - this function will not do any checks for whether they constitute valid SQL or not, and will not escape characters such as double quotation marks.

column_names_categ

Column names to use for the categorical columns. If not passed, will use the column names from the `data.frame` to which the model was fit. These can be found under `model$metadata$cols_cat`.

Examples

Run this code
library(isotree)
data(iris)
set.seed(1)
iso <- isolation.forest(iris, ntrees=2, sample_size=16, ndim=1, nthreads=1)
sql_forest <- isotree.to.sql(iso, table_from="my_iris_table")
cat(sql_forest)

Run the code above in your browser using DataLab