isotree (version 0.6.1-1)

isotree.to.json: Generate JSON representations of model trees

Description

Generates a JSON representation of either a single tree in the model, or of all the trees in the model.

The JSON for a given tree will consist of a sub-json/list for each node, where nodes are indexed by their number (base-1 indexing) as keys in these JSONs (note that they are strings, not numbers, in order to conform to JSON format).

Nodes will in turn consist of another map/list indicating whether they are terminal nodes or not, their score and terminal node index if terminal, or otherwise the split conditions, nodes to follow when the condition is or isn't met, and other aspects such as imputation values if applicable, acceptable ranges when using range penalizations, fraction of the data that went into the left node if recorded, among others.
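As a rough illustration of the layout described above, the following base R sketch builds a small hand-written tree as a named list and walks it from the root to a terminal node. Every field name used here (`is_terminal`, `score`, `column`, `threshold`, `if_true`, `if_false`) is hypothetical and chosen only for this example, as is the direction of the comparison; inspect an actual output of this function to see the keys the package produces.

```r
# Hand-written toy tree; all field names are hypothetical, not the package's.
toy_tree <- list(
  "1" = list(is_terminal = FALSE, column = "x1", threshold = 0.5,
             if_true = 2L, if_false = 3L),
  "2" = list(is_terminal = TRUE, score = 1.0),
  "3" = list(is_terminal = TRUE, score = 2.5)
)

# Follow the split conditions from the root (node "1") down to a terminal node.
walk_toy_tree <- function(tree, row) {
  node <- tree[["1"]]
  while (!node$is_terminal) {
    met <- row[[node$column]] <= node$threshold
    next_id <- if (met) node$if_true else node$if_false
    # Node keys are strings, so convert the node number before indexing.
    node <- tree[[as.character(next_id)]]
  }
  node$score
}

score_low  <- walk_toy_tree(toy_tree, list(x1 = 0.2))
score_high <- walk_toy_tree(toy_tree, list(x1 = 0.9))
```

Indexing with `as.character()` rather than with the integer itself avoids relying on the positional order of the list elements.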

Note that the JSON structure will be very different for models that have `ndim=1` than for models that have `ndim>1`. In the case of `ndim=1`, the conditions are based on the value of only one variable, but for `ndim>1`, they will consist of a linear combination of different columns (which is expressed as a list of JSONs with one entry per column that goes into the calculation). For numeric columns, for example, these will be expressed in the JSON by a coefficient for the given column and a centering that needs to be applied, with the score from that column being added as

\(\text{coef} \times (x - \text{centering})\)

and the imputation value being applied in replacement of this formula in the case of missing values for that column (depending on the model parameters). In the case of categorical columns, there might be either a different coefficient for each possible category (`categ_split_type="subset"`), or a single category that gets a non-zero coefficient while the others get zeros (`categ_split_type="single_categ"`).
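The per-column arithmetic for `ndim>1` can be sketched in base R as follows. This is only an illustration under assumptions: the function and argument names are made up for the example, the numbers are invented, and whether an imputation value is applied at all depends on the model parameters.

```r
# Contribution of one numeric column to the linear combination at a node.
# `coef`, `centering` and `imputation` are illustrative argument names.
numeric_contribution <- function(x, coef, centering, imputation) {
  ifelse(is.na(x), imputation, coef * (x - centering))
}

# Contribution of one categorical column: look up the coefficient
# assigned to each observed category (hypothetical named vector).
categ_contribution <- function(x, coefs) {
  unname(coefs[x])
}

num_part <- numeric_contribution(
  x = c(2.0, NA, -1.0), coef = 0.5, centering = 1.0, imputation = 0.0
)

# With a "subset"-style split every category has its own coefficient;
# with a "single_categ"-style split all but one would be zero.
categ_coefs <- c(a = 0.25, b = -0.5, c = 0.0)
cat_part <- categ_contribution(c("a", "c", "b"), categ_coefs)

# The value compared at the node is the sum over the participating columns.
combined <- num_part + cat_part
```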

The JSONs might contain redundant information in order to ease understanding of the model logic - for example, when using `ndim>1` and `categ_split_type="single_categ"`, the coefficient for the non-chosen categories will always be zero, but is nevertheless added to every node's JSON, even if not needed.

Usage

isotree.to.json(
  model,
  output_tree_num = FALSE,
  tree = NULL,
  column_names = NULL,
  column_names_categ = NULL,
  as_str = FALSE,
  nthreads = model$nthreads
)

Value

Either a list of lists (when passing `as_str=FALSE`) or a character vector (when passing `as_str=TRUE`), or a single such list or character element if passing `tree`.
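A minimal usage sketch, assuming the `isotree` package is installed (and `jsonlite` for the parsed-list output), might look like the following. The data is randomly generated and the parameter values are arbitrary choices for illustration.

```r
library(isotree)

set.seed(1)
X <- matrix(rnorm(200), ncol = 2)
model <- isolation.forest(X, ndim = 1, ntrees = 3, nthreads = 1)

# All trees, parsed into nested R lists (as_str = FALSE is the default).
all_trees <- isotree.to.json(model)

# A single tree, returned as a raw JSON string.
first_tree_json <- isotree.to.json(model, tree = 1, as_str = TRUE)
```

Passing `as_str = TRUE` can be convenient when the JSON is going to be written to a file or consumed by another program rather than inspected in R.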

Arguments

model

An Isolation Forest object as returned by isolation.forest.

output_tree_num

Whether to make the statements / outputs return the terminal node number instead of the isolation depth. The numbering will start at one.

tree

Tree for which to generate the JSON. If passed, will generate the output only for that single tree. If passing `NULL`, will generate outputs for all trees in the model.

column_names

Column names to use for the numeric columns. If not passed and the model was fit to a `data.frame`, will use the column names from that `data.frame`, which can be found under `model$metadata$cols_num`. If not passing it and the model was fit to data in a format other than `data.frame`, the columns will be named `column_N` in the resulting output. Note that the names will be taken verbatim - this function will not do any checks for e.g. whether they constitute valid SQL or not when exporting to SQL, and will not escape characters such as double quotation marks when exporting to SQL.

column_names_categ

Column names to use for the categorical columns. If not passed, will use the column names from the `data.frame` to which the model was fit. These can be found under `model$metadata$cols_cat`.

as_str

Whether to return the result as raw JSON strings (returned as R's character type) instead of being parsed into R lists (internally, it uses `jsonlite::fromJSON`).

nthreads

Number of parallel threads to use.

Details

  • If using `scoring_metric="density"` or `scoring_metric="boxed_ratio"` plus `output_tree_num=FALSE`, the outputs will correspond to the logarithm of the density rather than the density.