extract_ice: Internal function to extract data for individual conditional expectation plots.

Description

Computes data for individual conditional expectation plots and partial dependence plots for the model(s) in a familiarEnsemble object.
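As a rough illustration of what these data represent, the sketch below computes individual conditional expectation (ICE) curves and a partial dependence curve by hand for an ordinary linear model. It does not use familiar or extract_ice; the model, the mtcars dataset and the variable names are assumptions chosen only to keep the example self-contained.

```r
# Illustrative only: a plain linear model stands in for a familiarEnsemble.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Grid of values at which the feature of interest (wt) is evaluated.
grid <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 10L)

# ICE: for every sample, replace wt by each grid value and predict.
# The result has one row per sample and one column per grid value.
ice <- sapply(grid, function(value) {
  modified <- mtcars
  modified$wt <- value
  predict(fit, newdata = modified)
})

# Partial dependence: the average of the ICE curves at each grid value.
pd <- colMeans(ice)
```

Each row of ice is one ICE curve, and pd is the corresponding partial dependence curve.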

Usage

extract_ice(
  object,
  data,
  cl = NULL,
  features = NULL,
  feature_x_range = NULL,
  feature_y_range = NULL,
  n_sample_points = 50L,
  ensemble_method = waiver(),
  evaluation_times = waiver(),
  sample_limit = waiver(),
  detail_level = waiver(),
  estimation_type = waiver(),
  aggregate_results = waiver(),
  confidence_level = waiver(),
  bootstrap_ci_method = waiver(),
  is_pre_processed = FALSE,
  message_indent = 0L,
  verbose = FALSE,
  ...
)

Value

A data.table containing individual conditional expectation plot data.

Arguments

object

A familiarEnsemble object, which is an ensemble of one or more familiarModel objects.

data

A dataObject object, data.table or data.frame that constitutes the data that are assessed.

cl

Cluster created using the parallel package. This cluster is then used to speed up computation through parallelisation.

features

Name of the feature, or names of the two features, that are assessed simultaneously. By default NULL, which means that all features are assessed one-by-one.

feature_x_range

When one or two features are defined using features, feature_x_range can be used to set the range of values for the first feature. For numeric features, a vector of two values is assumed to indicate a range from which n_sample_points are uniformly sampled. A vector of more than two values is interpreted as is, i.e. these represent the values to be sampled. For categorical features, values should represent a (sub)set of available levels.
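The interpretation of a numeric range can be sketched as follows. The helper expand_feature_range is hypothetical and not part of familiar, and the sketch assumes that "uniformly sampled" means evenly spaced points between the two bounds; the package may sample differently.

```r
# Hypothetical helper, not part of familiar.
expand_feature_range <- function(feature_range, n_sample_points = 50L) {
  if (is.numeric(feature_range) && length(feature_range) == 2L) {
    # Two numeric values: treat them as bounds and sample evenly in between
    # (assumption about what "uniformly sampled" means).
    return(seq(
      from = min(feature_range),
      to = max(feature_range),
      length.out = n_sample_points
    ))
  }

  # More than two values, or categorical levels: use the values as provided.
  feature_range
}

sampled <- expand_feature_range(c(0, 10), n_sample_points = 5L)
as_is <- expand_feature_range(c(1, 2, 5))
levels_subset <- expand_feature_range(c("low", "high"))
```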

feature_y_range

As feature_x_range, but for the second feature in case two features are defined.

n_sample_points

Number of points used to sample continuous features.

ensemble_method

Method for ensembling predictions from models for the same sample. Available methods are:

median (default): Use the median of the predicted values as the ensemble value for a sample.
mean: Use the mean of the predicted values as the ensemble value for a sample.
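The difference between the two methods can be seen in a small sketch. The prediction matrix below is made up for illustration, with one row per sample and one column per model in the ensemble.

```r
# Made-up predictions: 2 samples (rows) predicted by 3 models (columns).
predictions <- matrix(
  c(0.1, 0.2, 0.9,
    0.4, 0.5, 0.6),
  nrow = 2L,
  byrow = TRUE
)

# Ensemble value per sample using the median of the model predictions.
ensemble_median <- apply(predictions, 1, median)

# Ensemble value per sample using the mean of the model predictions.
ensemble_mean <- rowMeans(predictions)
```

For the first sample, the median is less affected by the single outlying prediction of 0.9 than the mean is.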

evaluation_times

One or more time points that are used in the analysis of survival problems when data have to be assessed at a set time, e.g. calibration. If not provided explicitly, this parameter is read from the settings used at creation of the underlying familiarModel objects. Only used for survival outcomes.

sample_limit

(optional) Set the upper limit of the number of samples that are used during evaluation steps. Cannot be less than 20.

This setting can be specified per data element by providing a parameter value in a named list with data elements, e.g. list("sample_similarity"=100, "ice_data"=1000).

This parameter can be set for the following data elements: sample_similarity and ice_data.

detail_level

(optional) Sets the level at which results are computed and aggregated.

ensemble: Results are computed at the ensemble level, i.e. over all models in the ensemble. This means that, for example, bias-corrected estimates of model performance are assessed by creating (at least) 20 bootstraps and computing the model performance of the ensemble model for each bootstrap.
hybrid (default): Results are computed at the level of models in an ensemble. This means that, for example, bias-corrected estimates of model performance are directly computed using the models in the ensemble. If there are at least 20 trained models in the ensemble, performance is computed for each model, in contrast to ensemble where performance is computed for the ensemble of models. If there are fewer than 20 trained models in the ensemble, bootstraps are created so that at least 20 point estimates can be made.
model: Results are computed at the model level. This means that, for example, bias-corrected estimates of model performance are assessed by creating (at least) 20 bootstraps and computing the performance of the model for each bootstrap.

Note that each level of detail has a different interpretation for bootstrap confidence intervals. For ensemble and model these are the confidence intervals for the ensemble and an individual model, respectively. That is, the confidence interval describes the range where an estimate produced by a respective ensemble or model trained on a repeat of the experiment may be found with the probability of the confidence level. For hybrid, it represents the range where any single model trained on a repeat of the experiment may be found with the probability of the confidence level. By definition, confidence intervals obtained using hybrid are at least as wide as those for ensemble. hybrid offers the correct interpretation if the goal of the analysis is to assess the result of a single, unspecified, model.

hybrid is generally computationally less expensive than ensemble, which in turn is somewhat less expensive than model.

A non-default detail_level parameter can be specified for separate evaluation steps by providing a parameter value in a named list with data elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid"). This parameter can be set for the following data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, prediction_data and confusion_matrix.

estimation_type

(optional) Sets the type of estimation that should be possible. This has the following options:

point: Point estimates.
bias_correction or bc: Bias-corrected estimates. A bias-corrected estimate is computed from (at least) 20 point estimates, and familiar may bootstrap the data to create them.
bootstrap_confidence_interval or bci (default): Bias-corrected estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The number of point estimates required depends on the confidence_level parameter, and familiar may bootstrap the data to create them.

As with detail_level, a non-default estimation_type parameter can be specified for separate evaluation steps by providing a parameter value in a named list with data elements, e.g. list("auc_data"="bci", "model_performance"="point"). This parameter can be set for the following data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, and prediction_data.

aggregate_results

(optional) Flag that signifies whether results should be aggregated during evaluation. If estimation_type is bias_correction or bc, aggregation leads to a single bias-corrected estimate. If estimation_type is bootstrap_confidence_interval or bci, aggregation leads to a single bias-corrected estimate with lower and upper boundaries of the confidence interval. This has no effect if estimation_type is point.

The default value is TRUE, except when computing model performance metrics, as the default violin plot requires the underlying data.

As with detail_level and estimation_type, a non-default aggregate_results parameter can be specified for separate evaluation steps by providing a parameter value in a named list with data elements, e.g. list("auc_data"=TRUE, "model_performance"=FALSE). This parameter exists for the same elements as estimation_type.
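The named-list convention shared by detail_level, estimation_type and aggregate_results can be sketched as below. The helper resolve_setting is hypothetical and not part of familiar; in particular, falling back to a default for data elements that are missing from the list is an assumption, not documented behaviour.

```r
# Hypothetical helper, not part of familiar.
resolve_setting <- function(value, data_element, default) {
  if (is.list(value)) {
    # Named list: use the entry for this data element if it is present.
    if (data_element %in% names(value)) {
      return(value[[data_element]])
    }
    # Assumption: data elements that are not listed use the default.
    return(default)
  }

  # A single value applies to every data element.
  value
}

detail_level <- list("auc_data" = "ensemble", "model_performance" = "hybrid")

for_auc <- resolve_setting(detail_level, "auc_data", default = "hybrid")
for_ice <- resolve_setting(detail_level, "ice_data", default = "hybrid")
for_all <- resolve_setting("model", "ice_data", default = "hybrid")
```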

confidence_level

(optional) Numeric value for the level at which confidence intervals are determined. If bootstraps are used to determine the confidence intervals, familiar uses the rule of thumb \(n = 20 / ci.level\) to determine the number of required bootstraps.

The default value is 0.95.

bootstrap_ci_method

(optional) Method used to determine bootstrap confidence intervals (Efron and Hastie, 2016). The following methods are implemented:

percentile (default): Confidence intervals obtained using the percentile method.
bc: Bias-corrected confidence intervals.

Note that the standard method is not implemented because this method is often not suitable due to non-normal distributions. The bias-corrected and accelerated (BCa) method is not implemented yet.
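The two methods can be contrasted with a minimal sketch based on the textbook definitions of percentile and bias-corrected bootstrap intervals. The function bootstrap_ci and its arguments are hypothetical, and familiar's internal implementation may differ in its details.

```r
# Hypothetical helper, not part of familiar.
bootstrap_ci <- function(boot, point_estimate, confidence_level = 0.95,
                         method = c("percentile", "bc")) {
  method <- match.arg(method)
  alpha <- 1 - confidence_level

  # Percentile method: take the quantiles of the bootstrap distribution.
  probs <- c(alpha / 2, 1 - alpha / 2)

  if (method == "bc") {
    # Bias correction: z0 measures how far the point estimate lies from the
    # median of the bootstrap distribution. z0 is infinite if all bootstrap
    # values lie on one side of the point estimate.
    z0 <- qnorm(mean(boot < point_estimate))
    probs <- pnorm(2 * z0 + qnorm(probs))
  }

  unname(quantile(boot, probs = probs))
}

# Made-up bootstrap estimates.
boot <- 1:100

ci_percentile <- bootstrap_ci(boot, point_estimate = 50.5, method = "percentile")
ci_bc_centred <- bootstrap_ci(boot, point_estimate = 50.5, method = "bc")
ci_bc_shifted <- bootstrap_ci(boot, point_estimate = 60, method = "bc")
```

When the point estimate sits at the median of the bootstrap distribution, z0 is zero and both methods give the same interval. When it does not, the bc interval shifts in the direction of the point estimate.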

is_pre_processed

Flag that indicates whether the data was already pre-processed externally, e.g. normalised and clustered. Only used if the data argument is a data.table or data.frame.

message_indent

Number of indentation steps for messages shown during computation and extraction of various data elements.

verbose

Flag to indicate whether feedback should be provided on the computation and extraction of various data elements.

...

Unused arguments.