.parse_evaluation_settings: Internal function for parsing settings related to model evaluation

Description

Internal function for parsing settings related to model evaluation

Usage

.parse_evaluation_settings(
  config = NULL,
  data,
  parallel,
  outcome_type,
  hpo_metric,
  development_batch_id,
  vimp_aggregation_method,
  vimp_aggregation_rank_threshold,
  prep_cluster_method,
  prep_cluster_linkage_method,
  prep_cluster_cut_method,
  prep_cluster_similarity_threshold,
  prep_cluster_similarity_metric,
  evaluate_top_level_only = waiver(),
  skip_evaluation_elements = waiver(),
  ensemble_method = waiver(),
  evaluation_metric = waiver(),
  sample_limit = waiver(),
  detail_level = waiver(),
  estimation_type = waiver(),
  aggregate_results = waiver(),
  confidence_level = waiver(),
  bootstrap_ci_method = waiver(),
  feature_cluster_method = waiver(),
  feature_cluster_cut_method = waiver(),
  feature_linkage_method = waiver(),
  feature_similarity_metric = waiver(),
  feature_similarity_threshold = waiver(),
  sample_cluster_method = waiver(),
  sample_linkage_method = waiver(),
  sample_similarity_metric = waiver(),
  eval_aggregation_method = waiver(),
  eval_aggregation_rank_threshold = waiver(),
  eval_icc_type = waiver(),
  stratification_method = waiver(),
  stratification_threshold = waiver(),
  time_max = waiver(),
  evaluation_times = waiver(),
  dynamic_model_loading = waiver(),
  parallel_evaluation = waiver(),
  ...
)

Value

List of parameters related to model evaluation.

Arguments

config

A list of settings, e.g. from an xml file.

data

Data set as loaded using the .load_data function.

parallel

Logical value that whether familiar uses parallelisation. If FALSE it will override parallel_evaluation.

outcome_type

Type of outcome found in the data set.

hpo_metric

Metric defined for hyperparameter optimisation.

development_batch_id

Identifiers of batches used for model development. These identifiers are used to determine the cohorts used to determine a setting for time_max, if the outcome_type is survival, and both time_max and evaluation_times are not provided.

vimp_aggregation_method

Method for variable importance aggregation that was used for feature selection.

vimp_aggregation_rank_threshold

Rank threshold for variable importance aggregation used during feature selection.

prep_cluster_method

Cluster method used during pre-processing.

prep_cluster_linkage_method

Cluster linkage method used during pre-processing.

prep_cluster_cut_method

Cluster cut method used during pre-processing.

prep_cluster_similarity_threshold

Cluster similarity threshold used during pre-processing.

prep_cluster_similarity_metric

Cluster similarity metric used during pre-processing.

evaluate_top_level_only

(optional) Flag that signals that only evaluation at the most global experiment level is required. Consider a cross-validation experiment with additional external validation. The global experiment level consists of data that are used for development, internal validation and external validation. The next lower experiment level are the individual cross-validation iterations.

When the flag is true, evaluations take place on the global level only, and no results are generated for the next lower experiment levels. In our example, this means that results from individual cross-validation iterations are not computed and shown. When the flag is false, results are computed from both the global layer and the next lower level.

Setting the flag to true saves computation time.

skip_evaluation_elements

(optional) Specifies which evaluation steps, if any, should be skipped as part of the evaluation process. Defaults to none, which means that all relevant evaluation steps are performed. It can have one or more of the following values:

none, false: no steps are skipped.
all, true: all steps are skipped.
auc_data: data for assessing and plotting the area under the receiver operating characteristic curve are not computed.
calibration_data: data for assessing and plotting model calibration are not computed.
calibration_info: data required to assess calibration, such as baseline survival curves, are not collected. These data will still be present in the models.
confusion_matrix: data for assessing and plotting a confusion matrix are not collected.
decision_curve_analyis: data for performing a decision curve analysis are not computed.
feature_expressions: data for assessing and plotting sample clustering are not computed.
feature_similarity: data for assessing and plotting feature clusters are not computed.
fs_vimp: data for assessing and plotting feature selection-based variable importance are not collected.
hyperparameters: data for assessing model hyperparameters are not collected. These data will still be present in the models.
ice_data: data for individual conditional expectation and partial dependence plots are not created.
model_performance: data for assessing and visualising model performance are not created.
model_vimp: data for assessing and plotting model-based variable importance are not collected.
permutation_vimp: data for assessing and plotting model-agnostic permutation variable importance are not computed.
prediction_data: predictions for each sample are not made and exported.
risk_stratification_data: data for assessing and plotting Kaplan-Meier survival curves are not collected.
risk_stratification_info: data for assessing stratification into risk groups are not computed.
univariate_analysis: data for assessing and plotting univariate feature importance are not computed.

ensemble_method

(optional) Method for ensembling predictions from models for the same sample. Available methods are:

median (default): Use the median of the predicted values as the ensemble value for a sample.
mean: Use the mean of the predicted values as the ensemble value for a sample.

This parameter is only used if detail_level is ensemble.

evaluation_metric

(optional) One or more metrics for assessing model performance. See the vignette on performance metrics for the available metrics.

Confidence intervals (or rather credibility intervals) are computed for each metric during evaluation. This is done using bootstraps, the number of which depends on the value of confidence_level (Davison and Hinkley, 1997).

If unset, the metric in the optimisation_metric variable is used.

sample_limit

(optional) Set the upper limit of the number of samples that are used during evaluation steps. Cannot be less than 20.

This setting can be specified per data element by providing a parameter value in a named list with data elements, e.g. list("sample_similarity"=100, "permutation_vimp"=1000).

This parameter can be set for the following data elements: sample_similarity and ice_data.

detail_level

(optional) Sets the level at which results are computed and aggregated.

ensemble: Results are computed at the ensemble level, i.e. over all models in the ensemble. This means that, for example, bias-corrected estimates of model performance are assessed by creating (at least) 20 bootstraps and computing the model performance of the ensemble model for each bootstrap.
hybrid (default): Results are computed at the level of models in an ensemble. This means that, for example, bias-corrected estimates of model performance are directly computed using the models in the ensemble. If there are at least 20 trained models in the ensemble, performance is computed for each model, in contrast to ensemble where performance is computed for the ensemble of models. If there are less than 20 trained models in the ensemble, bootstraps are created so that at least 20 point estimates can be made.
model: Results are computed at the model level. This means that, for example, bias-corrected estimates of model performance are assessed by creating (at least) 20 bootstraps and computing the performance of the model for each bootstrap.

Note that each level of detail has a different interpretation for bootstrap confidence intervals. For ensemble and model these are the confidence intervals for the ensemble and an individual model, respectively. That is, the confidence interval describes the range where an estimate produced by a respective ensemble or model trained on a repeat of the experiment may be found with the probability of the confidence level. For hybrid, it represents the range where any single model trained on a repeat of the experiment may be found with the probability of the confidence level. By definition, confidence intervals obtained using hybrid are at least as wide as those for ensemble. hybrid offers the correct interpretation if the goal of the analysis is to assess the result of a single, unspecified, model.

hybrid is generally computationally less expensive then ensemble, which in turn is somewhat less expensive than model.

A non-default detail_level parameter can be specified for separate evaluation steps by providing a parameter value in a named list with data elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid"). This parameter can be set for the following data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, prediction_data and confusion_matrix.

estimation_type

(optional) Sets the type of estimation that should be possible. This has the following options:

point: Point estimates.
bias_correction or bc: Bias-corrected estimates. A bias-corrected estimate is computed from (at least) 20 point estimates, and familiar may bootstrap the data to create them.
bootstrap_confidence_interval or bci (default): Bias-corrected estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The number of point estimates required depends on the confidence_level parameter, and familiar may bootstrap the data to create them.

As with detail_level, a non-default estimation_type parameter can be specified for separate evaluation steps by providing a parameter value in a named list with data elements, e.g. list("auc_data"="bci", "model_performance"="point"). This parameter can be set for the following data elements: auc_data, decision_curve_analyis, model_performance, permutation_vimp, ice_data, and prediction_data.

aggregate_results

(optional) Flag that signifies whether results should be aggregated during evaluation. If estimation_type is bias_correction or bc, aggregation leads to a single bias-corrected estimate. If estimation_type is bootstrap_confidence_interval or bci, aggregation leads to a single bias-corrected estimate with lower and upper boundaries of the confidence interval. This has no effect if estimation_type is point.

The default value is equal to TRUE except when assessing metrics to assess model performance, as the default violin plot requires underlying data.

As with detail_level and estimation_type, a non-default aggregate_results parameter can be specified for separate evaluation steps by providing a parameter value in a named list with data elements, e.g. list("auc_data"=TRUE, , "model_performance"=FALSE). This parameter exists for the same elements as estimation_type.

confidence_level

(optional) Numeric value for the level at which confidence intervals are determined. In the case bootstraps are used to determine the confidence intervals bootstrap estimation, familiar uses the rule of thumb \(n = 20 / ci.level\) to determine the number of required bootstraps.

The default value is 0.95.

bootstrap_ci_method

(optional) Method used to determine bootstrap confidence intervals (Efron and Hastie, 2016). The following methods are implemented:

percentile (default): Confidence intervals obtained using the percentile method.
bc: Bias-corrected confidence intervals.

Note that the standard method is not implemented because this method is often not suitable due to non-normal distributions. The bias-corrected and accelerated (BCa) method is not implemented yet.

feature_cluster_method

(optional) Method used to perform clustering of features. The same methods as for the cluster_method configuration parameter are available: none, hclust, agnes, diana and pam.

The value for the cluster_method configuration parameter is used by default. When generating clusters for the purpose of determining mutual correlation and ordering feature expressions, none is ignored and hclust is used instead.

feature_cluster_cut_method

(optional) Method used to divide features into separate clusters. The available methods are the same as for the cluster_cut_method configuration parameter: silhouette, fixed_cut and dynamic_cut.

silhouette is available for all cluster methods, but fixed_cut only applies to methods that create hierarchical trees (hclust, agnes and diana). dynamic_cut requires the dynamicTreeCut package and can only be used with agnes and hclust.

The value for the cluster_cut_method configuration parameter is used by default.

feature_linkage_method

(optional) Method used for agglomerative clustering with hclust and agnes. Linkage determines how features are sequentially combined into clusters based on distance. The methods are shared with the cluster_linkage_method configuration parameter: average, single, complete, weighted, and ward.

The value for the cluster_linkage_method configuration parameters is used by default.

feature_similarity_metric

(optional) Metric to determine pairwise similarity between features. Similarity is computed in the same manner as for clustering, and feature_similarity_metric therefore has the same options as cluster_similarity_metric: mcfadden_r2, cox_snell_r2, nagelkerke_r2, mutual_information, spearman, kendall and pearson.

The value used for the cluster_similarity_metric configuration parameter is used by default.

feature_similarity_threshold

(optional) The threshold level for pair-wise similarity that is required to form feature clusters with the fixed_cut method. This threshold functions in the same manner as the one defined using the cluster_similarity_threshold parameter.

By default, the value for the cluster_similarity_threshold configuration parameter is used.

Unlike for cluster_similarity_threshold, more than one value can be supplied here.

sample_cluster_method

(optional) The method used to perform clustering based on distance between samples. These are the same methods as for the cluster_method configuration parameter: hclust, agnes, diana and pam.

The value for the cluster_method configuration parameter is used by default. When generating clusters for the purpose of ordering samples in feature expressions, none is ignored and hclust is used instead.

sample_linkage_method

(optional) The method used for agglomerative clustering in hclust and agnes. These are the same methods as for the cluster_linkage_method configuration parameter: average, single, complete, weighted, and ward.

The value for the cluster_linkage_method configuration parameters is used by default.

sample_similarity_metric

(optional) Metric to determine pairwise similarity between samples. Similarity is computed in the same manner as for clustering, but sample_similarity_metric has different options that are better suited to computing distance between samples instead of between features. The following metrics are available.

gower (default): compute Gower's distance between samples. By default, Gower's distance is computed based on winsorised data to reduce the effect of outliers (see below).
euclidean: compute the Euclidean distance between samples.

The underlying feature data for numerical features is scaled to the \([0,1]\) range using the feature values across the samples. The normalisation parameters required can optionally be computed from feature data with the outer 5% (on both sides) of feature values trimmed or winsorised. To do so append _trim (trimming) or _winsor (winsorising) to the metric name. This reduces the effect of outliers somewhat.

Regardless of metric, all categorical features are handled as for the Gower's distance: distance is 0 if the values in a pair of samples match, and 1 if they do not.

eval_aggregation_method

(optional) Method for aggregating variable importances for the purpose of evaluation. Variable importances are determined during feature selection steps and after training the model. Both types are evaluated, but feature selection variable importance is only evaluated at run-time.

See the documentation for the vimp_aggregation_method argument for information concerning the different methods available.

eval_aggregation_rank_threshold

(optional) The threshold used to define the subset of highly important features during evaluation.

See the documentation for the vimp_aggregation_rank_threshold argument for more information.

eval_icc_type

(optional) String indicating the type of intraclass correlation coefficient (1, 2 or 3) that should be used to compute robustness for features in repeated measurements during the evaluation of univariate importance. These types correspond to the types in Shrout and Fleiss (1979). The default value is 1.

stratification_method

(optional) Method for determining the stratification threshold for creating survival groups. The actual, model-dependent, threshold value is obtained from the development data, and can afterwards be used to perform stratification on validation data.

The following stratification methods are available:

median (default): The median predicted value in the development cohort is used to stratify the samples into two risk groups. For predicted outcome values that build a continuous spectrum, the two risk groups in the development cohort will be roughly equal in size.
mean: The mean predicted value in the development cohort is used to stratify the samples into two risk groups.
mean_trim: As mean, but based on the set of predicted values where the 5% lowest and 5% highest values are discarded. This reduces the effect of outliers.
mean_winsor: As mean, but based on the set of predicted values where the 5% lowest and 5% highest values are winsorised. This reduces the effect of outliers.
fixed: Samples are stratified based on the sample quantiles of the predicted values. These quantiles are defined using the stratification_threshold parameter.
optimised: Use maximally selected rank statistics to determine the optimal threshold (Lausen and Schumacher, 1992; Hothorn et al., 2003) to stratify samples into two optimally separated risk groups.

One or more stratification methods can be selected simultaneously.

This parameter is only relevant for survival outcomes.

stratification_threshold

(optional) Numeric value(s) signifying the sample quantiles for stratification using the fixed method. The number of risk groups will be the number of values +1.

The default value is c(1/3, 2/3), which will yield two thresholds that divide samples into three equally sized groups. If fixed is not among the selected stratification methods, this parameter is ignored.

This parameter is only relevant for survival outcomes.

time_max

(optional) Time point which is used as the benchmark for e.g. cumulative risks generated by random forest, or the cutoff for Uno's concordance index.

If time_max is not provided, but evaluation_times is, the largest value of evaluation_times is used. If both are not provided, time_max is set to the 98th percentile of the distribution of survival times for samples with an event in the development data set.

This parameter is only relevant for survival outcomes.

evaluation_times

(optional) One or more time points that are used for assessing calibration in survival problems. This is done as expected and observed survival probabilities depend on time.

If unset, evaluation_times will be equal to time_max.

This parameter is only relevant for survival outcomes.

dynamic_model_loading

(optional) Enables dynamic loading of models during the evaluation process, if TRUE. Defaults to FALSE. Dynamic loading of models may reduce the overall memory footprint, at the cost of increased disk or network IO. Models can only be dynamically loaded if they are found at an accessible disk or network location. Setting this parameter to TRUE may help if parallel processing causes out-of-memory issues during evaluation.

parallel_evaluation

(optional) Enable parallel processing for hyperparameter optimisation. Defaults to TRUE. When set to FALSE, this will disable the use of parallel processing while performing optimisation, regardless of the settings of the parallel parameter. The parameter moreover specifies whether parallelisation takes place within the evaluation process steps (inner, default), or in an outer loop ( outer) over learners, data subsamples, etc.

parallel_evaluation is ignored if parallel=FALSE.

...

Unused arguments.

References

Davison, A. C. & Hinkley, D. V. Bootstrap methods and their application. (Cambridge University Press, 1997).
Efron, B. & Hastie, T. Computer Age Statistical Inference. (Cambridge University Press, 2016).
Lausen, B. & Schumacher, M. Maximally Selected Rank Statistics. Biometrics 48, 73 (1992).
Hothorn, T. & Lausen, B. On the exact distribution of maximally selected rank statistics. Comput. Stat. Data Anal. 43, 121–137 (2003).