export_model_vimp: Extract and export model-based variable importance.

Description

Extract and export model-based variable importance from a familiarCollection.

Usage

export_model_vimp(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  aggregation_method = waiver(),
  rank_threshold = waiver(),
  export_collection = FALSE,
  ...
)
# S4 method for familiarCollection
export_model_vimp(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  aggregation_method = waiver(),
  rank_threshold = waiver(),
  export_collection = FALSE,
  ...
)
# S4 method for ANY
export_model_vimp(
  object,
  dir_path = NULL,
  aggregate_results = TRUE,
  aggregation_method = waiver(),
  rank_threshold = waiver(),
  export_collection = FALSE,
  ...
)

Value

A data.table (if dir_path is not provided). Otherwise nothing is returned, as all data are exported to csv files.

Arguments

object

A familiarCollection object, or other objects from which a familiarCollection can be extracted. See details for more information.

dir_path

Path to folder where extracted data should be saved. NULL will allow export as a structured list of data.tables.

aggregate_results

Flag that signifies whether results should be aggregated for export.

aggregation_method

(optional) The method used to aggregate variable importances over different data subsets, e.g. bootstraps. The following methods can be selected:

mean (default): Use the mean rank of a feature over the subsets to determine the aggregated feature rank.
median: Use the median rank of a feature over the subsets to determine the aggregated feature rank.
best: Use the best rank the feature obtained in any subset to determine the aggregated feature rank.
worst: Use the worst rank the feature obtained in any subset to determine the aggregated feature rank.
stability: Use the frequency of the feature being in the subset of highly ranked features as measure for the aggregated feature rank (Meinshausen and Buehlmann, 2010).
exponential: Use a rank-weighted frequency of occurrence in the subset of highly ranked features as measure for the aggregated feature rank (Haury et al., 2011).
borda: Use the Borda count as measure for the aggregated feature rank (Wald et al., 2012).
enhanced_borda: Use an occurrence frequency-weighted Borda count as measure for the aggregated feature rank (Wald et al., 2012).
truncated_borda: Use the Borda count computed only on features within the subset of highly ranked features.
enhanced_truncated_borda: Apply both the enhanced Borda method and the truncated Borda method and use the resulting Borda count as the aggregated feature rank.
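
The simpler aggregation methods listed above can be illustrated in base R. This is a minimal sketch on made-up ranks, not the implementation used by familiar: the `ranks` matrix, the feature names and the plain Borda scoring rule (rank r among n features earns n - r + 1 points per subset) are assumptions chosen for illustration.

```r
# Made-up example data: rows are features, columns are data subsets
# (e.g. bootstraps). Entries are ranks, with 1 the most important feature.
ranks <- matrix(
  c(1, 2, 1,
    2, 1, 3,
    3, 4, 2,
    4, 3, 4),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("feature_a", "feature_b", "feature_c", "feature_d"),
    c("subset_1", "subset_2", "subset_3")
  )
)

n_features <- nrow(ranks)

# One aggregated score per feature and method.
scores <- data.frame(
  mean   = rowMeans(ranks),            # lower is better
  median = apply(ranks, 1, median),    # lower is better
  best   = apply(ranks, 1, min),       # lower is better
  worst  = apply(ranks, 1, max),       # lower is better
  # Assumed simple Borda count: n - r + 1 points per subset; higher is better.
  borda  = rowSums(n_features - ranks + 1)
)

# Turn scores into aggregated ranks (1 = most important).
aggregated_rank <- data.frame(
  mean  = rank(scores$mean, ties.method = "min"),
  borda = rank(-scores$borda, ties.method = "min"),
  row.names = rownames(ranks)
)

print(scores)
print(aggregated_rank)
```

Rank-based scores (mean, median, best, worst) are ordered ascending, whereas the Borda count is ordered descending, which is why the Borda score is negated before ranking.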

rank_threshold

(optional) The threshold used to define the subset of highly important features. If not set, this threshold is determined by maximising the variance in the occurrence value over all features, evaluated over the subset size.

This parameter is only relevant for stability, exponential, enhanced_borda, truncated_borda and enhanced_truncated_borda methods.
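
To make the role of the threshold concrete, the sketch below computes an occurrence frequency for a fixed threshold and then picks the threshold that maximises the variance of that frequency across features. It is one possible reading of the description above on made-up ranks, not familiar's internal code; `ranks`, `occurrence_at` and `best_threshold` are hypothetical names.

```r
# Made-up example data: rows are features, columns are data subsets.
# Entries are ranks, with 1 the most important feature.
ranks <- matrix(
  c(1, 2, 1,
    2, 1, 3,
    3, 4, 2,
    4, 3, 4),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("feature_a", "feature_b", "feature_c", "feature_d"),
    c("subset_1", "subset_2", "subset_3")
  )
)

# Hypothetical helper: fraction of subsets in which each feature is ranked
# within the top `threshold` features.
occurrence_at <- function(ranks, threshold) {
  rowMeans(ranks <= threshold)
}

# Occurrence for a user-supplied threshold of 2.
occurrence <- occurrence_at(ranks, threshold = 2)

# Assumed automatic choice: try every subset size and keep the one for which
# the occurrence values differ most between features (largest variance).
candidate_thresholds <- seq_len(nrow(ranks))
occurrence_variance <- sapply(
  candidate_thresholds,
  function(k) var(occurrence_at(ranks, k))
)
best_threshold <- candidate_thresholds[which.max(occurrence_variance)]

print(occurrence)
print(best_threshold)
```

A threshold that is too small or too large gives most features the same occurrence (near 0 or near 1), so the variance criterion favours thresholds that separate features well.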

export_collection

(optional) Exports the collection if TRUE.

...

Arguments passed on to as_familiar_collection

familiar_data_names: Names of the dataset(s). Only used if the object parameter is one or more familiarData objects.
collection_name: Name of the collection.

Details

Data, such as model performance and calibration information, is usually collected from a familiarCollection object. However, you can also provide one or more familiarData objects, which will be internally converted to a familiarCollection object. It is also possible to provide a familiarEnsemble or one or more familiarModel objects together with the data from which results are computed prior to export. Paths to files containing these objects can also be provided.

All parameters aside from object and dir_path are only used if object is not a familiarCollection object, or a path to one.

Variable importance is based on the ranking produced by model-specific variable importance routines, e.g. permutation for random forests. If such a routine is absent, variable importance is based on the feature selection method that led to the features included in the model. If multiple models (familiarModel objects) are combined, feature ranks are first aggregated using the method set by aggregation_method. Some of these methods require a rank_threshold to indicate the subset of most important features.

Information concerning highly similar features that form clusters is provided as well. This information is based on consensus clustering of the features that were used in the signatures of the underlying models. This clustering information is also used during aggregation to ensure that co-clustered features are only taken into account once.
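
A call might look like the sketch below. It assumes that the familiar package is installed and that `collection` is a previously created familiarCollection object (or a path to one). `collection` is a hypothetical name that is not defined here, so the calls are guarded and skipped when it is unavailable.

```r
# ASSUMPTION: `collection` is a hypothetical, previously created
# familiarCollection object (or a path to one). It is not created here.
vimp <- NULL

if (requireNamespace("familiar", quietly = TRUE) && exists("collection")) {
  # With dir_path = NULL the variable importance data are returned
  # instead of being written to disk.
  vimp <- familiar::export_model_vimp(
    object = collection,
    dir_path = NULL
  )

  # With a directory, the data are exported to csv files in that folder.
  familiar::export_model_vimp(
    object = collection,
    dir_path = tempdir()
  )
}
```

Because the remaining parameters are only used when object is not a familiarCollection (or a path to one), aggregation_method and rank_threshold are left at their defaults in this sketch.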