Extract and export feature selection variable importance from a familiarCollection.
export_fs_vimp(
object,
dir_path = NULL,
aggregate_results = TRUE,
aggregation_method = waiver(),
rank_threshold = waiver(),
export_collection = FALSE,
...
)
# S4 method for familiarCollection
export_fs_vimp(
object,
dir_path = NULL,
aggregate_results = TRUE,
aggregation_method = waiver(),
rank_threshold = waiver(),
export_collection = FALSE,
...
)
# S4 method for ANY
export_fs_vimp(
object,
dir_path = NULL,
aggregate_results = TRUE,
aggregation_method = waiver(),
rank_threshold = waiver(),
export_collection = FALSE,
...
)
A data.table (if dir_path is not provided), or nothing if dir_path is provided, as all data is then exported to csv files.
object
A familiarCollection object, or other objects from which a familiarCollection can be extracted. See details for more information.
dir_path
Path to the folder where extracted data should be saved. If NULL, the data is instead returned as a structured list of data.tables.
aggregate_results
Flag that signifies whether results should be aggregated for export.
aggregation_method
(optional) The method used to aggregate variable importances over different data subsets, e.g. bootstraps. The following methods can be selected:
mean (default): Use the mean rank of a feature over the subsets to determine the aggregated feature rank.
median: Use the median rank of a feature over the subsets to determine the aggregated feature rank.
best: Use the best rank the feature obtained in any subset to determine the aggregated feature rank.
worst: Use the worst rank the feature obtained in any subset to determine the aggregated feature rank.
stability: Use the frequency of the feature being in the subset of highly ranked features as measure for the aggregated feature rank (Meinshausen and Buehlmann, 2010).
exponential: Use a rank-weighted frequency of occurrence in the subset of highly ranked features as measure for the aggregated feature rank (Haury et al., 2011).
borda: Use the Borda count as measure for the aggregated feature rank (Wald et al., 2012).
enhanced_borda: Use an occurrence frequency-weighted Borda count as measure for the aggregated feature rank (Wald et al., 2012).
truncated_borda: Use the Borda count computed only on features within the subset of highly ranked features.
enhanced_truncated_borda: Apply both the enhanced Borda method and the truncated Borda method and use the resulting Borda count as the aggregated feature rank.
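The rank-based methods can be illustrated with a small base R sketch. It is not the package's implementation: the rank matrix, the feature names and the Borda scoring rule (n - r + 1 points for rank r among n features) are assumptions made for illustration, and the package's exact scoring may differ.

```r
# Hypothetical rank matrix: rows are features, columns are bootstrap subsets.
# Rank 1 is the most important feature within a subset.
ranks <- matrix(
  c(1, 2, 1,
    2, 1, 3,
    3, 4, 2,
    4, 3, 4),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("age", "stage", "size", "grade"), paste0("boot_", 1:3))
)

# Simple rank statistics per feature over the subsets.
mean_rank   <- rowMeans(ranks)
median_rank <- apply(ranks, 1, median)
best_rank   <- apply(ranks, 1, min)
worst_rank  <- apply(ranks, 1, max)

# Borda count (assumed formulation): in each subset, a feature ranked r out of
# n features scores n - r + 1 points; points are summed over the subsets.
n_features  <- nrow(ranks)
borda_score <- rowSums(n_features - ranks + 1)

# Convert to aggregated ranks. A lower mean rank is better, whereas a higher
# Borda score is better. Tied features share the smallest rank.
aggregated <- data.frame(
  feature    = rownames(ranks),
  mean_rank  = rank(mean_rank, ties.method = "min"),
  borda_rank = rank(-borda_score, ties.method = "min")
)
print(aggregated)
```

Note that best and worst can produce many ties when features swap positions between subsets, which is one reason to prefer mean, median or a Borda variant when the number of subsets is small.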
rank_threshold
(optional) The threshold used to define the subset of highly important features. If not set, this threshold is determined by maximising the variance in the occurrence value over all features over the subset size.
This parameter is only relevant for the stability, exponential, enhanced_borda, truncated_borda and enhanced_truncated_borda methods.
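The role of the threshold can be sketched in base R as follows. This is one reading of the description, not the package's code: the rank matrix is hypothetical, the truncated Borda scoring (threshold - r + 1 points for rank r within the highly ranked subset) is an assumed formulation, and the automatic threshold is interpreted as the subset size that maximises the variance of occurrence frequencies across features.

```r
# Hypothetical rank matrix: rows are features, columns are bootstrap subsets.
ranks <- matrix(
  c(1, 2, 1,
    2, 1, 3,
    3, 4, 2,
    4, 3, 4),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("age", "stage", "size", "grade"), paste0("boot_", 1:3))
)

# Features ranked 1 or 2 in a subset count as highly ranked.
rank_threshold <- 2

# Membership of the highly ranked subset, per feature and bootstrap.
in_top <- ranks <= rank_threshold

# Stability: fraction of subsets in which a feature is highly ranked.
stability <- rowMeans(in_top)

# Truncated Borda (assumed formulation): only highly ranked features score,
# earning rank_threshold - r + 1 points per subset.
truncated_borda <- rowSums(in_top * (rank_threshold - ranks + 1))

# Automatic threshold (assumed interpretation): try each subset size and keep
# the one for which occurrence frequencies vary most between features.
candidates     <- seq_len(nrow(ranks) - 1)
occurrence_var <- sapply(candidates, function(k) var(rowMeans(ranks <= k)))
auto_threshold <- candidates[which.max(occurrence_var)]
```

A threshold that is too small or too large makes occurrence frequencies nearly identical for most features, so the variance criterion favours a subset size that separates frequently selected features from the rest.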
export_collection
(optional) Exports the collection if TRUE.
...
Arguments passed on to as_familiar_collection:
familiar_data_names
Names of the dataset(s). Only used if the object parameter is one or more familiarData objects.
collection_name
Name of the collection.
Data, such as model performance and calibration information, is usually collected from a familiarCollection object. However, you can also provide one or more familiarData objects, which will be converted internally to a familiarCollection object. Paths to files containing these objects can also be provided.
Unlike other export functions, export using familiarEnsemble or familiarModel objects is not possible. This is because feature selection variable importance is not stored within familiarModel objects.
All parameters aside from object and dir_path are only used if object is not a familiarCollection object, or a path to one.
Variable importance is based on the ranking produced by feature selection routines. If feature selection was performed repeatedly, e.g. using bootstraps, feature ranks are first aggregated using the method defined by aggregation_method. Some of these methods require a rank_threshold to indicate the subset of most important features.
Information concerning highly similar features that form clusters is provided as well. This information is based on consensus clustering of the features. This clustering information is also used during aggregation to ensure that co-clustered features are only taken into account once.