.parse_feature_selection_settings: Internal function for parsing settings related to feature selection

Description

Internal function for parsing settings related to feature selection

Usage

.parse_feature_selection_settings(
  config = NULL,
  data,
  parallel,
  outcome_type,
  fs_method = waiver(),
  fs_method_parameter = waiver(),
  vimp_aggregation_method = waiver(),
  vimp_aggregation_rank_threshold = waiver(),
  parallel_feature_selection = waiver(),
  ...
)

Value

List of parameters related to feature selection.

Arguments

config

A list of settings, e.g. from an xml file.

data

Data set as loaded using the .load_data function.

parallel

Logical value that indicates whether familiar uses parallelisation. If FALSE, it overrides parallel_feature_selection.

outcome_type

Type of outcome found in the data set.

fs_method

(required) Feature selection method to be used for determining variable importance. familiar implements various feature selection methods. Please refer to the vignette on feature selection methods for more details.

More than one feature selection method can be chosen. The experiment will then be repeated for each feature selection method.

Feature selection methods determine the ranking of features. The actual selection of features is done by optimising the signature size model hyperparameter during the hyperparameter optimisation step.

fs_method_parameter

(optional) List of lists containing parameters for feature selection methods. Each sublist should have the name of the feature selection method it corresponds to.

Most feature selection methods do not have parameters that can be set. Please refer to the vignette on feature selection methods for more details. Note that if the feature selection method is based on a learner (e.g. lasso regression), hyperparameter optimisation may be performed prior to assessing variable importance.
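As an illustration of the "list of lists" structure described above, here is a minimal sketch. The method names (method_a, method_b) and the parameter name (some_parameter) are hypothetical placeholders, not options documented for familiar. Consult the vignette for the names that are actually supported.

```r
# Hypothetical placeholders: these are NOT real familiar method or
# parameter names; they only illustrate the expected nesting.
fs_method <- c("method_a", "method_b")

fs_method_parameter <- list(
  # Sublist named after the feature selection method it applies to.
  "method_a" = list("some_parameter" = 0.5),
  # A method without settable parameters can have an empty sublist
  # (or, presumably, be left out altogether).
  "method_b" = list()
)

# Every sublist name should match one of the selected methods.
names_match <- all(names(fs_method_parameter) %in% fs_method)
```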

vimp_aggregation_method

(optional) The method used to aggregate variable importances over different data subsets, e.g. bootstraps. The following methods can be selected:

none: Don't aggregate ranks, but rather aggregate the variable importance scores themselves.
mean: Use the mean rank of a feature over the subsets to determine the aggregated feature rank.
median: Use the median rank of a feature over the subsets to determine the aggregated feature rank.
best: Use the best rank the feature obtained in any subset to determine the aggregated feature rank.
worst: Use the worst rank the feature obtained in any subset to determine the aggregated feature rank.
stability: Use the frequency of the feature being in the subset of highly ranked features as measure for the aggregated feature rank (Meinshausen and Buehlmann, 2010).
exponential: Use a rank-weighted frequency of occurrence in the subset of highly ranked features as measure for the aggregated feature rank (Haury et al., 2011).
borda (default): Use the Borda count as measure for the aggregated feature rank (Wald et al., 2012).
enhanced_borda: Use an occurrence frequency-weighted Borda count as measure for the aggregated feature rank (Wald et al., 2012).
truncated_borda: Use the Borda count computed only on features within the subset of highly ranked features.
enhanced_truncated_borda: Apply both the enhanced Borda method and the truncated Borda method and use the resulting Borda count as the aggregated feature rank.

The feature selection methods vignette provides additional information.
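To make the rank-based options more concrete, the sketch below computes the mean, median, best and worst ranks and a simple Borda-style score on a toy rank matrix. It is an illustration in base R, not familiar's implementation. In particular, the Borda scoring used here (each subset awards n_features - rank + 1 points) is one common formulation, and the normalisation used by familiar may differ.

```r
# Toy rank matrix: rows are features, columns are data subsets (e.g. bootstraps).
# Rank 1 denotes the most important feature within a subset.
ranks <- matrix(
  c(1, 1, 1, 2,
    2, 2, 3, 1,
    3, 4, 2, 3,
    4, 3, 4, 4),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("feature_a", "feature_b", "feature_c", "feature_d"), NULL)
)

# Per-feature summaries over the subsets.
mean_rank   <- apply(ranks, 1, mean)    # "mean"
median_rank <- apply(ranks, 1, median)  # "median"
best_rank   <- apply(ranks, 1, min)     # "best" (lowest rank number)
worst_rank  <- apply(ranks, 1, max)     # "worst" (highest rank number)

# Borda-style score (assumed formulation): a feature ranked r in a subset
# receives n_features - r + 1 points; points are summed over subsets.
n_features  <- nrow(ranks)
borda_score <- rowSums(n_features - ranks + 1)

# A higher score corresponds to a better (lower) aggregated rank.
aggregated_rank <- rank(-borda_score, ties.method = "min")
```

In this toy example feature_a collects the highest Borda-style score and therefore receives aggregated rank 1.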

vimp_aggregation_rank_threshold

(optional) The rank threshold used to define the subset of highly important features. If not set, the threshold is determined automatically by choosing the subset size that maximises the variance in the occurrence value across all features.

This parameter is only relevant for stability, exponential, enhanced_borda, truncated_borda and enhanced_truncated_borda methods.
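The sketch below shows one possible reading of how such a threshold could be chosen and then used for a stability-style score. It assumes that the "occurrence value" of a feature is the fraction of subsets in which its rank is at or below the threshold. The helper occurrence() is hypothetical and not part of familiar, and the actual procedure may differ.

```r
# Toy rank matrix: rows are features, columns are data subsets.
ranks <- matrix(
  c(1, 1, 1, 2,
    2, 2, 3, 1,
    3, 4, 2, 3,
    4, 3, 4, 4),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("feature_a", "feature_b", "feature_c", "feature_d"), NULL)
)

# Hypothetical helper: fraction of subsets in which each feature is ranked
# within the top `threshold` features.
occurrence <- function(ranks, threshold) {
  rowMeans(ranks <= threshold)
}

# Evaluate every candidate subset size and keep the one for which the
# occurrence values differ most between features (largest variance).
candidate_thresholds <- seq_len(nrow(ranks))
occurrence_variance <- sapply(
  candidate_thresholds,
  function(k) var(occurrence(ranks, k))
)
threshold <- candidate_thresholds[which.max(occurrence_variance)]

# Stability-style score: occurrence frequency at the chosen threshold.
stability_score <- occurrence(ranks, threshold)
```

For this toy matrix the variance is largest at a threshold of 2. At that threshold feature_a is always among the highly ranked features and feature_d never is.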

parallel_feature_selection

(optional) Enable parallel processing for the feature selection workflow. Defaults to TRUE. When set to FALSE, this will disable the use of parallel processing while performing feature selection, regardless of the settings of the parallel parameter. parallel_feature_selection is ignored if parallel=FALSE.
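The interaction between the two flags can be summarised with a small sketch. The function resolve_parallel_feature_selection() is a hypothetical helper written for illustration and does not exist in familiar. It only encodes the rule stated above: parallel processing during feature selection requires both flags to be TRUE.

```r
# Hypothetical helper, not part of familiar: returns TRUE only when both the
# global `parallel` flag and `parallel_feature_selection` are TRUE.
resolve_parallel_feature_selection <- function(parallel,
                                               parallel_feature_selection = TRUE) {
  isTRUE(parallel) && isTRUE(parallel_feature_selection)
}

both_enabled   <- resolve_parallel_feature_selection(TRUE, TRUE)    # TRUE
step_disabled  <- resolve_parallel_feature_selection(TRUE, FALSE)   # FALSE
global_off     <- resolve_parallel_feature_selection(FALSE, TRUE)   # FALSE
```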

...

Unused arguments.

References

Wald, R., Khoshgoftaar, T. M., Dittman, D., Awada, W. & Napolitano, A. An extensive comparison of feature ranking aggregation techniques in bioinformatics. in 2012 IEEE 13th International Conference on Information Reuse & Integration (IRI) 377–384 (2012).
Meinshausen, N. & Buehlmann, P. Stability selection. J. R. Stat. Soc. Series B Stat. Methodol. 72, 417–473 (2010).
Haury, A.-C., Gestraud, P. & Vert, J.-P. The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures. PLoS One 6, e28210 (2011).