Internal function for parsing settings related to feature selection
.parse_feature_selection_settings(
config = NULL,
data,
parallel,
outcome_type,
fs_method = waiver(),
fs_method_parameter = waiver(),
vimp_aggregation_method = waiver(),
vimp_aggregation_rank_threshold = waiver(),
parallel_feature_selection = waiver(),
...
)
List of parameters related to feature selection.
A list of settings, e.g. from an xml file.
Data set as loaded using the .load_data function.
Logical value that indicates whether familiar uses parallelisation. If FALSE, it overrides parallel_feature_selection.
Type of outcome found in the data set.
(required) Feature selection method to be used for determining variable importance. familiar implements various feature selection methods. Please refer to the vignette on feature selection methods for more details.
More than one feature selection method can be chosen. The experiment will then be repeated for each feature selection method.
Feature selection methods determine the ranking of features. Actual selection of features is done by optimising the signature size model hyperparameter during the hyperparameter optimisation step.
(optional) List of lists containing parameters for feature selection methods. Each sublist should have the name of the feature selection method it corresponds to.
Most feature selection methods do not have parameters that can be set. Please refer to the vignette on feature selection methods for more details. Note that if the feature selection method is based on a learner (e.g. lasso regression), hyperparameter optimisation may be performed prior to assessing variable importance.
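The list-of-lists structure described above can be sketched as follows. The method names (method_a, method_b) and the parameter name (some_parameter) are hypothetical placeholders and are not taken from familiar; the name check is only an illustration of the naming rule, not the check familiar itself performs.

```r
# Hypothetical method and parameter names, for illustration only.
fs_method <- c("method_a", "method_b")

# One sublist per feature selection method, named after that method.
# method_b has no sublist because it has no parameters to set.
fs_method_parameter <- list(
  "method_a" = list("some_parameter" = 0.5)
)

# Illustrative check: every sublist name should match a selected method.
unmatched <- setdiff(names(fs_method_parameter), fs_method)
if (length(unmatched) > 0) {
  stop(
    "Parameters were provided for methods that were not selected: ",
    paste(unmatched, collapse = ", ")
  )
}
```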
(optional) The method used to aggregate variable importances over different data subsets, e.g. bootstraps. The following methods can be selected:
none: Don't aggregate ranks, but rather aggregate the variable importance scores themselves.
mean: Use the mean rank of a feature over the subsets to determine the aggregated feature rank.
median: Use the median rank of a feature over the subsets to determine the aggregated feature rank.
best: Use the best rank the feature obtained in any subset to determine the aggregated feature rank.
worst: Use the worst rank the feature obtained in any subset to determine the aggregated feature rank.
stability: Use the frequency of the feature being in the subset of highly ranked features as measure for the aggregated feature rank (Meinshausen and Buehlmann, 2010).
exponential: Use a rank-weighted frequency of occurrence in the subset of highly ranked features as measure for the aggregated feature rank (Haury et al., 2011).
borda (default): Use the Borda count as measure for the aggregated feature rank (Wald et al., 2012).
enhanced_borda: Use an occurrence frequency-weighted Borda count as measure for the aggregated feature rank (Wald et al., 2012).
truncated_borda: Use the Borda count computed only on features within the subset of highly ranked features.
enhanced_truncated_borda: Apply both the enhanced Borda method and the truncated Borda method and use the resulting Borda count as the aggregated feature rank.
The feature selection methods vignette provides additional information.
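As a rough illustration of the rank-based aggregation methods, the sketch below aggregates the ranks of three hypothetical features over four data subsets. The data are made up, and the Borda count uses a simple textbook scoring (number of features minus rank, plus one, summed over subsets), which is an assumption and may differ from the exact formulation used in familiar.

```r
# Ranks of three hypothetical features (rows) in four data subsets (columns).
# Rank 1 denotes the most important feature within a subset.
ranks <- matrix(
  c(1, 1, 2, 1,
    2, 3, 1, 2,
    3, 2, 3, 3),
  nrow = 3,
  byrow = TRUE,
  dimnames = list(c("feature_a", "feature_b", "feature_c"), NULL)
)

# mean, median, best and worst: summarise the ranks of each feature.
mean_rank <- rowMeans(ranks)
median_rank <- apply(ranks, 1, median)
best_rank <- apply(ranks, 1, min)
worst_rank <- apply(ranks, 1, max)

# Simple Borda count (assumed scoring): a feature earns
# (n_features - rank + 1) points in every subset.
n_features <- nrow(ranks)
borda_score <- rowSums(n_features - ranks + 1)

# Convert scores to aggregated ranks. A lower mean rank is better,
# whereas a higher Borda score is better.
aggregated_rank_mean <- rank(mean_rank, ties.method = "min")
aggregated_rank_borda <- rank(-borda_score, ties.method = "min")
```

In this toy example both the mean rank and the Borda count order the features as feature_a, feature_b, feature_c.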
(optional) The threshold used to define the subset of highly important features. If not set, the threshold is chosen as the subset size that maximises the variance of the occurrence values across all features.
This parameter is only relevant for the stability, exponential, enhanced_borda, truncated_borda and enhanced_truncated_borda methods.
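The role of the threshold can be sketched as follows. The example assumes that the occurrence of a feature is the fraction of subsets in which its rank falls within the threshold, and that the default threshold is the subset size with the largest variance in occurrence across features. Both are one reading of the description above, not a copy of familiar's implementation, and the data are made up.

```r
# Ranks of four hypothetical features (rows) in four data subsets (columns).
ranks <- matrix(
  c(1, 1, 2, 1,
    2, 3, 1, 2,
    3, 2, 4, 4,
    4, 4, 3, 3),
  nrow = 4,
  byrow = TRUE,
  dimnames = list(paste0("feature_", c("a", "b", "c", "d")), NULL)
)

# Occurrence (assumed definition): fraction of subsets in which a feature
# is ranked within the top `threshold` features.
occurrence_at <- function(threshold) rowMeans(ranks <= threshold)

# Evaluate every possible subset size and keep the one that maximises
# the variance of the occurrence values across features.
candidate_thresholds <- seq_len(nrow(ranks))
occurrence_variance <- vapply(
  candidate_thresholds,
  function(k) var(occurrence_at(k)),
  numeric(1)
)
threshold <- candidate_thresholds[which.max(occurrence_variance)]

# Occurrence at the selected threshold, as used by a stability-type method.
occurrence <- occurrence_at(threshold)
```

A threshold that is too small or too large gives most features the same occurrence value, so the variance criterion favours a subset size that separates features well.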
(optional) Enable parallel processing for the feature selection workflow. Defaults to TRUE. When set to FALSE, this will disable the use of parallel processing while performing feature selection, regardless of the settings of the parallel parameter.
parallel_feature_selection is ignored if parallel=FALSE.
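The interaction between parallel and parallel_feature_selection amounts to a logical AND. A minimal sketch, in which resolve_parallel_feature_selection is a hypothetical helper and not a function exported by familiar:

```r
# Hypothetical helper: parallel processing is used for feature selection
# only if it is enabled both globally and for feature selection.
resolve_parallel_feature_selection <- function(
    parallel,
    parallel_feature_selection = TRUE) {
  isTRUE(parallel) && isTRUE(parallel_feature_selection)
}

resolve_parallel_feature_selection(parallel = TRUE)
resolve_parallel_feature_selection(parallel = TRUE, parallel_feature_selection = FALSE)
resolve_parallel_feature_selection(parallel = FALSE, parallel_feature_selection = TRUE)
```

Only the first call evaluates to TRUE; in the other two cases feature selection runs sequentially.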
Unused arguments.
Wald, R., Khoshgoftaar, T. M., Dittman, D., Awada, W. & Napolitano, A. An extensive comparison of feature ranking aggregation techniques in bioinformatics. In 2012 IEEE 13th International Conference on Information Reuse & Integration (IRI) 377–384 (2012).
Meinshausen, N. & Buehlmann, P. Stability selection. J. R. Stat. Soc. Series B Stat. Methodol. 72, 417–473 (2010).
Haury, A.-C., Gestraud, P. & Vert, J.-P. The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures. PLoS One 6, e28210 (2011).