makeFilter: Create a feature filter.

Description

Creates and registers custom feature filters. Implemented filters can be listed with listFilterMethods. Additional arguments to fun that are specific to a filter are documented in the descriptions below.

The minimum redundancy, maximum relevance filter “mrmr” computes, using the mRMRe package, the mutual information between the target and each individual feature minus the average mutual information between this feature and the previously selected features.

Filter “carscore” determines the “Correlation-Adjusted (marginal) coRelation scores” (short CAR scores). The CAR scores for a set of features are defined as the correlations between the target and the decorrelated features.

Filter “randomForestSRC.rfsrc” computes the importance of random forests fitted in package randomForestSRC. The concrete method is selected via the method parameter. Possible values are permute (default), random, anti, permute.ensemble, random.ensemble, anti.ensemble. See the VIMP section in the docs for rfsrc for details.

Filter “randomForestSRC.var.select” uses the minimal depth variable selection proposed by Ishwaran et al. (2010) (method = "md") or a variable hunting approach (method = "vh" or method = "vh.vimp"). The minimal depth measure is the default.

Permutation importance of random forests fitted in package party. The implementation follows the principle of mean decrease in accuracy used by the randomForest package (see the description of the “randomForest.importance” filter).

Filter “randomForest.importance” makes use of the importance from package randomForest. The importance measure to use is selected via the method parameter:

oob.accuracy: Permutation of Out of Bag (OOB) data.
node.impurity: Total decrease in node impurity.

The Pearson correlation between each feature and the target is used as an indicator of feature importance. Rows with NA values are not taken into consideration.

The Spearman correlation between each feature and the target is used as an indicator of feature importance. Rows with NA values are not taken into consideration.

Filter “information.gain” uses the entropy-based information gain between each feature and target individually as an importance measure.

Filter “gain.ratio” uses the entropy-based information gain ratio between each feature and target individually as an importance measure.

Filter “symmetrical.uncertainty” uses the entropy-based symmetrical uncertainty between each feature and target individually as an importance measure.

The chi-square test is a statistical test of independence between two variables. Filter “chi.squared” applies this test in the following way: for each feature, the chi-square test statistic is computed to check whether there is a dependency between the feature and the target variable. Low values of the test statistic indicate a poor relationship. High values, i.e., high dependency, identify a feature as more important.

Filter “relief” is based on the feature selection algorithm “ReliefF” by Kononenko et al., which is a generalization of the “Relief” algorithm originally proposed by Kira and Rendell. Feature weights are initialized with zeros. Then, for each instance, sample.size instances are sampled, neighbours.count nearest-hit and nearest-miss neighbours are computed, and the weight vector for each feature is updated based on these values.

Filter “oneR” makes use of a simple “One-Rule” (OneR) learner to determine feature importance. For this purpose the OneR learner generates one simple association rule for each feature in the data individually and computes the total error. The lower the error value, the more important the corresponding feature.

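
The correlation and chi-square scores described above can be illustrated with a few lines of base R. This is only a sketch under stated assumptions, not the code used by the filters: the helper names cor_score and chisq_score and the toy data are hypothetical, taking the absolute value of the correlation is an assumption of this sketch, and the chi-square score here is the plain Pearson statistic without continuity correction.

```r
# Toy regression data; x1 has one missing value.
d <- data.frame(
  x1 = c(1, 2, 3, 4, 5, 6, 7, NA),
  x2 = c(2, 1, 4, 3, 6, 5, 8, 7),
  y  = c(1.1, 2.0, 2.9, 4.2, 5.1, 5.8, 7.2, 8.0)
)

# Hypothetical helper: absolute correlation between one feature and the
# target. use = "complete.obs" drops rows in which either value is NA.
cor_score <- function(x, y, method = c("pearson", "spearman")) {
  method <- match.arg(method)
  abs(cor(x, y, use = "complete.obs", method = method))
}

features <- d[c("x1", "x2")]
pearson  <- sapply(features, cor_score, y = d$y, method = "pearson")
spearman <- sapply(features, cor_score, y = d$y, method = "spearman")

# Hypothetical helper: chi-square statistic of the feature/target
# contingency table (assumption: no continuity correction).
chisq_score <- function(x, y) {
  tab <- table(x, y)
  unname(suppressWarnings(chisq.test(tab, correct = FALSE))$statistic)
}

# Toy classification data: one feature that determines the class and
# one that is only weakly related to it.
cls       <- factor(c("p", "p", "p", "n", "n", "n"))
dependent <- factor(c("a", "a", "a", "b", "b", "b"))
unrelated <- factor(c("a", "b", "a", "b", "a", "b"))
chi <- c(dependent = chisq_score(dependent, cls),
         unrelated = chisq_score(unrelated, cls))
```

In all three vectors (pearson, spearman, chi) a larger value marks a feature as more important, which matches the convention that higher scores mean higher importance.
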
The “univariate.model.score” feature filter resamples an mlr learner specified via perf.learner for each feature individually, with a decision tree from package rpart being the default learner. Further parameters are the resampling strategy perf.resampling and the performance measure perf.measure.

Filter “anova.test” is based on the Analysis of Variance (ANOVA) between feature and class. The value of the F-statistic is used as a measure of feature importance.

Filter “kruskal.test” applies a Kruskal-Wallis rank sum test of the null hypothesis that the location parameters of the distribution of a feature are the same in each class and considers the test statistic as a variable importance measure: if the location parameters do not differ in at least one case, i.e., the null hypothesis cannot be rejected, there is little evidence that the corresponding feature is suitable for classification.

Simple filter based on the variance of the features independent of each other. Features with higher variance are considered more important than features with low variance.

Filter “permutation.importance” computes a loss function between predictions made by a learner before and after a feature is permuted. Special arguments to the filter function are imp.learner, a [Learner or character(1)] which specifies the learner to use when computing the permutation importance, contrast, a function which takes two numeric vectors and returns one (default is the difference), aggregation, a function which takes a numeric and returns a numeric(1) (default is the mean), nmc, an integer(1), and replace, a logical(1) which determines whether the feature being permuted is sampled with or without replacement.
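
The idea behind permutation importance can be sketched in base R as follows. This is an illustration under several assumptions, not the implementation behind the “permutation.importance” filter: the function perm_importance and its fit and loss arguments are hypothetical, a linear model stands in for the learner, the loss is evaluated on the training data, nmc is taken to be the number of repeated permutations, and contrast is applied to scalar losses rather than to vectors.

```r
# Hypothetical helper: permutation importance of every feature in `data`.
perm_importance <- function(data, target, fit, loss,
                            contrast = function(x, y) x - y,
                            aggregation = mean,
                            nmc = 20L, replace = FALSE) {
  feature_names <- setdiff(names(data), target)
  model <- fit(data)
  # Loss of the model on the unpermuted data.
  base_loss <- loss(data[[target]], predict(model, newdata = data))
  sapply(feature_names, function(f) {
    diffs <- vapply(seq_len(nmc), function(i) {
      permuted <- data
      # Shuffle (or resample, if replace = TRUE) a single feature column.
      permuted[[f]] <- sample(data[[f]], replace = replace)
      perm_loss <- loss(data[[target]], predict(model, newdata = permuted))
      contrast(perm_loss, base_loss)
    }, numeric(1))
    # Collapse the nmc contrasts into one score per feature.
    aggregation(diffs)
  })
}

# Toy data: y depends on `signal` only; `noise` is irrelevant.
set.seed(42)
n <- 200
d <- data.frame(signal = rnorm(n), noise = rnorm(n))
d$y <- 3 * d$signal + rnorm(n, sd = 0.5)

mse <- function(truth, response) mean((truth - response)^2)

imp <- perm_importance(d, "y",
                       fit  = function(df) lm(y ~ ., data = df),
                       loss = mse)
```

With this setup, permuting signal destroys most of the predictive information and raises the loss substantially, while permuting noise barely changes it, so imp ranks signal above noise.
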

Usage

makeFilter(name, desc, pkg, supported.tasks, supported.features, fun)
rf.importance
rf.min.depth
univariate

Arguments

name

[character(1)] Identifier for the filter.

desc

[character(1)] Short description of the filter.

pkg

[character(1)] Source package where the filter is implemented.

supported.tasks

[character] Task types supported.

supported.features

[character] Feature types supported.

fun

[function(task, nselect, ...)] Function which takes a task and returns a named numeric vector of scores, one score for each feature of task. Higher scores mean higher importance of the feature. At least nselect features must be calculated; the remaining ones may be set to NA or omitted, and thus will not be selected. The original order will be restored if necessary.
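
A custom filter following this contract might look like the sketch below. Everything specific in it is an assumption made for illustration: the filter name "toy.variance", the helper score_features, the values passed to supported.tasks and supported.features, and the use of getTaskData with target.extra = TRUE to obtain the feature columns. The makeFilter call is only executed if the mlr package is installed.

```r
# Hypothetical scoring helper: variance of each numeric feature. Only
# the nselect best scores are kept; the others are set to NA, which the
# contract above allows.
score_features <- function(features, nselect) {
  scores <- vapply(features, var, numeric(1), na.rm = TRUE)
  n_keep <- min(nselect, length(scores))
  keep <- order(scores, decreasing = TRUE)[seq_len(n_keep)]
  scores[!(seq_along(scores) %in% keep)] <- NA_real_
  scores  # named numeric vector, one entry per feature
}

# Small demonstration on plain data.
toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30), c = c(5, 5, 5))
toy_scores <- score_features(toy, nselect = 2)

# Register the filter only when mlr is available.
if (requireNamespace("mlr", quietly = TRUE)) {
  toy_filter <- mlr::makeFilter(
    name = "toy.variance",                      # hypothetical name
    desc = "Toy variance filter (illustration only)",
    pkg = "stats",                              # package providing var()
    supported.tasks = c("classif", "regr"),     # assumed type names
    supported.features = "numerics",            # assumed type name
    fun = function(task, nselect, ...) {
      features <- mlr::getTaskData(task, target.extra = TRUE)$data
      score_features(features, nselect)
    }
  )
}
```

Keeping the scoring logic in a helper that works on a plain data frame makes it easy to check the returned vector (names, NA handling, ordering) without constructing a task first.
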

Value

Object of class “Filter”.

Format

An object of class Filter of length 6.

References

Kira, Kenji and Rendell, Larry (1992). The Feature Selection Problem: Traditional Methods and a New Algorithm. AAAI-92 Proceedings.

Kononenko, Igor et al. (1997). Overcoming the myopia of inductive learning algorithms with RELIEFF. Applied Intelligence, 7(1), pp. 39-55.