attrEval: Attribute evaluation

Description

The method evaluates the quality of the features/attributes/independent variables specified by the formula with the selected heuristic method. For classification problems the available feature evaluation algorithms include several variants of Relief and ReliefF (ReliefF, cost-sensitive ReliefF, ...), gain ratio, Gini index, MDL, DKM, information gain, and others. For regression problems they include RReliefF, MSEofMean, MSEofModel, MAEofModel, and others. Parallel execution on several cores is supported for speedup.

Usage

attrEval(formula, data, estimator, costMatrix = NULL, 
           outputNumericSplits=FALSE, ...)

Arguments

formula

Either a formula specifying the attributes to be evaluated and the target variable, or a name of target variable, or an index of target variable.

data

Data frame with evaluation data.

estimator

The name of the evaluation method.

costMatrix

Optional cost matrix used with certain estimators.

outputNumericSplits

Controls whether the output also contains the best split point for numeric attributes. This is only sensible for impurity-based estimators (like Gini, MDL, and gain ratio in classification and MSEofMean in regression). Additionally, the default value of the parameter binaryEvaluateNumericAttributes=TRUE must not be modified. If outputNumericSplits=TRUE, the output is a list instead of a vector; see the description of the returned value.

…

Additional options used by specific evaluation methods as described in helpCore.

Value

The method returns a vector of evaluations for the features in the order specified by the formula. If the parameter outputNumericSplits=TRUE, the method returns a list with two components: attrEval and splitPointNum. The attrEval component contains a vector of evaluations for the features in the order specified by the formula. The splitPointNum component contains the split points of numeric attributes which produced the given attribute evaluation scores.
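As the description of outputNumericSplits indicates, setting it to TRUE yields a list rather than a plain vector. A minimal sketch of inspecting both components follows; the choice of the iris data and of the Gini estimator is only an assumption for illustration.

```r
library(CORElearn)

# Gini is an impurity-based estimator, so split points are meaningful here.
res <- attrEval(Species ~ ., iris, estimator = "Gini",
                outputNumericSplits = TRUE)

# With outputNumericSplits = TRUE the result is a list, not a plain vector.
print(res$attrEval)       # evaluation score of each attribute
print(res$splitPointNum)  # split points of the numeric attributes
```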

Details

The parameter formula can be interpreted in three ways, where the formula interface is the most elegant one, but inefficient and inappropriate for large data sets. See also examples below. As formula one can specify:

an object of class formula: used as a mechanism to select features (attributes) and the prediction variable (class). Only simple terms can be used; interactions expressed in formula syntax are not supported. The simplest way is to specify just the response variable, e.g., class ~ . (in this case all other attributes in the data set are evaluated). Note that the formula interface is not appropriate for data sets with a large number of variables.
a character vector: specifying the name of the target variable; all the other columns in the data frame data are used as predictors.
an integer: specifying the index of the target variable in the data frame data; all the other columns are used as predictors.
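The three ways of specifying the target can be sketched side by side. The example assumes the iris data, where Species is the target, and uses the GainRatio estimator purely for illustration.

```r
library(CORElearn)

# 1. formula interface: convenient, but less efficient on large data sets
estFormula <- attrEval(Species ~ ., iris, estimator = "GainRatio")

# 2. name of the target variable; all other columns are predictors
estName <- attrEval("Species", iris, estimator = "GainRatio")

# 3. column index of the target variable (Species is the 5th column of iris)
estIndex <- attrEval(which(names(iris) == "Species"), iris,
                     estimator = "GainRatio")

print(estFormula)
print(estName)
print(estIndex)
```

For data sets with many columns, the second and third forms avoid the overhead of the formula interface.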

The optional parameter costMatrix can provide a nonuniform cost matrix to certain cost-sensitive measures (ReliefFexpC, ReliefFavgC, ReliefFpe, ReliefFpa, ReliefFsmp, GainRatioCost, DKMcost, ReliefKukar, and MDLsmp). For other measures this parameter is ignored. The format of the matrix is costMatrix(true class, predicted class). By default uniform costs are assumed, i.e., costMatrix(i, i) = 0, and costMatrix(i, j) = 1, for i not equal to j.
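A possible way to build such a matrix is sketched below, with rows as true classes and columns as predicted classes. The cost value of 5 and the choice of the GainRatioCost estimator on the iris data are assumptions made only for illustration.

```r
library(CORElearn)

classes <- levels(iris$Species)

# Uniform costs: 0 on the diagonal, 1 elsewhere (the documented default).
costs <- 1 - diag(length(classes))
dimnames(costs) <- list(classes, classes)

# Illustrative assumption: predicting versicolor for a true virginica costs 5.
costs["virginica", "versicolor"] <- 5

estCost <- attrEval(Species ~ ., iris, estimator = "GainRatioCost",
                    costMatrix = costs)
print(estCost)
```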

The estimator parameter selects the evaluation heuristic. For classification problems it must be one of the names returned by infoCore(what="attrEval"), and for regression problems it must be one of the names returned by infoCore(what="attrEvalReg"). The majority of these feature evaluation measures are described in the references given below; only a short description is given here. For classification problems they are:

"ReliefFequalK": ReliefF algorithm where k nearest instances have equal weight.
"ReliefFexpRank": ReliefF algorithm where k nearest instances have weight exponentially decreasing with increasing rank. Rank of nearest instance is determined by the increasing (Manhattan) distance from the selected instance. This is a default choice for methods taking conditional dependencies among the attributes into account.
"ReliefFbestK": ReliefF algorithm where all possible k (representing k nearest instances) are tested and for each feature the highest score is returned. Nearest instances have equal weights.
"Relief": Original algorithm of Kira and Rendell (1991) working on two class problems.
"InfGain": Information gain.
"GainRatio": Gain ratio, which is normalized information gain to prevent bias to multi-valued attributes.
"MDL": Acronym for Minimum Description Length; the method introduced in (Kononenko, 1995) with favorable bias for multi-valued and multi-class problems. Might be the best method among those not taking conditional dependencies into account.
"Gini": Gini-index.
"MyopicReliefF": Myopic version of ReliefF resulting from assumption of no local dependencies and attribute dependencies upon class.
"Accuracy": Accuracy of resulting split.
"ReliefFmerit": ReliefF algorithm where for each random instance the merit of each attribute is normalized by the sum of differences in all attributes.
"ReliefFdistance": ReliefF algorithm where k nearest instances are weighted directly by their inverse distance from the selected instance. Using ranks instead of distances, as in ReliefFexpRank, is usually more effective.
"ReliefFsqrDistance": ReliefF algorithm where k nearest instances are weighted by their inverse squared distance from the selected instance.
"DKM": Measure named after Dietterich, Kearns, and Mansour who proposed it in 1996.
"ReliefFexpC": Cost-sensitive ReliefF algorithm with expected costs.
"ReliefFavgC": Cost-sensitive ReliefF algorithm with average costs.
"ReliefFpe": Cost-sensitive ReliefF algorithm with expected probability.
"ReliefFpa": Cost-sensitive ReliefF algorithm with average probability.
"ReliefFsmp": Cost-sensitive ReliefF algorithm with cost sensitive sampling.
"GainRatioCost": Cost-sensitive variant of GainRatio.
"DKMcost": Cost-sensitive variant of DKM.
"ReliefKukar": Cost-sensitive Relief algorithm introduced by Kukar in 1999.
"MDLsmp": Cost-sensitive variant of MDL where costs are introduced through sampling.
"ImpurityEuclid": Euclidean distance as impurity function on within node class distributions.
"ImpurityHellinger": Hellinger distance as impurity function on within node class distributions.
"UniformDKM": Dietterich-Kearns-Mansour (DKM) with uniform priors.
"UniformGini": Gini index with uniform priors.
"UniformInf": Information gain with uniform priors.
"UniformAccuracy": Accuracy with uniform priors.
"EqualDKM": Dietterich-Kearns-Mansour (DKM) with equal weights for splits.
"EqualGini": Gini index with equal weights for splits.
"EqualInf": Information gain with equal weights for splits.
"EqualHellinger": Two equally weighted splits based Hellinger distance.
"DistHellinger": Hellinger distance between class distributions in branches.
"DistAUC": AUC distance between splits.
"DistAngle": Cosine of angular distance between splits.
"DistEuclid": Euclidean distance between splits.

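To get a feeling for how the classification measures listed above differ, several of them can be run on the same data. The sketch below is only an illustration; the selection of estimators and the use of the iris data are assumptions, not a recommendation.

```r
library(CORElearn)

estimators <- c("InfGain", "GainRatio", "Gini", "MDL", "ReliefFexpRank")

# One column per estimator, one row per attribute of iris.
scores <- sapply(estimators, function(est) {
  attrEval(Species ~ ., iris, estimator = est)
})
print(round(scores, 3))

# The estimators need not share a common scale, so it is safer to
# compare attribute rankings (1 = best) than raw scores.
print(apply(-scores, 2, rank))
```
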
For regression problems the implemented measures are:

"RReliefFequalK": RReliefF algorithm where k nearest instances have equal weight.
"RReliefFexpRank": RReliefF algorithm where k nearest instances have weight exponentially decreasing with increasing rank. Rank of nearest instance is determined by the increasing (Manhattan) distance from the selected instance. This is a default choice for methods taking conditional dependencies among the attributes into account.
"RReliefFbestK": RReliefF algorithm where all possible k (representing k nearest instances) are tested and for each feature the highest score is returned. Nearest instances have equal weights.
"RReliefFwithMSE": A combination of RReliefF and MSE algorithms.
"MSEofMean": Mean Squared Error as heuristic used to measure error by mean predicted value after split on the feature.
"MSEofModel": Mean Squared Error of an arbitrary model used on splits resulting from the feature. The model is chosen with parameter modelTypeReg.
"MAEofModel": Mean Absolute Error of an arbitrary model used on splits resulting from the feature. The model is chosen with parameter modelTypeReg. If we use median as the model, we get a robust equivalent to MSEofMean.
"RReliefFdistance": RReliefF algorithm where k nearest instances are weighted directly by their inverse distance from the selected instance. Using ranks instead of distances, as in RReliefFexpRank, is usually more effective.
"RReliefFsqrDistance": RReliefF algorithm where k nearest instances are weighted by their inverse squared distance from the selected instance.

Some additional parameters, passed through the … argument, are used by specific evaluation heuristics. Their list and short descriptions are available by calling helpCore; see its section on attribute evaluation.
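As a rough illustration of the regression measures listed above, the sketch below evaluates attributes for a numeric target. The use of the mtcars data set with mpg as the target is an assumption made only for this example.

```r
library(CORElearn)

# List the names accepted by the estimator argument for regression.
print(infoCore(what = "attrEvalReg"))

# mtcars ships with base R; mpg serves as the numeric target here.
estMSE <- attrEval(mpg ~ ., mtcars, estimator = "MSEofMean")
estRRF <- attrEval(mpg ~ ., mtcars, estimator = "RReliefFexpRank")

print(estMSE)
print(estRRF)
```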

The attributes can also be evaluated via random forest out-of-bag set with function rfAttrEval.

Evaluation and visualization of ordered attributes is covered in function ordEval.

References

Marko Robnik-Sikonja, Igor Kononenko: Theoretical and Empirical Analysis of ReliefF and RReliefF. Machine Learning Journal, 53:23-69, 2003

Marko Robnik-Sikonja: Experiments with Cost-sensitive Feature Evaluation. In Lavrac et al.(eds): Machine Learning, Proceedings of ECML 2003, Springer, Berlin, 2003, pp. 325-336

Igor Kononenko: On Biases in Estimating Multi-Valued Attributes. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI'95), pp. 1034-1040, 1995

Some of these references are available also from http://lkm.fri.uni-lj.si/rmarko/papers/

Examples

# NOT RUN {
# use iris data

# run method ReliefF with exponential rank distance  
estReliefF <- attrEval(Species ~ ., iris, 
                       estimator="ReliefFexpRank", ReliefIterations=30)
print(estReliefF)

# alternatively and more appropriate for large data sets 
# one can specify just the target variable
# estReliefF <- attrEval("Species", iris, estimator="ReliefFexpRank",
#                        ReliefIterations=30)

# print all available estimators
infoCore(what="attrEval")
# }
