ProcessData: Select the subset of features

Description

This auxiliary function discretizes the numerical features and is called from several of the feature selection functions. The discretization options are minimal description length (MDL), equal frequency, and equal interval width. The result is a list with two fields: the processed dataset and the column numbers of the features. When the input parameter “flag”=TRUE, the second field contains the column numbers of the features that have more than one interval after discretization.
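To make two of the named options concrete, the following is a minimal base-R sketch of equal interval width and equal frequency binning. It is an illustration only and not the code ProcessData runs internally; the vector `x` and the number of intervals `k` are hypothetical, and MDL discretization is not shown because it also depends on the class labels.

```r
# Illustrative only: base-R versions of two of the named methods,
# not the internal implementation of ProcessData.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)  # hypothetical numeric feature
k <- 3                                   # hypothetical number of intervals

# Equal interval width: k bins of the same width across range(x)
equal_width <- cut(x, breaks = k, include.lowest = TRUE)

# Equal frequency: breakpoints at quantiles, so bins hold similar counts
q_breaks <- quantile(x, probs = seq(0, 1, length.out = k + 1))
equal_freq <- cut(x, breaks = unique(q_breaks), include.lowest = TRUE)

# Compare how the cases are distributed over the intervals
table(equal_width)
table(equal_freq)
```

With a skewed feature such as this one, equal-width bins can leave some intervals nearly empty, while equal-frequency bins keep the counts balanced.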

Usage

ProcessData(matrix,disc.method,attrs.nominal,flag=FALSE)

Arguments

matrix

a dataset: a matrix of feature values for several cases, with the class labels in the last column. Class labels may be numerical or character values. The maximal number of classes is ten.

disc.method

a method used for feature discretization. The discretization options include minimal description length (MDL), equal frequency, and equal interval width methods.

attrs.nominal

a numerical vector containing the column numbers of the nominal features selected for the analysis.

flag

a logical value. If TRUE, the output list contains the processed dataset with the features that have more than one interval after discretization, together with their names. If FALSE, the processed dataset with all the features is returned.

Value

The data may contain a reasonable number of missing values, which must first be imputed with one of the methods in the function input_miss.

The returned list consists of the following fields:

a processed dataset

sel.feature

a numeric vector with the column numbers of the features that have more than one interval value (when “flag”=TRUE). If “flag”=FALSE, it returns all the column numbers of the dataset.
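The selection described for “flag”=TRUE can be pictured with a small base-R sketch. The data frame `disc_data` and the helper names are hypothetical, and leaving the class column out of the result is an assumption made for this illustration, not something this page states.

```r
# Hypothetical discretized dataset: two features plus a class label
disc_data <- data.frame(
  f1 = factor(c("low", "high", "low", "high")),
  f2 = factor(c("all", "all", "all", "all")),  # collapsed to a single interval
  class = factor(c("a", "b", "a", "b"))
)

# Assumption: the last column is the class label and is not a feature
feature_cols <- seq_len(ncol(disc_data) - 1)

# Count the distinct intervals observed in each feature column
n_intervals <- vapply(disc_data[feature_cols],
                      function(col) length(unique(col)), integer(1))

# Column numbers of the features with more than one interval
informative <- feature_cols[n_intervals > 1]
informative
```

A feature that collapses to a single interval takes the same value for every case, so it cannot help separate the classes.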

Details

The main job of this auxiliary function is to discretize the numerical features using one of the discretization methods. See the “Value” section of this page for more details.

Data can be provided in matrix form, where the rows correspond to cases with feature values and a class label. The columns contain the values of individual features, and the last column must contain the class labels. The maximal number of class labels equals 10. The class label and all the nominal features must be defined as factors.
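The layout requirements above can be checked before calling the function. The sketch below uses a hypothetical toy data frame (`toy`, with made-up column names) and only base R; it shows one way to convert the nominal features and the class label to factors and to verify the ten-class limit.

```r
# Hypothetical toy data: two numeric features, one nominal feature,
# and the class label in the last column
toy <- data.frame(
  height = c(1.2, 3.4, 2.2, 5.1),
  weight = c(10, 12, 9, 15),
  colour = c("red", "blue", "red", "green"),
  label  = c("a", "b", "a", "b"),
  stringsAsFactors = FALSE
)

attrs.nominal <- 3      # column number of the nominal feature
class_col <- ncol(toy)  # class labels sit in the last column

# Convert the nominal features and the class label to factors
for (j in c(attrs.nominal, class_col)) {
  toy[[j]] <- as.factor(toy[[j]])
}

# Check the documented limit of ten classes
stopifnot(nlevels(toy[[class_col]]) <= 10)
str(toy)
```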

References

H. Liu, F. Hussain, C. L. Tan, and M. Dash, "Discretization: An enabling technique," Data Mining and Knowledge Discovery, Vol. 6, No. 4, 2002, pp. 393-423.

Examples


# NOT RUN {
# example for dataset without missing values
data(data_test)

# class label must be factor
data_test[, ncol(data_test)] <- as.factor(data_test[, ncol(data_test)])

disc <- "MDL"
attrs.nominal <- numeric()
flag <- FALSE
out <- ProcessData(matrix = data_test, disc.method = disc,
                   attrs.nominal = attrs.nominal, flag = flag)
# }
