Biocomb-package: Tools for Data Mining

Description

Functions for data analysis with an emphasis on biological data. They handle both numerical and nominal features. Biocomb includes functions for several feature ranking and feature selection algorithms. Feature ranking is based on several criteria, including information gain, symmetrical uncertainty, and the chi-squared statistic. The feature selection algorithms include the Chi2 algorithm (based on the chi-squared test), the fast correlation-based filter algorithm, a feature weighting algorithm (RelieF), a sequential forward search algorithm (CorrSF), and the correlation-based feature selection algorithm (CFS). The package includes several classification algorithms with embedded feature selection and validation schemes. It also includes functions for calculating feature AUC (Area Under the ROC Curve) values with statistical significance analysis, and for calculating Area Above the RCC (AAC) values. For two- and multi-class problems, functions are available for HUM (hypervolume under manifold) calculation and for constructing 2D- and 3D-ROC curves. Relative Cost Curves (RCC) are provided to estimate classifier performance under unequal misclassification costs. Biocomb also has a dedicated function for handling missing values, which offers different imputation schemes.

Arguments

Function	Description

`select.process`	Perform feature ranking or feature selection
`compute.aucs`	Calculate the AUC values
`select.inf.gain`	Calculate the Information Gain criterion
`select.inf.symm`	Calculate the Symmetrical uncertainty criterion
`select.inf.chi2`	Calculate the chi-squared statistic
`select.fast.filter`	Select the feature subset with fast correlation-based filter method
`chi2.algorithm`	Select the feature subset with Chi2 discretization algorithm
`select.forward.Corr`	Select the feature subset with forward search strategy and correlation measure
`select.forward.wrapper`	Select the feature subset with a wrapper method
`ProcessData`	Perform the discretization of the numerical features
`classifier.loop`	Perform the classification with the embedded feature selection
`pauc`	Calculate the p-values of the statistical significance of the two-class difference
`pauclog`	Calculate the logarithm of p-values of the statistical significance
`compute.auc.permutation`	Compute the p-value of the significance of the AUC using the permutation test
`compute.auc.random`	Compute the p-value of the significance of the AUC using random sample generation
`plotRoc.curves`	Plot the ROC curve in 2D-space
`CalculateHUM_seq`	Calculate a maximal HUM value and the corresponding permutation of class labels
`CalculateHUM_Ex`	Calculate the HUM values with exhaustive search for a specified number of class labels
`CalculateHUM_ROC`	Construct and plot the 2D- or 3D-ROC curve
`CalcGene`	Compute the HUM value for one feature
`CalcROC`	Compute the point coordinates to plot the 2D- or 3D-ROC curve
`CalculateHUM_Plot`	Plot the 2D-ROC curve
`Calculate3D`	Plot the 3D-ROC curve
`cost.curve`	Plot the RCC and calculate the AAC for unequal misclassification costs
`input_miss`	Perform the missing values imputation
`generate.data.miss`	Generate the dataset with missing values
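
Several entries in the table above compute per-feature ranking criteria such as information gain and symmetrical uncertainty. As a rough illustration of what these two criteria measure, here is a minimal base-R sketch for a single feature that has already been discretized. It is not Biocomb's implementation, and the helper names `entropy`, `info_gain` and `symm_uncertainty` are hypothetical and introduced only for this example.

```r
# Illustrative base-R sketch; NOT the code used inside Biocomb.
# All function names below are hypothetical helpers.

# Shannon entropy (in bits) of a discrete vector
entropy <- function(v) {
  p <- table(v) / length(v)
  p <- p[p > 0]               # ignore empty categories
  -sum(p * log2(p))
}

# Information gain: H(class) - H(class | feature)
info_gain <- function(feature, class) {
  groups <- split(class, feature, drop = TRUE)
  h_cond <- sum(sapply(groups, function(cl) {
    length(cl) / length(class) * entropy(cl)
  }))
  entropy(class) - h_cond
}

# Symmetrical uncertainty: 2 * IG / (H(feature) + H(class)), between 0 and 1
symm_uncertainty <- function(feature, class) {
  2 * info_gain(feature, class) / (entropy(feature) + entropy(class))
}

# Toy data: one feature that determines the class, one that is unrelated
class_label   <- c("a", "a", "b", "b")
informative   <- c("lo", "lo", "hi", "hi")
uninformative <- c("lo", "hi", "lo", "hi")

ig_informative   <- info_gain(informative, class_label)
ig_uninformative <- info_gain(uninformative, class_label)
su_informative   <- symm_uncertainty(informative, class_label)
```

A numerical feature would need to be discretized before these formulas apply, which is the role the table assigns to `ProcessData`.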

Dataset

This package comes with two simulated datasets and a real dataset of leukemia patients with 72 cases and 101 features. The last feature is the class (disease labels).

Installing and using

To install this package, make sure you are connected to the internet and run the following command at the R prompt:

    install.packages("Biocomb")

To load the package in R:

    library(Biocomb)

Details

Package:	Biocomb
Type:	Package
Version:	0.3
Date:	2016-08-14
License:	GPL (>= 3)

The Biocomb package provides functions for two stages of the data mining process: feature selection and classification.

One of the main functions of Biocomb is select.process. It provides the infrastructure for feature ranking or feature selection on a data set with two or more class labels. The functions compute.aucs, select.inf.gain, select.inf.symm and select.inf.chi2 each calculate a different criterion measure for every feature in the dataset. The function select.fast.filter implements the fast correlation-based filter method. The function chi2.algorithm performs the Chi2 discretization algorithm with feature selection. The function select.forward.Corr performs a sequential forward feature search according to the correlation measure. The function select.forward.wrapper implements the wrapper feature selection method with a sequential forward search strategy. The auxiliary function ProcessData discretizes the numerical features and is called from several of the feature selection functions.

The second main function of Biocomb is classifier.loop, which provides the infrastructure for building classifiers with embedded feature selection and for validating their performance under different schemes.

The functions compute.aucs, compute.auc.permutation, pauc, pauclog and compute.auc.random calculate feature AUC (Area Under the ROC Curve) values with statistical significance analysis. The function plotRoc.curves constructs the ROC curve in 2D-space. The function cost.curve plots the RCC and calculates the corresponding AAC to estimate classifier performance under unequal misclassification costs. The functions CalculateHUM_seq, CalculateHUM_ROC and CalculateHUM_Plot calculate HUM values and construct 2D- and 3D-ROC curves.

The function input_miss handles missing values and implements two imputation methods. The function generate.data.miss generates a dataset with missing values from an input dataset, so that algorithms designed to deal with missing values can be tested.
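
To make the idea of an AUC value with a permutation-based significance test more concrete, the following base-R sketch computes the AUC of one numeric feature for a two-class problem and estimates a p-value by shuffling the class labels. It is only an illustration under stated assumptions: it does not reproduce the internals of compute.aucs or compute.auc.permutation, the helper names `auc_feature` and `auc_permutation_pvalue` are hypothetical, and the two-sided comparison against 0.5 is one possible choice of test statistic.

```r
# Illustrative base-R sketch; NOT the code used inside Biocomb.
# Both function names are hypothetical helpers.

# AUC of a numeric feature, computed from the Mann-Whitney rank-sum statistic
auc_feature <- function(x, y, positive) {
  is_pos <- y == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  r <- rank(x)  # average ranks are used for ties
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Permutation p-value: fraction of label shuffles whose AUC is at least as
# far from 0.5 as the observed AUC (with the usual +1 correction)
auc_permutation_pvalue <- function(x, y, positive, n_perm = 1000) {
  observed <- auc_feature(x, y, positive)
  permuted <- replicate(n_perm, auc_feature(x, sample(y), positive))
  extreme <- sum(abs(permuted - 0.5) >= abs(observed - 0.5))
  list(auc = observed, p.value = (extreme + 1) / (n_perm + 1))
}

# Simulated two-class data in which class "b" has larger feature values
set.seed(1)
y <- factor(rep(c("a", "b"), each = 20))
x <- c(rnorm(20, mean = 0), rnorm(20, mean = 2))
res <- auc_permutation_pvalue(x, y, positive = "b")
```

The smallest p-value this scheme can report is 1 / (n_perm + 1), so the number of permutations limits the resolution of the test.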

References

H. Liu and L. Yu. "Toward Integrating Feature Selection Algorithms for Classification and Clustering", IEEE Trans. on Knowledge and Data Engineering, 17(4), 491-502, 2005.

L. Yu and H. Liu. "Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution". In Proceedings of The Twentieth International Conference on Machine Learning (ICML-03), Washington, D.C., pp. 856-863, August 21-24, 2003.

Y. Wang, I.V. Tetko, M.A. Hall, E. Frank, A. Facius, K.F.X. Mayer, and H.W. Mewes. "Gene Selection from Microarray Data for Cancer Classification: A Machine Learning Approach", Computational Biology and Chemistry, 29(1), 37-46, 2005.

O. Montvida and F. Klawonn. "Relative cost curves: An alternative to AUC and an extension to 3-class problems", Kybernetika, 50(5), 647-660, 2014.

Examples


# NOT RUN {
library(Biocomb)
data(data_test)
# The class label (last column) must be a factor
data_test[, ncol(data_test)] <- as.factor(data_test[, ncol(data_test)])

# Perform feature selection with the fast correlation-based filter algorithm
disc <- "MDL"
threshold <- 0.2
attrs.nominal <- numeric()
out <- select.fast.filter(data_test, disc.method = disc, threshold = threshold,
                          attrs.nominal = attrs.nominal)

# Perform classification with cross-validation of the results
out <- classifier.loop(data_test, classifiers = c("svm", "lda", "rf"),
                       feature.selection = "auc", flag.feature = FALSE,
                       method.cross = "fold-crossval")
# }
# NOT RUN {
# Plot the relative cost curve (RCC) and calculate the AAC for one feature
data(data_test)
# The class label (last column) must be a factor so that levels() works
data_test[, ncol(data_test)] <- as.factor(data_test[, ncol(data_test)])

# Axis limits of the plot
xllim <- -4
xulim <- 4
yllim <- 30
yulim <- 110

attrs.no <- 1   # index of the feature to plot
pos.Class <- levels(data_test[, ncol(data_test)])[1]   # positive class label
add.legend <- TRUE

aacs <- rep(0, length(attrs.no))
color <- seq_along(attrs.no)

out <- cost.curve(data_test, attrs.no, pos.Class, col = color[1], add = FALSE,
                  xlim = c(xllim, xulim), ylim = c(yllim, yulim))
# }
