Package: HiDimDA
Type: Package
Version: 0.2-7
Date: 2024-10-06
License: GPL-3
LazyLoad: yes
LazyData: yes
HiDimDA is a package for High-Dimensional Discriminant Analysis aimed at problems with many variables, possibly many more
than the number of available observations. Its core consists of the four Linear Discriminant Analysis routines:
Dlda: Diagonal Linear Discriminant Analysis
Slda: Shrunken Linear Discriminant Analysis
Mlda: Maximum-uncertainty Linear Discriminant Analysis
RFlda: Factor-model Linear Discriminant Analysis
and the variable selection routine:
SelectV: High-Dimensional variable selection for supervised classification
that selects the variables to be used in a Discriminant classification rule by
ranking them according to two-sample t-scores (problems with two groups)
or ANOVA F-scores (problems with more than two groups), and discarding those
with scores below a threshold defined by the Higher Criticism (HC) approach
of Donoho and Jin (2008), the Expanded Higher Criticism scheme
proposed by Duarte Silva (2011), False Discovery Rate (Fdr) control as suggested by
Benjamini and Hochberg (1995), the FAIR approach of Fan and Fan (2008), or simply by
fixing the number of retained variables to some pre-defined constant.
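As an illustration, the sketch below ranks the variables of a simulated two-group problem. It assumes that ‘SelectV’ takes a data matrix and a grouping factor as its first two arguments, and that the returned list reports the number and the indices of the kept variables through components named ‘nvkpt’ and ‘vkptInd’; both assumptions should be checked against the package manual.

library(HiDimDA)

## Simulated two-group data: 30 observations on 500 variables,
## with a group-mean shift on the first 10 variables.
set.seed(1)
n <- 30; p <- 500
grp <- factor(rep(c("A", "B"), each = n/2))
X <- matrix(rnorm(n*p), n, p)
X[grp == "B", 1:10] <- X[grp == "B", 1:10] + 1.5

## Rank variables by two-sample t-scores and keep those passing
## the default threshold (assumed to be the Expanded Higher Criticism scheme).
sel <- SelectV(X, grp)
sel$nvkpt    # number of variables retained (assumed component name)
sel$vkptInd  # indices of the retained variables (assumed component name)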
All four discriminant routines, ‘Dlda’, ‘Slda’, ‘Mlda’ and ‘RFlda’, compute Linear
Discriminant Functions, by default after a preliminary variable selection step, based on alternative estimators of
the within-groups covariance matrix that lead to reliable allocation rules in problems where the number of selected
variables is close to, or larger than, the number of available observations.
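For example, with the simulated data above, a diagonal discriminant rule with the default preliminary variable selection step can be fitted as follows (a minimal sketch, assuming the data-matrix-plus-grouping-factor interface):

ldafit <- Dlda(X, grp)  # variable selection via 'SelectV' runs by default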
Consider a Discriminant Analysis problem with \(k\) groups, \(p\) selected variables, and a training sample consisting
of \(N = \sum_{g=1}^{k}n_g\) observations with group and overall means,
\(\bar{X}_g\) and \(\bar{X}_.\), and a between-groups scatter (scaled by degrees of freedom)
matrix, \(S_B = \frac{1}{N-k} \sum_{g=1}^{k} n_g (\bar{X}_g -\bar{X}_.)(\bar{X}_g -\bar{X}_.)^T \).
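In R, \(S_B\) can be computed directly from a data matrix and a grouping factor; the base-R sketch below (continuing the simulated example, and independent of HiDimDA's internals) mirrors the definition:

Xbar <- colMeans(X)                    # overall mean
SB <- matrix(0, p, p)
for (g in levels(grp)) {
  Xg  <- X[grp == g, , drop = FALSE]
  dev <- colMeans(Xg) - Xbar           # group-mean deviation
  SB  <- SB + nrow(Xg) * tcrossprod(dev)
}
SB <- SB / (n - nlevels(grp))          # scale by N - k degrees of freedom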
Following the two main classical approaches to Linear Discriminant Analysis, the Discriminant Functions returned by HiDimDA discriminant
routines are either based on the canonical linear discriminants given by the normalized eigenvectors
$$LD_j = Egvct_j (S_B \hat{\Sigma}_W^{-1}), \qquad j = 1, \dots, r = \min(p, k-1),$$
$$[LD_1, \dots, LD_r]^T \hat{\Sigma}_W [LD_1, \dots, LD_r] = I_r,$$
or the classification functions
$$CF_g = (\bar{X}_g - \bar{X}_1) \hat{\Sigma}_W^{-1}, \qquad g = 2, \dots, k,$$
where \(\hat{\Sigma}_W^{-1}\) is an estimate of the inverse within-groups covariance.
It is well known that these two approaches are equivalent: the rule that assigns a new observation to
the group with the closest centroid (in Euclidean distance) in the space of the canonical variates,
\(Z = [LD_1 ... LD_r]^T X \), gives the same results as the rule that assigns it to group 1 when all classification scores,
\(Clscr_g = CF_g^T X - CF_g^T \frac{(\bar{X}_1 + \bar{X}_g)}{2} \), are negative, and to the group with the highest classification
score otherwise.
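For concreteness, the base-R sketch below evaluates these classification scores on the two-group data simulated earlier, using the diagonal inverse-variance estimate of \(\hat{\Sigma}_W^{-1}\) (the choice made by ‘Dlda’); it illustrates the formulas above rather than the package's internal code:

## Group means and pooled within-group variances (diagonal Sigma_W)
m1 <- colMeans(X[grp == "A", ]); m2 <- colMeans(X[grp == "B", ])
centred <- X - rbind(m1, m2)[as.integer(grp), ]
wvar <- colSums(centred^2) / (n - 2)           # pooled variances, N - k df
CF2 <- (m2 - m1) / wvar                        # CF_2 with diagonal Sigma_W^-1
Clscr2 <- X %*% CF2 - sum(CF2 * (m1 + m2) / 2) # classification scores
pred <- ifelse(Clscr2 > 0, "B", "A")           # group 1 iff the score is negative
table(pred, grp)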
The discriminant routines of HiDimDA compute canonical linear discriminant functions by default, and classification functions when
the argument ‘ldafun’ is set to “classification”. However, unlike traditional linear discriminant analysis, where
\(\Sigma_W^{-1}\) is estimated by the inverse of the sample covariance matrix,
which is not well-defined when \(p \geq N-k\) and is unreliable if \(p\) is close to \(N-k\), the routines of HiDimDA use
four alternative well-conditioned estimators of \(\Sigma_W^{-1}\) that lead to reliable classification rules when \(p\) is larger than,
or close to, \(N-k\).
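For instance, continuing the earlier sketch (interface assumed as before), classification functions are requested with:

ldafit <- Dlda(X, grp, ldafun = "classification")  # classification functions instead of canonical LDs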
In particular, ‘Dlda’ estimates \(\Sigma_W^{-1}\) by the diagonal matrix of inverse sample variances; ‘Slda’ by
the inverse of an optimally shrunken Ledoit and Wolf (2004) covariance estimate, with the targets and optimal
target-intensity estimators proposed by Fisher and Sun (2011); ‘Mlda’ uses a regularized inverse
covariance that deemphasizes the importance given to the last eigenvectors of the sample covariance (see Thomaz, Kitani
and Gillies (2006) for details); and ‘RFlda’ uses a factor-model estimate of the true inverse correlation (or covariance)
matrix, based on the approach of Duarte Silva (2011).
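A side-by-side sketch of the four estimators on the simulated data follows; argument defaults are assumed, and the name of ‘RFlda’'s number-of-factors argument (taken here to be ‘q’) should be checked in its help page:

fit.d <- Dlda(X, grp)            # diagonal inverse variances
fit.s <- Slda(X, grp)            # shrunken covariance estimate
fit.m <- Mlda(X, grp)            # maximum-uncertainty regularization
fit.f <- RFlda(X, grp, q = 1)    # one-factor model ('q' assumed)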
The HiDimDA package also includes predict methods for all the discriminant routines implemented, a routine (‘DACrossVal’) for assessing
the quality of the classification results by k-fold cross-validation, and utilities for storing, extracting and efficiently handling specialized high-dimensional covariance and inverse covariance matrix estimates.
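A closing sketch of these utilities, again under assumptions that should be checked against the manual (the name of ‘DACrossVal’'s training-algorithm argument, taken here to be ‘TrainAlg’, and the ‘class’ component of the prediction object):

## Predict group membership for new observations
Xnew <- matrix(rnorm(5 * p), 5, p)
predict(fit.d, Xnew)$class       # 'class' component assumed

## Assess 'Dlda' by k-fold cross-validation
cv <- DACrossVal(X, grp, TrainAlg = Dlda)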