Package: HiDimDA
Type: Package
Version: 0.2-7
Date: 2024-10-06
License: GPL-3
LazyLoad: yes
LazyData: yes
HiDimDA is a package for High-Dimensional Discriminant Analysis aimed at problems with many variables, possibly many more
than the number of available observations. Its core consists of the four Linear Discriminant Analysis routines:
Dlda: Diagonal Linear Discriminant Analysis
Slda: Shrunken Linear Discriminant Analysis
Mlda: Maximum-uncertainty Linear Discriminant Analysis
RFlda: Factor-model Linear Discriminant Analysis
and the variable selection routine:
SelectV: High-Dimensional variable selection for supervised classification
that selects the variables to be used in a Discriminant classification rule by
ranking them according to two-sample t-scores (problems with two groups)
or ANOVA F-scores (problems with more than two groups), and discarding those
with scores below a threshold defined by the Higher Criticism (HC) approach
of Donoho and Jin (2008), the Expanded Higher Criticism scheme
proposed by Duarte Silva (2011), False Discovery Rate (Fdr) control as suggested by
Benjamini and Hochberg (1995), the FAIR approach of Fan and Fan (2008), or simply by
fixing the number of retained variables to some pre-defined constant.
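As an illustration, the sketch below ranks the variables of a simulated two-group problem. It assumes that ‘SelectV’ takes a data matrix and a grouping factor as its first two arguments, and that the returned list reports the number and the indices of the kept variables through components named ‘nvkpt’ and ‘vkptInd’; both assumptions should be checked against the package manual.

library(HiDimDA)

## Simulated two-group data: 30 observations on 500 variables,
## with a group-mean shift on the first 10 variables.
set.seed(1)
n <- 30; p <- 500
grp <- factor(rep(c("A", "B"), each = n/2))
X <- matrix(rnorm(n*p), n, p)
X[grp == "B", 1:10] <- X[grp == "B", 1:10] + 1.5

## Rank variables by two-sample t-scores and keep those passing
## the default threshold (assumed to be the Expanded Higher Criticism scheme).
sel <- SelectV(X, grp)
sel$nvkpt    # number of variables retained (assumed component name)
sel$vkptInd  # indices of the retained variables (assumed component name)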
All four discriminant routines, ‘Dlda’, ‘Slda’, ‘Mlda’ and ‘RFlda’, compute Linear
Discriminant Functions, by default after a preliminary variable selection step, based on alternative estimators of
the within-groups covariance matrix that lead to reliable allocation rules in problems where the number of selected
variables is close to, or larger than, the number of available observations.
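For example, with the simulated data above, a diagonal discriminant rule with the default preliminary variable selection step can be fitted as follows (a minimal sketch, assuming the data-matrix-plus-grouping-factor interface):

ldafit <- Dlda(X, grp)  # variable selection via 'SelectV' runs by default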
Consider a Discriminant Analysis problem with \(k\) groups, \(p\) selected variables, and a training sample consisting
of \(N = \sum_{g=1}^{k}n_g\) observations with group and overall means,
\(\bar{X}_g\) and \(\bar{X}_.\), and a between-groups scatter (scaled by degrees of freedom)
matrix, \(S_B = \frac{1}{N-k} \sum_{g=1}^{k} n_g (\bar{X}_g -\bar{X}_.)(\bar{X}_g -\bar{X}_.)^T \).
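In R, \(S_B\) can be computed directly from a data matrix and a grouping factor; the base-R sketch below (continuing the simulated example, and independent of HiDimDA's internals) mirrors the definition:

Xbar <- colMeans(X)                    # overall mean
SB <- matrix(0, p, p)
for (g in levels(grp)) {
  Xg  <- X[grp == g, , drop = FALSE]
  dev <- colMeans(Xg) - Xbar           # group-mean deviation
  SB  <- SB + nrow(Xg) * tcrossprod(dev)
}
SB <- SB / (n - nlevels(grp))          # scale by N - k degrees of freedom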
Following the two main classical approaches to Linear Discriminant Analysis, the Discriminant Functions returned by HiDimDA discriminant
routines are either based on the canonical linear discriminants given by the normalized eigenvectors
$$LD_j = Egvct_j (S_B \hat{\Sigma}_W^{-1}), \qquad j = 1, \dots, r = \min(p, k-1),$$
$$[LD_1, \dots, LD_r]^T \hat{\Sigma}_W [LD_1, \dots, LD_r] = I_r,$$
or the classification functions
$$CF_g = (\bar{X}_g - \bar{X}_1) \hat{\Sigma}_W^{-1}, \qquad g = 2, \dots, k,$$
where \(\hat{\Sigma}_W^{-1}\) is an estimate of the inverse within-groups covariance.
It is well known that these two approaches are equivalent: the rule that assigns a new observation to
the group with the closest centroid (in Euclidean distance) in the space of the canonical variates,
\(Z = [LD_1 ... LD_r]^T X \), gives the same results as the rule that assigns it to group 1 when all classification scores,
\(Clscr_g = CF_g^T X - CF_g^T \frac{(\bar{X}_1 + \bar{X}_g)}{2} \), are negative, and to the group with the highest classification
score otherwise.
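For concreteness, the base-R sketch below evaluates these classification scores on the two-group data simulated earlier, using the diagonal inverse-variance estimate of \(\hat{\Sigma}_W^{-1}\) (the choice made by ‘Dlda’); it illustrates the formulas above rather than the package's internal code:

## Group means and pooled within-group variances (diagonal Sigma_W)
m1 <- colMeans(X[grp == "A", ]); m2 <- colMeans(X[grp == "B", ])
centred <- X - rbind(m1, m2)[as.integer(grp), ]
wvar <- colSums(centred^2) / (n - 2)           # pooled variances, N - k df
CF2 <- (m2 - m1) / wvar                        # CF_2 with diagonal Sigma_W^-1
Clscr2 <- X %*% CF2 - sum(CF2 * (m1 + m2) / 2) # classification scores
pred <- ifelse(Clscr2 > 0, "B", "A")           # group 1 iff the score is negative
table(pred, grp)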
The discriminant routines of HiDimDA compute canonical linear discriminant functions by default, and classification functions when
the argument ‘ldafun’ is set to “classification”. However, unlike traditional linear discriminant analysis, where
\(\Sigma_W^{-1}\) is estimated by the inverse of the sample covariance matrix,
which is not well-defined when \(p \geq N-k\) and is unreliable if \(p\) is close to \(N-k\), the routines of HiDimDA use
four alternative well-conditioned estimators of \(\Sigma_W^{-1}\) that lead to reliable classification rules when \(p\) is larger than,
or close to, \(N-k\).
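For instance, continuing the earlier sketch (interface assumed as before), classification functions are requested with:

ldafit <- Dlda(X, grp, ldafun = "classification")  # classification functions instead of canonical LDs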
In particular, ‘Dlda’ estimates \(\Sigma_W^{-1}\) by the diagonal matrix of inverse sample variances; ‘Slda’ by
the inverse of an optimally shrunken Ledoit and Wolf (2004) covariance estimate, with the targets and optimal
target-intensity estimators proposed by Fisher and Sun (2011); ‘Mlda’ uses a regularized inverse
covariance that deemphasizes the importance given to the last eigenvectors of the sample covariance (see Thomaz, Kitani
and Gillies (2006) for details); and ‘RFlda’ uses a factor-model estimate of the true inverse correlation (or covariance)
matrix, based on the approach of Duarte Silva (2011).
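A side-by-side sketch of the four estimators on the simulated data follows; argument defaults are assumed, and the name of ‘RFlda’'s number-of-factors argument (taken here to be ‘q’) should be checked in its help page:

fit.d <- Dlda(X, grp)            # diagonal inverse variances
fit.s <- Slda(X, grp)            # shrunken covariance estimate
fit.m <- Mlda(X, grp)            # maximum-uncertainty regularization
fit.f <- RFlda(X, grp, q = 1)    # one-factor model ('q' assumed)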
The HiDimDA package also includes predict methods for all the discriminant routines implemented, a routine (‘DACrossVal’) for assessing
the quality of the classification results by k-fold cross-validation, and utilities for storing, extracting and efficiently handling specialized high-dimensional covariance and inverse covariance matrix estimates.
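A closing sketch of these utilities, again under assumptions that should be checked against the manual (the name of ‘DACrossVal’'s training-algorithm argument, taken here to be ‘TrainAlg’, and the ‘class’ component of the prediction object):

## Predict group membership for new observations
Xnew <- matrix(rnorm(5 * p), 5, p)
predict(fit.d, Xnew)$class       # 'class' component assumed

## Assess 'Dlda' by k-fold cross-validation
cv <- DACrossVal(X, grp, TrainAlg = Dlda)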