rda: Regularized Discriminant Analysis (RDA)

Description

Builds a classification rule using regularized group covariance matrices, which are intended to be more robust against multicollinearity in the data.

Usage

rda(x, ...)
# S3 method for default
rda(x, grouping = NULL, prior = NULL, gamma = NA, 
    lambda = NA, regularization = c(gamma = gamma, lambda = lambda), 
    crossval = TRUE, fold = 10, train.fraction = 0.5, 
    estimate.error = TRUE, output = FALSE, startsimplex = NULL, 
    max.iter = 100, trafo = TRUE, simAnn = FALSE, schedule = 2, 
    T.start = 0.1, halflife = 50, zero.temp = 0.01, alpha = 2, 
    K = 100, ...)
# S3 method for formula
rda(formula, data, ...)

Value

A list of class rda containing the following components:

call: The (matched) function call.
regularization: vector containing the two regularization parameters (gamma, lambda)
classes: the names of the classes
prior: the prior probabilities for the classes
error.rate: apparent error rate (if computation was not suppressed), and, if any optimization took place, the final (cross-validated or bootstrapped) error rate estimate as well.
means: Group means.
covariances: Array of group covariances.
covpooled: Pooled covariance.
converged: (Logical) indicator of convergence (only for Nelder-Mead).
iter: Number of iterations actually performed (only for Nelder-Mead).

Arguments

x: Matrix or data frame containing the explanatory variables (required, if formula is not given).
formula: Formula of the form ‘groups ~ x1 + x2 + ...’.
data: A data frame (or matrix) containing the explanatory variables.
grouping: (Optional) a vector specifying the class for each observation; if not specified, the first column of ‘data’ is taken.
prior: (Optional) prior probabilities for the classes. Default: proportional to training sample sizes. “prior=1” indicates equally likely classes.
gamma, lambda, regularization: One or both of the rda-parameters may be fixed manually. Unspecified parameters are determined by minimizing the estimated error rate (see below).
crossval: Logical. If TRUE, in the optimization step the error rate is estimated by Cross-Validation, otherwise by drawing several training- and test-samples.
fold: The number of Cross-Validation- or Bootstrap-samples to be drawn.
train.fraction: In case of Bootstrapping: the fraction of the data to be used for training in each Bootstrap-sample; the remainder is used to estimate the misclassification rate.
estimate.error: Logical. If TRUE, the apparent error rate for the final parameter set is estimated.
output: Logical flag to indicate whether text output during computation is desired.
startsimplex: (Optional) a starting simplex for the Nelder-Mead-minimization.
max.iter: Maximum number of iterations for Nelder-Mead.
trafo: Logical; indicates whether minimization is carried out using transformed parameters.
simAnn: Logical; indicates whether Simulated Annealing shall be used.
schedule: Annealing schedule 1 or 2 (exponential or polynomial).
T.start: Starting temperature for Simulated Annealing.
halflife: Number of iterations until temperature is reduced to a half (schedule 1).
zero.temp: Temperature at which it is set to zero (schedule 1).
alpha: Power of temperature reduction (linear, quadratic, cubic,...) (schedule 2).
K: Number of iterations until temperature = 0 (schedule 2).
...: currently unused
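
The two annealing schedules controlled by schedule, T.start, halflife, zero.temp, alpha and K can be pictured with the following sketch. It is only one plausible reading of the argument descriptions above: the function names temp_exponential and temp_polynomial are hypothetical, and the exact formulas used inside rda() may differ.

```r
# Hypothetical temperature schedules consistent with the argument descriptions;
# these are illustrative helpers, not functions exported by any package.

# Schedule 1 (exponential): the temperature halves every `halflife` iterations
# and is set to zero once it falls below `zero.temp`.
temp_exponential <- function(i, T.start = 0.1, halflife = 50, zero.temp = 0.01) {
  temp <- T.start * 0.5^(i / halflife)
  ifelse(temp < zero.temp, 0, temp)
}

# Schedule 2 (polynomial): the temperature decays with power `alpha`
# and reaches zero at iteration K.
temp_polynomial <- function(i, T.start = 0.1, alpha = 2, K = 100) {
  T.start * pmax(0, 1 - i / K)^alpha
}

iter <- 0:200
schedule1 <- temp_exponential(iter)
schedule2 <- temp_polynomial(iter)
```

Plotting schedule1 and schedule2 against iter shows how quickly each schedule stops accepting uphill moves under these assumptions.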

More details

The explicit definition of $\gamma$, $\lambda$ and the resulting covariance estimates is as follows:

Let $\hat{\Sigma}$ denote the pooled covariance estimate and $\hat{\Sigma}_k$ the individual covariance estimate for group $k$.

First, using $\lambda$, a convex combination of these two is computed: $$\hat{\Sigma}_k (\lambda) := (1-\lambda) \hat{\Sigma}_k + \lambda \hat{\Sigma}.$$ Then, another convex combination is constructed using the above estimate and a (scaled) identity matrix: $$\hat{\Sigma}_k (\lambda,\gamma) = (1-\gamma)\hat{\Sigma}_k(\lambda)+ \gamma\frac{1}{d}\mathrm{tr}[\hat{\Sigma}_k(\lambda)]\mathrm{I}.$$ The factor $\frac{1}{d}\mathrm{tr}[\hat{\Sigma}_k(\lambda)]$ in front of the identity matrix I is the mean of the diagonal elements of $\hat{\Sigma}_k(\lambda)$, so it is the mean variance of all $d$ variables assuming the group covariance $\hat{\Sigma}_k(\lambda)$.
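
The two convex combinations above can be traced step by step in base R. This is a minimal sketch that follows the formulas as written here; reg_cov, S_k and S_pooled are illustrative names, not part of the package interface, and the internal implementation of rda() may differ in detail.

```r
# Regularized covariance for one group, following the two formulas above.
# `reg_cov` is an illustrative helper, not a package function.
reg_cov <- function(S_k, S_pooled, gamma, lambda) {
  stopifnot(gamma >= 0, gamma <= 1, lambda >= 0, lambda <= 1)
  d <- nrow(S_k)
  # Step 1: shrink the group covariance towards the pooled covariance.
  S_lambda <- (1 - lambda) * S_k + lambda * S_pooled
  # Step 2: shrink towards a scaled identity; the scale is the mean variance.
  (1 - gamma) * S_lambda + gamma * (sum(diag(S_lambda)) / d) * diag(d)
}

# Example with the iris data: covariance of one species and a pooled estimate.
X <- as.matrix(iris[, 1:4])
groups <- iris$Species
S_setosa <- cov(X[groups == "setosa", ])
# Pooled within-group covariance. The three groups have equal size here,
# so it is the plain average of the group covariances.
S_pooled <- Reduce(`+`, lapply(split(as.data.frame(X), groups), cov)) /
  nlevels(groups)

S_reg <- reg_cov(S_setosa, S_pooled, gamma = 0.05, lambda = 0.2)
```

Note that the second step leaves the trace unchanged, so shrinking with $\gamma$ redistributes variance between the variables without altering their total.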

For the four extremes of ($\gamma$,$\lambda$) the covariance structure reduces to special cases:

($\gamma=0$, $\lambda=0$): QDA - individual covariance for each group.
($\gamma=0$, $\lambda=1$): LDA - a common covariance matrix.
($\gamma=1$, $\lambda=0$): Conditionally independent variables - similar to Naive Bayes, but the variable variances within each group (main diagonal elements) are equal.
($\gamma=1$, $\lambda=1$): Classification using Euclidean distance - as in the previous case, but the variances are the same for all groups. Objects are assigned to the group with the nearest mean.
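
The four corner cases can be checked numerically on a small made-up example. The sketch below re-implements the two convex combinations from the formulas above; reg_cov and the two 2x2 matrices are purely illustrative.

```r
# Illustrative helper implementing the two convex combinations
# (not a package function).
reg_cov <- function(S_k, S_pooled, gamma, lambda) {
  S_lambda <- (1 - lambda) * S_k + lambda * S_pooled
  d <- nrow(S_lambda)
  (1 - gamma) * S_lambda + gamma * (sum(diag(S_lambda)) / d) * diag(d)
}

# Two made-up 2x2 covariance matrices.
S_k      <- matrix(c(4, 1, 1, 2), nrow = 2)
S_pooled <- matrix(c(3, 0.5, 0.5, 5), nrow = 2)

qda_like  <- reg_cov(S_k, S_pooled, gamma = 0, lambda = 0)  # equals S_k
lda_like  <- reg_cov(S_k, S_pooled, gamma = 0, lambda = 1)  # equals S_pooled
indep     <- reg_cov(S_k, S_pooled, gamma = 1, lambda = 0)  # 3 * identity
euclidean <- reg_cov(S_k, S_pooled, gamma = 1, lambda = 1)  # 4 * identity
```

In the last two cases the off-diagonal elements vanish and every variable receives the mean variance, taken from the group covariance for $\lambda=0$ and from the pooled covariance for $\lambda=1$.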

Author

Christian Röver, roever@statistik.tu-dortmund.de

Details

J.H. Friedman (see references below) suggested a method to fix almost singular covariance matrices in discriminant analysis. Basically, individual covariances as in QDA are used, but depending on two parameters ($\gamma$ and $\lambda$), these can be shifted towards a diagonal matrix and/or the pooled covariance matrix. For ($\gamma=0$, $\lambda=0$) it equals QDA, for ($\gamma=0$, $\lambda=1$) it equals LDA.

You may fix these parameters at certain values or leave it to the function to search for “optimal” values. If one parameter is given, the other one is determined using the R-function ‘optimize’. If no parameter is given, both are determined numerically by a Nelder-Mead (simplex) algorithm, with the option of using Simulated Annealing. The objective function to be minimized is the (estimated) misclassification rate; the misclassification rate is estimated either by Cross-Validation or by repeatedly dividing the data into training- and test-sets (Bootstrapping).

Warning: If the training- and test-sets are small, optimization is expected to produce almost random results. We recommend setting the parameters manually in such a case. In all other cases it is recommended to run the optimization several times in order to see whether stable results are obtained.
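
The three modes described above (both parameters fixed, one fixed, none fixed) might be used as in the following sketch. It assumes that rda() as documented here is the one provided by the klaR package and that the package is installed; the seed is arbitrary, and the parameter values found by the optimization will vary between runs.

```r
library(klaR)  # assumption: rda() as documented here comes from the klaR package
data(iris)
set.seed(1)    # error-rate estimates are random, so fix the seed for the demo

# Both parameters fixed: no optimization takes place.
fit_fixed <- rda(Species ~ ., data = iris, gamma = 0.05, lambda = 0.2)

# gamma fixed, lambda left unspecified: lambda is chosen by minimizing
# the estimated (by default cross-validated) error rate.
fit_lambda <- rda(Species ~ ., data = iris, gamma = 0.05)

# Neither parameter given: both are searched for numerically.
fit_both <- rda(Species ~ ., data = iris)

# Compare the resulting parameter values and error rates.
fit_fixed$regularization
fit_lambda$regularization
fit_both$regularization
fit_both$error.rate
```

Repeating the last two fits with different seeds gives a quick impression of how stable the selected parameters are.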

Since the Nelder-Mead-algorithm is actually intended for continuous functions while the observed error rate is by its nature discrete, a greater number of Bootstrap-samples might improve the optimization by increasing the smoothness of the response surface (and, of course, by reducing variance and bias). If a set of parameters leads to singular covariance matrices, a penalty term is added to the misclassification rate, which will hopefully help to maneuver back out of singularity (so do not worry about error rates greater than one during optimization).

References

Friedman, J.H. (1989): Regularized Discriminant Analysis. In: Journal of the American Statistical Association 84, 165-175.

Press, W.H., Flannery, B.P., Teukolsky, S.A., Vetterling, W.T. (1992): Numerical Recipes in C. Cambridge: Cambridge University Press.

Examples

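library(klaR)  # rda() is provided by the klaR package
data(iris)
# Fit RDA with both regularization parameters fixed manually
x <- rda(Species ~ ., data = iris, gamma = 0.05, lambda = 0.2)
# Predicted classes and posterior probabilities for the training data
predict(x, iris)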
