manova.update: MANOVA update for lm.rrpp model fits

Description

Function updates a lm.rrpp fit to add $MANOVA, which like $ANOVA, provides statistics or matrices typically associated with multivariate analysis of variance (MANOVA).

MANOVA statistics or sums of squares and cross-products (SSCP) matrices are calculated over the random permutations of a lm.rrpp fit. SSCP matrices are computed, as are the inverse of R times H (invR.H), where R is a SSCP for the residuals or random effects and H is the difference between SSCP matrices of full and reduced models (see below). From invR.H, MANOVA statistics are calculated, including Roy's maximum root (eigenvalue), Pillai trace, Hotelling-Lawley trace, and Wilks lambda (via summary.manova.lm.rrpp).

The manova.update to add $MANOVA to lm.rrpp fits requires more computation time than the $ANOVA statistics that are computed automatically in lm.rrpp. Generally, the same inferential conclusions will be found with either approach, when observations outnumber response variables. For high-dimensional data (more variables than observations) data are projected into a Euclidean space of appropriate dimensions (rank of residual covariance matrix). One can vary the tolerance for eigenvalue decay or specify the number of PCs, if a smaller set of PCs than the maximum is desired. This is advised if there is strong correlation among variables (the data space could be simplified to fewer dimensions), as spurious results are possible. Because distributions of MANOVA stats can be generated from the random permutations, there is no need to approximate F-values, like with parametric MANOVA. By restricting analysis to the real, positive eigenvalues calculated, all statistics can be calculated (but Wilks lambda, as a product but not a trace, might be less reliable as variable number approaches the number of observations).

ANOVA vs. MANOVA

Two SSCP matrices are calculated for each linear model effect, for every random permutation: R (Residuals or Random effects) and H, the difference between SSCPs for "full" and "reduced" models. (Full models contain and reduced models lack the effect tested; SSCPs are hypothesized to be the same under a null hypothesis, if there is no effect. The difference, H, would have a trace of 0 if the null hypothesis were true.) In RRPP, ANOVA and MANOVA correspond to two different ways to calculate statistics from R and H matrices.

ANOVA statistics are those that find the trace of R and H SSCP matrices before calculating subsequent statistics, including sums of squares (SS), mean squares (MS), and F-values. These statistics can be calculated with univariate data and provide univariate-like statistics for multivariate data. These statistics are dispersion measures only (covariances among variables do not contribute) and are the same as "distance-based" stats proposed by Goodall (1991) and Anderson (2001). MANOVA stats require multivariate data and are implicitly affected by variable covariances. For MANOVA, the inverse of R times H (invR.H) is first calculated for each effect, then eigen-analysis is performed on these matrix products. Multivariate statistics are calculated from the positive, real eigenvalues. In general, inferential conclusions will be similar with either approach, but effect sizes might differ.

Two important differences between manova.update and summary.manova (for lm objects) are that manova.update does not attempt to normalize residual SSCP matrices (unneeded for non-parametric statistical solutions) and (2) uses a generalized inverse of the residual SSCP, if needed, when the number of variables could render eigen-analysis problematic. This approach is consistent with covariance regularization methods that attempt to make covariance matrices positive-definite for calculating model likelihoods or multivariate statistics. If the number of observations far exceeds the number of response variables, observed statistics from manova.update and summary.manova will be quite similar. If the number of response variables approaches or exceeds the number of observations, manova.update statistics will be much more reliable.

ANOVA tables are generated by anova.lm.rrpp on lm.rrpp fits and MANOVA tables are generated by summary.manova.lm.rrpp, after running manova.update on lm.rrpp fits.

Currently, mixed model effects are only possible with $ANOVA statistics, not $MANOVA.

More detail is found in the vignette, ANOVA versus MANOVA.

Usage

manova.update(
  fit,
  error = NULL,
  tol = 1e-07,
  PC.no = NULL,
  print.progress = TRUE,
  verbose = NULL
)

Value

An object of class lm.rrpp is updated to include class manova.lm.rrpp, and the object, $MANOVA, which includes

SSCP: Terms and Model SSCP matrices.
invR.H: The inverse of the residuals SSCP times the H SSCP.
eigs: The eigenvalues of invR.H.
e.rank: Rank of the error (residuals) covariance matrix. Currently NULL only.
PCA: Principal component analysis of data, using either tol or PC.no.
manova.pc.dims: Resulting number of PC vectors in the analysis.
e.rank: Rank of the residual (error) covariance matrix, irrespective of the number of dimensions used for analysis.

Arguments

fit: Linear model fit from lm.rrpp
error: An optional character string to define R matrices used to calculate invR.H. (Currently only Residuals can be used and this argument defaults to NULL. Future versions will update this argument.)
tol: A tolerance value for culling data dimensions to prevent spurious results. The distribution of eigenvalues for the data will be examined and if the decay becomes less than the tolerance, the data will be truncated to principal components ahead of this point. This will possibly prevent spurious results calculated from eigenvalues near 0. If tol = 0, all possible PC axes are used, which is likely not a problem if observations outnumber variables. If tol = 0 and the number of variables exceeds the number of observations, the value of tol will be made slightly positive to prevent problems with eigen-analysis.
PC.no: A value that, if not NULL, can override the tolerance argument, and forces a desired number of data PCs to use for analysis. If a value larger than the possible number of PCs is chosen, the full compliment of PCs (the full data space) will be used. If a number larger than tol would permit is chosen, the minimum number of PCs between the tol argument and PC.no argument is returned.
print.progress: A logical value to indicate whether a progress bar should be printed to the screen. This is helpful for long-running analyses.
verbose: Either a NULL or logical value for whether to retain all MANOVA result (if TRUE). If NULL, the verbose argument used for the lm.rrpp is retained for MANOVA update. Essentially, verbose indicates whether to retain all SSCP matrices and all invR.H matrices, for every model effect, in every RRPP iteration.

Author

Michael Collyer

References

Goodall, C.R. 1991. Procrustes methods in the statistical analysis of shape. Journal of the Royal Statistical Society B 53:285-339.

Anderson MJ. 2001. A new method for non-parametric multivariate analysis of variance. Austral Ecology 26: 32-46.

Examples

Run this code

if (FALSE) {
# Body Shape Analysis (Multivariate) ----------------

data(Pupfish)

# Although not recommended as a practice, this example will use only
# three principal components of body shape for demonstration.  
# A larger number of random permutations should also be used.

Pupfish$shape <- ordinate(Pupfish$coords)$x[, 1:3]

Pupfish$logSize <- log(Pupfish$CS) 

fit <- lm.rrpp(shape ~ logSize + Sex, SS.type = "I", 
data = Pupfish, print.progress = FALSE, iter = 499) 
summary(fit, formula = FALSE)
anova(fit) # ANOVA table

# MANOVA

fit.m <- manova.update(fit, print.progress = FALSE, tol = 0.001)
summary(fit.m, test = "Roy")
summary(fit.m, test = "Pillai")

fit.m$MANOVA$eigs$obs # observed eigenvalues
fit.m$MANOVA$SSCP$obs # observed SSCP
fit.m$MANOVA$invR.H$obs # observed invR.H 

# Distributions of test statistics

summ.roy <- summary(fit.m, test = "Roy")
dens <- apply(summ.roy$rand.stats, 1, density)
par(mfcol = c(1, length(dens)))
for(i in 1:length(dens)) {
     plot(dens[[i]], xlab = "Roy max root", ylab = "Density",
     type = "l", main = names(dens)[[i]])
     abline(v = summ.roy$rand.stats[1, i], col = "red")
}
par(mfcol = c(1,1))
}

Run the code above in your browser using DataLab