anova.manyglm: Analysis of Deviance for Multivariate Generalized Linear Model Fits for Abundance Data

Description

Compute an analysis of deviance table for one or more multivariate generalized linear model fits.

Usage


# S3 method for manyglm
anova(object, ..., resamp="pit.trap", test="LR", p.uni="none",
    nBoot=999, cor.type=object$cor.type,
    pairwise.comp = NULL,
    block=NULL, show.time="total",
    show.warning=FALSE, rep.seed=FALSE, bootID=NULL, keep.boot=FALSE) 
# S3 method for anova.manyglm
print(x, ...)

Value

family: the family component from object.
p.uni: the p.uni argument supplied.
test: the test argument supplied.
cor.type: the cor.type argument supplied.
resamp: the resamp argument supplied.
nBoot: the nBoot argument supplied.
shrink.parameter: a list of shrink parameters from all manyglm objects in the anova test.
n.bootsdone: the number of bootstrapping iterations that were done, i.e. had no error.
table: the table with Residual Degrees of Freedom, Degrees of Freedom, the Test Statistics and the P values.
block: any block argument specified as an inpout argument.
pairwise.comp: The pairwise.comp argument supplied.

If p.uni="adjusted" or "unadjusted" the output list also contains

uni.test: a table showing the test statistics of the univariate tests.
uni.p: a table showing the p-values of the univariate tests.

If keep.boot=TRUE the output list also contains

bootStat: A matrix of boot strapped test statistics, the first column is the multivariate test statistic, the rest of the columns are the univariate statistic.

If !is.null(parwise.comp) the output list also contains

pairwise.comp.table: A data.frame containing the comparisons, the observed test statistcs and the holm free stepdown adjusted p-values.

Arguments

object: objects of class manyglm, typically the result of a call to manyglm.
...: for the anova.manyglm method, these are optional further objects of class manyglm, which are usually a result of a call to manyglm for the print.anova.manyglm method these are optional further arguments passed to or from other methods. See print.summary.glm for more details.
resamp: the method of resampling used. Can be one of "case", "perm.resid", "montecarlo" or "pit.trap" (default). See Details.
test: the test to be used. If cor.type="I", this can be one of "wald" for a Wald-Test or "score" for a Score-Test or "LR" for a Likelihood-Ratio-Test, otherwise only "wald" and "score" is allowed. The default value is "LR".
p.uni: whether to calculate univariate test statistics and their P-values, and if so, what type. This can be one of the following options.
"none" = No univariate P-values (default)
"unadjusted" = A test statistic and (ordinary unadjusted) P-value are reported for each response variable.
"adjusted" = Univariate P-values are adjusted for multiple testing, using a step-down resampling procedure.
nBoot: the number of Bootstrap iterations, default is nBoot=999.
cor.type: structure imposed on the estimated correlation matrix under the fitted model. Can be "I"(default), "shrink", or "R". See Details.
pairwise.comp: A character or factor vector specifying the levels for which a pairwise comparison will be carried out, adjusting for multiple comparisons via a free stepdown resampling procedure. Alternatively, a onesided formula specifying an interaction between factor levels.
block: a factor specifying the sampling level to be resampled. Default is resampling rows.
show.time: Whether to display timing information for the resampling procedure: "none" shows none, "all" shows all timing information and "total" shows only the overall time taken for the tests.
show.warning: logical. Whether to display warning messages in the operation procedure.
rep.seed: logical. Whether to fix random seed in resampling data. Useful for simulation or diagnostic purposes.
bootID: an integer matrix where each row specifies bootstrap id's in each resampling run. When bootID is supplied, nBoot is set to the number of rows in bootID. Default is NULL.
keep.boot: logical. Whether to return the bootstrapped test statistics.
x: an object of class "anova.manyglm", usually, a result of a call to anova.manyglm.

Warning

The comparison between two or more models by anova.manyglm will only be valid if they are fitted to the same dataset. This may be a problem if there are missing values and R's default of na.action = na.omit is used.

Author

Yi Wang, Ulrike Naumann, John Wilshire and David Warton <David.Warton@unsw.edu.au>.

Details

The anova.manyglm function returns a table summarising the statistical significance of a fitted manyglm model (Warton 2011), or of the differences between several nested models. If one model is specified, sequential test statistics (and P values) are returned for that fit. If more than one object is specified, the table contains test statistics (and P values) comparing their fits, provided that the models are fitted to the same dataset.

The test statistics are determined by the argument test, and the P-values are calculated by resampling rows of the data using a method determined by the argument resamp. resamp. Two of the three available resampling methods (residual permutation and parametric bootstrap) are described in more detail in Davison and Hinkley (1997, chapter 6), whereas the default (the ``PIT-trap'', Warton et al 2017) bootstraps probability integral transform residuals, which we have found to give the most reliable Type I error rates. All methods involve resampling under the resampling under the null hypothesis. These methods ensure approximately valid inference even when the mean-variance relationship or the correlation between variables has been misspecified. Standardized Pearson residuals (see manyglm are currently used in residual permutation, and where necessary, resampled response values are truncated so that they fall in the required range (e.g. counts cannot be negative). However, this can introduce bias, especially for family=binomial, so we advise extreme caution using perm.resid for presence/absence data. If resamp="none", p-values cannot be calculated, however the test statistics are returned.

If you do not have a specific hypothesis of primary interest that you want to test, and are instead interested in which model terms are statistically significant, then the summary.manyglm function is more appropriate. Whereas summary.manyglm tests the significance of each explanatory variable, anova.manyglm, given one manyglm object tests each term of the formula, e.g. if the formula is 'y~a+b' then a and b, that can be vectors or matrices, are tested for significance.

For information on the different types of data that can be modelled using manyglm, see manyglm. To check model assumptions, use plot.manyglm.

Multivariate test statistics are constructed using one of three methods: a log-likelihood ratio statistic test="LR", for example as in Warton et. al. (2012) or a Wald statistic test="wald" or a Score statistic test="score". "LR" has good properties, but is only available when cor.type="I".

The default Wald test statistic makes use of a generalised estimating equations (GEE) approach, estimating the covariance matrix of parameter estimates using a sandwich-type estimator that assumes the mean-variance relationship in the data is correctly specified and that there is an unknown but constant correlation across all observations. Such assumptions allow the test statistic to account for correlation between variables but to do so in a more efficient way than traditional GEE sandwich estimators (Warton 2011). The common correlation matrix is estimated from standardized Pearson residuals, and the method specified by cor.type is used to adjust for high dimensionality.

The Wald statistic has problems for count data and presence-absence data when there are estimated means at zero (which usually means very large parameter estimates, check for this using coef). In such instances Wald statistics should not be used, Score or LR should do the job.

The anova.manyglm function is designed specifically for high-dimensional data (that, is when the number of variables p is not small compared to the number of observations N). In such instances a correlation matrix is computationally intensive to estimate and is numerically unstable, so by default the test statistic is calculated assuming independence of variables (cor.type="I"). Note however that the resampling scheme used ensures that the P-values are approximately correct even when the independence assumption is not satisfied. However if it is computationally feasible for your dataset, it is recommended that you use cor.type="shrink" to account for correlation between variables, or cor.type="R" when p is small. The cor.type="R" option uses the unstructured correlation matrix (only possible when N>p), such that the standard classical multivariate test statistics are obtained. Note however that such statistics are typically numerically unstable and have low power when p is not small compared to N.

The cor.type="shrink" option applies ridge regularisation (Warton 2008), shrinking the sample correlation matrix towards the identity, which improves its stability when p is not small compared to N. This provides a compromise between "R" and "I", allowing us to account for correlation between variables, while using a numerically stable test statistic that has good properties.

The shrinkage parameter is an attribute of a manyglm object. For a Wald test, the sample correlation matrix of the alternative model is used to calculate the test statistics. So shrink.param of the alternative model is used. For a score test, the sample correlation matrix of the null model is used to calculate the test statistics. So shrink.param of the null model is used instead. If cor.type=="shrink" and shrink.param is NULL, then the shrinkage parameter will be estimated by cross-validation using the multivariate normal likelihood function (see ridgeParamEst and (Warton 2008)) for the corresponding model in the anova test.

Rather than stopping after testing for multivariate effects, it is often of interest to find out which response variables express significant effects. Univariate statistics are required to answer this question, and these are reported if requested. Setting p.uni="unadjusted" returns resampling-based univariate P-values for all effects as well as the multivariate P-values, whereas p.uni="adjusted" returns adjusted P-values (that have been adjusted for multiple testing), calculated using a step-down resampling algorithm as in Westfall & Young (1993, Algorithm 2.8). This method provides strong control of family-wise error rates, and makes use of resampling (using the method controlled by resamp) to ensure inferences take into account correlation between variables. This functionality is not currently available for models of relative abundance via composition=TRUE.

References

Davison, A. C. and Hinkley, D. V. (1997) Bootstrap Methods and their Application. Cambridge University Press, Cambridge.

Warton D.I. (2011). Regularized sandwich estimators for analysis of high dimensional data using generalized estimating equations. Biometrics, 67(1), 116-123.

Warton D.I. (2008). Penalized normal likelihood and ridge regularization of correlation and covariance matrices. Journal of the American Statistical Association 103, 340-349.

Warton D. I., Wright S., and Wang, Y. (2012). Distance-based multivariate analyses confound location and dispersion effects. Methods in Ecology and Evolution, 3(1), 89-101.

Warton D. I., Thibaut L., Wang Y. A. (2017). The PIT-trap - A "model-free" bootstrap procedure for inference about regression models with discrete, multivariate responses. PLoS One, 12(7), e0181790.

Westfall, P. H. and Young, S. S. (1993) Resampling-based multiple testing. John Wiley & Sons, New York.

Wu, C. F. J. (1986) Jackknife, Bootstrap and Other Resampling Methods in Regression Analysis. The Annals of Statistics 14:4, 1261-1295.

Examples

Run this code

## Load the Tasmania data set
data(Tasmania)

## Visualise the effect of treatment on copepod abundance
tasm.cop <- mvabund(Tasmania$copepods)
treatment <- Tasmania$treatment
block <- Tasmania$block
#plot(tasm.cop ~ treatment, col=as.numeric(block))

## Fitting predictive models using a negative binomial model for counts:
tasm.cop.nb <- manyglm(tasm.cop ~ block*treatment, family="negative.binomial")

## Testing hypotheses about the treatment effect and treatment-by-block interactions, 
## using a Wald statistic and 199 resamples (better to ramp up to 999 for a paper):
anova(tasm.cop.nb, nBoot=199, test="wald")

## Performing the Pairwise comparison:
if (FALSE) {
data(solberg)
manyglm(abund ~ x, data=solberg) -> msolglm
## pairwise comparison on solberg$x
anova(msolglm, pairwise.comp = solberg$x, nBoot = 199)
# Could also run: anova(msolglm, pairwise.comp = ~treatment, nBoot = 199)
}

Run the code above in your browser using DataLab