permutations: Permutation tests in Vegan

Description

From version 2.2-0, vegan has significantly improved access to restricted permutations, bringing it into line with the permutation schemes offered by Canoco. The permutation designs are modelled after the permutation schemes of Canoco 3.1 (ter Braak, 1990).

vegan currently provides for the following features within permutation tests:

Free permutation of DATA, also known as randomisation,
Free permutation of DATA within the levels of a grouping variable,
Restricted permutations for line transects or time series,
Permutation of groups of samples whilst retaining the within-group ordering,
Restricted permutations for spatial grids,
Blocking, where samples are never permuted between blocks, and
Split-plot designs, with permutation of whole plots, split plots, or both.

Above, we use DATA to mean either the observed data themselves or some function of the data, for example the residuals of an ordination model in the presence of covariables.

These capabilities are provided by functions from the permute package. The user can request a particular type of permutation by supplying the permutations argument of a function with an object returned by how, which defines how samples should be permuted. Alternatively, the user can simply specify the required number of permutations and a simple randomisation procedure will be performed. Finally, the user can supply a matrix of permutations (with number of rows equal to the number of permutations and number of columns equal to the number of observations in the data) and vegan will use these permutations instead of generating new permutations.

The majority of functions in vegan allow for the full range of possibilities outlined above. Exceptions include kendall.post and kendall.global.
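As a minimal sketch of the three ways of supplying the permutations argument, the example below uses anova on a constrained ordination. The dune data, the model formula, and the use of Use as a blocking variable are illustrative assumptions, not recommendations for a real analysis.

```r
library(vegan)
library(permute)

data(dune, dune.env)
mod <- rda(dune ~ Management, data = dune.env)

set.seed(1)

# 1. A single number: simple free permutation (randomisation)
a_num <- anova(mod, permutations = 199)

# 2. A design created by how(): here samples are only permuted within
#    the levels of Use, treated as blocks (an illustrative choice)
ctrl <- how(nperm = 199, blocks = dune.env$Use)
a_how <- anova(mod, permutations = ctrl)

# 3. A matrix of permutations: one row per permutation,
#    one column per observation
perms <- shuffleSet(nrow(dune), nset = 199, control = how())
a_mat <- anova(mod, permutations = perms)
```

Each call returns the usual anova table; only the way the permutations are generated differs.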

The Null hypothesis for the first two types of permutation test listed above assumes free exchangeability of DATA (within the levels of the grouping variable, if specified). Dependence between observations, such as that which arises due to spatial or temporal autocorrelation, or more-complicated experimental designs, such as split-plot designs, violates this fundamental assumption of the test and requires more complex restricted permutation test designs. It is these designs that are available via the permute package and to which vegan provides access from version 2.2-0 onwards.
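The restricted designs listed at the top of this page can be sketched with how from permute. The mapping below is one plausible reading of that list rather than a definitive recipe; the sample size, the grouping factor grp, and the 3 x 4 grid are hypothetical.

```r
library(permute)

n_obs <- 12
grp <- gl(3, 4)  # hypothetical grouping: 3 groups of 4 consecutive samples

# Free permutation (randomisation)
d_free <- how()
# Free permutation within the levels of a grouping variable
d_within <- how(plots = Plots(strata = grp, type = "none"),
                within = Within(type = "free"))
# Line transect or time series
d_series <- how(within = Within(type = "series"))
# Whole groups permuted, within-group ordering retained
d_groups <- how(plots = Plots(strata = grp, type = "free"),
                within = Within(type = "none"))
# Spatial grid
d_grid <- how(within = Within(type = "grid", nrow = 3, ncol = 4))
# Blocking: samples never move between blocks
d_block <- how(blocks = grp)
# Split-plot: permute both whole plots and split plots
d_split <- how(plots = Plots(strata = grp, type = "free"),
               within = Within(type = "free"))

set.seed(1)
p_block  <- shuffle(n_obs, control = d_block)
p_series <- shuffle(n_obs, control = d_series)
p_groups <- shuffle(n_obs, control = d_groups)
```

Each call to shuffle returns a single permutation of the sample indices that respects the corresponding design.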

Unless otherwise stated in the help pages for specific functions, permutation tests in vegan all follow the same format/structure:

1. An appropriate test statistic is chosen. Which statistic is chosen should be described on the help pages for individual functions.
2. The value of the test statistic is evaluated for the observed data and analysis/model and recorded. Denote this value $x_0$.
3. The DATA are randomly permuted according to one of the above schemes, and the value of the test statistic for this permutation is evaluated and recorded.
4. Step 3 is repeated a total of $n$ times, where $n$ is the number of permutations requested. Denote these values as $x_i$, where $i = 1, \ldots, n$.
5. Count the number of values of the test statistic, $x_i$, in the Null distribution that are at least as extreme as the test statistic for the observed data, $x_0$. Denote this count as $N$. We use the phrase "as extreme" to include cases where a two-sided test is performed and large negative values of the test statistic should be considered.
6. The permutation p-value is computed as $$p = \frac{N + 1}{n + 1}$$
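The steps above can be sketched in base R. This is an illustration only, not the code vegan uses: the simulated data, the choice of test statistic (absolute difference in group means), and free permutation are all assumptions made for the example.

```r
set.seed(42)

# Hypothetical data: two groups of 10 observations
group <- rep(c("a", "b"), each = 10)
y <- rnorm(20) + ifelse(group == "b", 1, 0)

# Step 1: test statistic (absolute value, so the test is two-sided)
stat <- function(y, g) abs(mean(y[g == "b"]) - mean(y[g == "a"]))

# Step 2: observed value x_0
x0 <- stat(y, group)

# Steps 3 and 4: n free permutations of the data
n <- 999
xi <- replicate(n, stat(sample(y), group))

# Step 5: count permuted values at least as extreme as x_0
N <- sum(xi >= x0)

# Step 6: permutation p-value
p <- (N + 1) / (n + 1)
p
```

Because of the $+ 1$ terms, p can never be smaller than 1 / (n + 1), here 0.001.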

The above description illustrates why the default number of permutations specified in vegan functions takes values such as 199 or 999. The $+ 1$ in the denominator results in division by 200 or 1000 for the 199 or 999 random permutations used in the test, which yields round p values.

The simple intuition behind the presence of $+ 1$ in the numerator and denominator is that these represent the inclusion of the observed value of the statistic in the Null distribution (e.g. Manly 2006). Phipson & Smyth (2010) present a more compelling explanation for the inclusion of $+ 1$ in the numerator and denominator of the p value calculation.

Fisher (1935) had in mind that a permutation test would involve enumeration of all possible permutations of the data yielding an exact test. However, doing this complete enumeration may not be feasible in practice owing to the potentially vast number of arrangements of the data, even in modestly-sized data sets with free permutation of samples. As a result we evaluate the p value as the tail probability of the Null distribution of the test statistic directly from the random sample of possible permutations. Phipson & Smyth (2010) show that the naive calculation of the permutation p value is

$$p = \frac{N}{n}$$

which leads to an invalid test with incorrect type I error rate. They go on to show that by replacing the unknown tail probability (the p value) of the Null distribution with the biased estimator

$$p = \frac{N + 1}{n + 1}$$

the positive bias induced is of just the right size to account for the uncertainty in the estimation of the tail probability from the set of randomly sampled permutations to yield a test with the correct type I error rate.
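A small calculation illustrates the difference between the two estimators. It rests on a simplifying assumption that is not stated in the text above: under the null hypothesis, and in the absence of ties, the count $N$ is taken to be equally likely to be any of the values $0, 1, \ldots, n$. The values of n and alpha are arbitrary.

```r
n <- 100       # number of random permutations
alpha <- 0.05  # nominal significance level

# Assumption: under the null hypothesis, with no ties, every value
# of N in 0, 1, ..., n is equally likely
N <- 0:n

# Proportion of outcomes in which each estimator rejects at level alpha
size_naive     <- mean(N / n <= alpha)
size_corrected <- mean((N + 1) / (n + 1) <= alpha)

c(naive = size_naive, corrected = size_corrected)
```

Under this assumption the naive estimator rejects in 6 of the 101 equally likely outcomes, which exceeds the nominal 5%, whereas the corrected estimator rejects in 5 of 101 and stays within it.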

The estimator described above is correct when permutations of the data are sampled randomly without replacement. This is not strictly what happens in vegan, because permutations are drawn pseudo-randomly and independently of one another, so the same permutation may be drawn more than once. The chance of this happening in practice is small, but the functions in permute do not guarantee to generate a unique set of permutations unless complete enumeration of permutations is requested. Complete enumeration is feasible only for the smallest data sets or the most restrictive permutation designs; otherwise, the chance of drawing a set of permutations with repeats decreases as the sample size, and hence the size of the set of all possible permutations, increases.

When permutations are sampled with replacement, the tail probability $p$ calculated from the biased estimator described above is somewhat conservative, being too large by an amount that depends on the number of possible values that the test statistic can take under permutation of the data (Phipson & Smyth, 2010). This represents a slight loss of statistical power for the conservative p value calculation used here. However, unless sample sizes are small and the permutation design is such that the set of values that the test statistic can take is also small, this loss of power is unlikely to be critical.

The minimum achievable p-value is

$$p_{\mathrm{min}} = \frac{1}{n + 1}$$

and hence depends on the number of permutations evaluated. However, one cannot simply increase the number of permutations ($n$) to achieve a potentially lower p-value unless the number of observations available permits such a number of permutations. This is unlikely to be a problem for all but the smallest data sets when free permutation (randomisation) is valid, but in restricted permutation designs with a low number of observations, there may not be as many unique permutations of the data as you might desire to reach the required level of significance.

It is currently the responsibility of the user to determine the total number of possible permutations for their DATA. The number of possible permutations allowed under the specified design can be calculated using numPerms from the permute package. Heuristics employed within the shuffleSet function used by vegan can be triggered to generate the entire set of permutations instead of a random set. The settings controlling the triggering of the complete enumeration step are contained within a permutation design created using how and can be set by the user. See how for details.
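As a rough sketch, the number of permutations available under a design can be checked with numPerms before choosing how many to request. The sample size of 6 is hypothetical, and the minperm and complete values shown are, to the best of our knowledge, the defaults of how; consult its help page before relying on them.

```r
library(permute)

n_obs <- 6  # hypothetical, deliberately small sample size

# Number of possible permutations under different designs
n_free   <- numPerms(seq_len(n_obs), control = how())
n_series <- numPerms(seq_len(n_obs),
                     control = how(within = Within(type = "series")))
n_mirror <- numPerms(seq_len(n_obs),
                     control = how(within = Within(type = "series",
                                                   mirror = TRUE)))
c(free = n_free, series = n_series, mirrored = n_mirror)

# Settings that control when complete enumeration is triggered
ctrl <- how(nperm = 999, complete = FALSE, minperm = 5040)

# With only factorial(6) = 720 possible permutations, fewer than minperm,
# shuffleSet is expected to enumerate the full set rather than sample it
ps <- shuffleSet(n_obs, nset = 999, control = ctrl)
dim(ps)
```

If the count returned by numPerms is smaller than the number of permutations you intended to request, the smallest achievable p-value is limited by the design rather than by computing time.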

Limits on the total number of permutations of DATA are more severe in temporally or spatially ordered data or experimental designs with low replication. For example, a time series of $n = 100$ observations has just 100 possible permutations including the observed ordering.
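The count for a time series can be checked by hand. The base R sketch below assumes that the permissible permutations of a series are its cyclic shifts, with no mirroring, and enumerates them for 100 observations.

```r
n_obs <- 100

# Each row is one cyclic shift of the sample indices 1, ..., n_obs;
# the first row (shift of 0) is the observed ordering
shifts <- t(sapply(0:(n_obs - 1), function(k) {
  (seq_len(n_obs) + k - 1) %% n_obs + 1
}))

n_possible <- nrow(unique(shifts))
n_possible

# Smallest p-value attainable when every ordering, including the
# observed one, is counted exactly once
p_min <- 1 / n_possible
p_min
```

However many permutations are requested, a test on such a series cannot return a p-value below 0.01.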

In situations where only a low number of permutations is possible due to the nature of DATA or the experimental design, enumeration of all permutations becomes important and achievable computationally.

Above, we have provided only a brief overview of the capabilities of vegan and permute. To get the best out of the new functionality and for details on how to set up permutation designs using how, consult the vignette "Restricted permutations; using the permute package", which is supplied with permute and accessible via vignette("permutations", package = "permute").

Arguments

Random Number Generation

The permutations are based on the random number generator provided by R. This may change in R releases and change the permutations and vegan test results. One such change was in R release 3.6.0. The new version is clearly better for permutation tests and you should use it. However, if you need to reproduce old results, you can set the R random number generator to a previous version with RNGversion.
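A minimal base R sketch of switching generators, assuming the session starts with the default settings of an R version from 3.6.0 onwards:

```r
# Current sampler (the default since R 3.6.0)
set.seed(1)
new <- sample(10)

# Switch to the pre-3.6.0 behaviour to reproduce old results;
# R warns that this sampler is non-uniform, so the warning is suppressed
suppressWarnings(RNGversion("3.5.0"))
set.seed(1)
old <- sample(10)

# Restore the defaults of the running R version
RNGversion(as.character(getRversion()))

rbind(new = new, old = old)
```

The same seed gives different permutations under the two settings, which is why results from vegan permutation tests run under older R versions may not be reproduced exactly without this step.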

Author

Gavin L. Simpson

References

Manly, B. F. J. (2006). Randomization, Bootstrap and Monte Carlo Methods in Biology, Third Edition. Chapman and Hall/CRC.

Phipson, B., & Smyth, G. K. (2010). Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn. Statistical Applications in Genetics and Molecular Biology, 9, Article 39. DOI: 10.2202/1544-6115.1585

ter Braak, C. J. F. (1990). Update notes: CANOCO version 3.1. Wageningen: Agricultural Mathematics Group. (UR).
