dist.ldc: Dissimilarity matrices for community composition data

Description

Compute dissimilarity indices for ecological data matrices. The dissimilarity indices computed by this function are those described in Legendre and De Cáceres (2013). In the name of the function, 'ldc' stands for the authors' names. Twelve of these 21 indices are not readily available in other R package functions; four of them can, however, be computed in two computation steps in vegan.

Usage

dist.ldc(Y, method = "hellinger", binary = FALSE, samp = TRUE, silent = FALSE)

Value

A dissimilarity matrix, with class dist.

Arguments

Y: Community composition data. The object class can be either data.frame or matrix.
method: One of the dissimilarity coefficients available in the function: "hellinger", "chord", "log.chord", "chisquare", "profiles", "percentdiff", "ruzicka", "divergence", "canberra", "whittaker", "wishart", "kulczynski", "jaccard", "sorensen", "ochiai", "ab.jaccard", "ab.sorensen", "ab.ochiai", "ab.simpson", "euclidean", "manhattan", "modmeanchardiff". See Details. Names can be abbreviated to a non-ambiguous set of first letters. Default: method="hellinger".
binary: If binary=TRUE, the data are transformed to presence-absence form before computation of the dissimilarities. Default value: binary=FALSE, except for the Jaccard, Sørensen and Ochiai indices where binary=TRUE.
samp: If samp=TRUE, the abundance-based distances (ab.jaccard, ab.sorensen, ab.ochiai, ab.simpson) are computed for sample data. If samp=FALSE, the indices are computed for true population data.
silent: If silent=FALSE, informative messages sent to users will be printed to the R console. Use silent=TRUE when the function is called in a numerical simulation loop, for example.
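
The abbreviation rule for method can be illustrated with base R's match.arg(), which accepts any unambiguous prefix of a choice. This is only a sketch of the rule: it is an assumption that dist.ldc resolves names in a comparable way, and the helper resolve.method is hypothetical.

```r
# Method names as listed for the 'method' argument above
methods <- c("hellinger", "chord", "log.chord", "chisquare", "profiles",
             "percentdiff", "ruzicka", "divergence", "canberra", "whittaker",
             "wishart", "kulczynski", "jaccard", "sorensen", "ochiai",
             "ab.jaccard", "ab.sorensen", "ab.ochiai", "ab.simpson",
             "euclidean", "manhattan", "modmeanchardiff")

# Hypothetical helper: return the full name, or NA when the prefix
# is ambiguous or unknown (match.arg() raises an error in those cases)
resolve.method <- function(x) {
  tryCatch(match.arg(x, methods), error = function(e) NA_character_)
}

resolve.method("hel")    # unique prefix of "hellinger"
resolve.method("ab.so")  # unique prefix of "ab.sorensen"
resolve.method("ch")     # ambiguous: "chord" and "chisquare" both match
```

In this sketch "ch" is rejected because two names share that prefix, whereas "cho" or "chi" would be accepted.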

Author

Pierre Legendre pierre.legendre@umontreal.ca and Naima Madi

Details

The dissimilarities computed by this function are the following. Indices i and k designate two rows (sites) of matrix Y, j designates a column (species). D[ik] is the dissimilarity between rows i and k. p is the number of columns (species) in Y; pp is the number of species present in one or the other site, or in both. y[i+] is the sum of values in row i; same for y[k+]. y[+j] is the sum of values in column j. y[++] is the total sum of values in Y. The indices are computed by functions written in C for greater computation speed with large data matrices.

Group 1 - D computed by transformation of Y followed by Euclidean distance
- Hellinger D, D[ik] = sqrt(sum((sqrt(y[ij]/y[i+])-sqrt(y[kj]/y[k+]))^2))
- chord D, D[ik] = sqrt(sum((y[ij]/sqrt(sum(y[ij]^2))-y[kj]/sqrt(sum(y[kj]^2)))^2))
- log-chord D, D[ik] = chord D[ik] computed on log(y[ij]+1)-transformed data (Legendre and Borcard 2018)
- chi-square D, D[ik] = sqrt(y[++] sum((1/y[+j])(y[ij]/y[i+]-y[kj]/y[k+])^2))
- species profiles D, D[ik] = sqrt(sum((y[ij]/y[i+]-y[kj]/y[k+])^2))
Group 2 - Other D functions appropriate for beta diversity studies where A = sum(min(y[ij],y[kj])), B = y[i+]-A, C = y[k+]-A
- percentage difference D (also known as Bray-Curtis), D[ik] = sum(abs(y[ij]-y[kj]))/(y[i+]+y[k+]) or else, D[ik] = (B+C)/(2A+B+C)
- Ružička D, D[ik] = 1-(sum(min(y[ij],y[kj]))/sum(max(y[ij],y[kj]))) or else, D[ik] = (B+C)/(A+B+C)
- coeff. of divergence D, D[ik] = sqrt((1/pp)sum(((y[ij]-y[kj])/(y[ij]+y[kj]))^2))
- Canberra metric D, D[ik] = (1/pp)sum(abs(y[ij]-y[kj])/(y[ij]+y[kj]))
- Whittaker D, D[ik] = 0.5*sum(abs(y[ij]/y[i+]-y[kj]/y[k+]))
- Wishart D, D[ik] = 1-sum(y[ij]y[kj])/(sum(y[ij]^2)+sum(y[kj]^2)-sum(y[ij]y[kj]))
- Kulczynski D, D[ik] = 1-0.5(sum(min(y[ij],y[kj]))/y[i+]+sum(min(y[ij],y[kj]))/y[k+])
Group 3 - Classical indices for binary data; they are appropriate for beta diversity studies. Value a is the number of species found in both i and k, b is the number of species in site i not found in k, and c is the number of species found in site k but not in i. The D matrices are square-root transformed, as in dist.binary of ade4; the user-oriented reason for this transformation is explained below.
- Jaccard D, D[ik] = sqrt((b+c)/(a+b+c))
- Sørensen D, D[ik] = sqrt((b+c)/(2a+b+c))
- Ochiai D, D[ik] = sqrt(1 - a/sqrt((a+b)(a+c)))
Group 4 - Abundance-based indices of Chao et al. (2006) for quantitative abundance data. These functions correct the index for species that have not been observed due to sampling errors. For the meaning of the U and V notations, see Chao et al. (2006, section 3). When samp=TRUE, the abundance-based distances (ab.jaccard, ab.sorensen, ab.ochiai, ab.simpson) are computed for sample data. If samp=FALSE, indices are computed for true population data. Do not use indices of group 4 with samp=TRUE on presence-absence data; the indices are not meant to accommodate this type of data. If samp=FALSE is used with presence-absence data, the indices are the regular Jaccard, Sørensen, Ochiai and Simpson indices. On output, however, the D matrices are not square-rooted, contrary to the Jaccard, Sørensen and Ochiai indices of group 3, which are square-rooted.
- abundance-based Jaccard D, D[ik] = 1-(UV/(U+V-UV))
- abundance-based Sørensen D, D[ik] = 1-(2UV/(U+V))
- abundance-based Ochiai D, D[ik] = 1-sqrt(UV)
- abundance-based Simpson D, D[ik] = 1-(UV/(UV+min((U-UV),(V-UV))))
Group 5 - General-purpose dissimilarities that do not have an upper bound (maximum D value). They are inappropriate for beta diversity studies.
- Euclidean D, D[ik] = sqrt(sum((y[ij]-y[kj])^2))
- Manhattan D, D[ik] = sum(abs(y[ij] - y[kj]))
- modified mean character difference, D[ik] = (1/pp) sum(abs(y[ij] - y[kj]))
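
As a check on the notation, a few of the formulas above can be evaluated by hand in base R for a single pair of sites. This is a minimal sketch on made-up data, not the C code used by the function; the matrix Y and all object names are assumptions introduced for illustration.

```r
# Two hypothetical sites (rows) by four species (columns); values are made up
Y <- rbind(site1 = c(4, 0, 2, 1),
           site2 = c(1, 3, 2, 0))

# Group 2 quantities: A = sum of minimum abundances, B and C = remainders
A <- sum(pmin(Y[1, ], Y[2, ]))   # 1 + 0 + 2 + 0 = 3
B <- sum(Y[1, ]) - A             # 7 - 3 = 4
C <- sum(Y[2, ]) - A             # 6 - 3 = 3

# Ruzicka D, computed in both of the forms given in the list
D.ruzicka    <- (B + C) / (A + B + C)
D.ruzicka.mm <- 1 - sum(pmin(Y[1, ], Y[2, ])) / sum(pmax(Y[1, ], Y[2, ]))

# Kulczynski D
D.kulczynski <- 1 - 0.5 * (A / sum(Y[1, ]) + A / sum(Y[2, ]))

# Group 3 quantities on presence-absence data
pres <- Y > 0
n.a <- sum(pres[1, ] & pres[2, ])    # species found in both sites
n.b <- sum(pres[1, ] & !pres[2, ])   # species in site 1 only
n.c <- sum(!pres[1, ] & pres[2, ])   # species in site 2 only

D.sorensen <- sqrt((n.b + n.c) / (2 * n.a + n.b + n.c))
D.ochiai   <- sqrt(1 - n.a / sqrt((n.a + n.b) * (n.a + n.c)))
```

Both forms of the Ružička dissimilarity give the same value here, which is a quick way to confirm that A, B and C have been computed as defined.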

The properties of all dissimilarities available in this function (except Ružička D) were described and compared in Legendre & De Cáceres (2013), who showed that most of these dissimilarities are appropriate for beta diversity studies. Inappropriate are the Euclidean, Manhattan, modified mean character difference, species profile and chi-square distances. Most of these dissimilarities have a maximum value of either 1 or sqrt(2). Three dissimilarities (Euclidean, Manhattan, Modified mean character difference) do not have an upper bound and are thus inappropriate for beta diversity studies. The chi-square distance has an upper bound of sqrt(2*(sum(Y))).
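
The difference between bounded and unbounded dissimilarities can be seen with two sites that share no species. The sketch below uses base R only and made-up data (the matrix Y is an assumption): the Hellinger and chord distances reach their maximum of sqrt(2), the Ružička dissimilarity reaches 1, and the Euclidean distance simply grows with the abundances.

```r
# Two hypothetical sites with no species in common
Y <- rbind(site1 = c(5, 3, 0, 0),
           site2 = c(0, 0, 2, 6))

# Hellinger: square root of relative abundances, then Euclidean distance
D.hel <- as.numeric(dist(sqrt(Y / rowSums(Y))))

# Chord: rows scaled to unit length, then Euclidean distance
D.chord <- as.numeric(dist(Y / sqrt(rowSums(Y^2))))

# Ruzicka: 1 - sum of minima / sum of maxima
D.ruz <- 1 - sum(pmin(Y[1, ], Y[2, ])) / sum(pmax(Y[1, ], Y[2, ]))

# Euclidean distance on raw abundances has no upper bound:
# multiplying the abundances by 10 multiplies the distance by 10
D.euc   <- as.numeric(dist(Y))
D.euc10 <- as.numeric(dist(10 * Y))

c(hellinger = D.hel, chord = D.chord, ruzicka = D.ruz,
  euclidean = D.euc, euclidean.x10 = D.euc10)
```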

The Euclidean, Hellinger, chord, chi-square and species profiles dissimilarities have the property of being Euclidean, meaning that they never produce negative eigenvalues in principal coordinate analysis. The Canberra, Whittaker, percentage difference, Wishart and Manhattan coefficients are Euclidean when they are square-root transformed (Legendre & De Cáceres 2013, Table 2). The distance forms (1-S) of the Jaccard, Sørensen and Ochiai similarity (S) coefficients are Euclidean after taking the square root of (1-S) (Legendre & Legendre 2012, Table 7.2). The D matrices resulting from these three coefficients are outputted in the form sqrt(1-S), as in function dist.binary of ade4, because that form is Euclidean and will thus produce no negative eigenvalues in principal coordinate analysis.
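
The Euclidean property can be checked numerically by inspecting the eigenvalues returned by a principal coordinate analysis. The following sketch uses base R's cmdscale() and dist() on made-up data; the matrix Y and the helper rel.min.eig are assumptions introduced for illustration. The Jaccard form is obtained here from dist(method = "binary"), which returns (b+c)/(a+b+c), and is then square-rooted.

```r
# Five hypothetical sites by four species; values are made up
Y <- rbind(c(10, 0, 3, 1),
           c( 2, 8, 0, 0),
           c( 0, 5, 5, 2),
           c( 7, 1, 0, 4),
           c( 1, 1, 9, 0))

# Hypothetical helper: smallest PCoA eigenvalue relative to the largest.
# A value at or very near zero (rounding error only) indicates that no
# negative eigenvalue was produced.
rel.min.eig <- function(D) {
  eig <- cmdscale(D, k = 2, eig = TRUE)$eig
  min(eig) / max(eig)
}

# Hellinger distance: Euclidean distance on transformed data
D.hel <- dist(sqrt(Y / rowSums(Y)))

# Jaccard dissimilarity in the form sqrt(1 - S)
D.jac <- sqrt(dist(Y, method = "binary"))

rel.hel <- rel.min.eig(D.hel)
rel.jac <- rel.min.eig(D.jac)
```

Running the same helper on a dissimilarity that is not Euclidean may return a clearly negative value, which is the situation the square-root transformation is meant to avoid.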

The Hellinger, chord, chi-square and species profile dissimilarities are computed using the two-step procedure developed by Legendre & Gallagher (2001). The data are first transformed using either the row marginals, or the row and column marginals in the case of the chi-square distance. The dissimilarities are then computed from the transformed data using the Euclidean distance formula. As a consequence, these four dissimilarities are necessarily Euclidean. D matrices for other binary coefficients can be computed in two ways: either by using function dist.binary of ade4, or by choosing option binary=TRUE, which transforms the abundance data to binary form, and using one of the quantitative indices of the present function. Table 1 of Legendre & De Cáceres (2013) shows which incidence-based (presence-absence-based) index each quantitative index produces when it is computed on binary data.
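
The two-step procedure can be sketched in base R: transform the data, then apply dist(), and compare the result with the direct formulas given in group 1. This is an illustration on made-up data and not the C implementation used by the function; the matrix Y and the object names are assumptions.

```r
# Three hypothetical sites by four species; no column sums to zero
Y <- rbind(c(10, 0, 3, 1),
           c( 2, 8, 0, 0),
           c( 0, 5, 5, 2))

# Step 1 (Hellinger): transformation using the row marginals
Y.hel <- sqrt(Y / rowSums(Y))
# Step 2: Euclidean distance on the transformed data
D.hel <- as.matrix(dist(Y.hel))

# Step 1 (chi-square): transformation using row and column marginals
Y.chi <- sqrt(sum(Y)) * sweep(Y / rowSums(Y), 2, sqrt(colSums(Y)), "/")
# Step 2: Euclidean distance on the transformed data
D.chi <- as.matrix(dist(Y.chi))

# Direct formulas for sites 1 and 2, for comparison
p1 <- Y[1, ] / sum(Y[1, ])   # relative abundances, site 1
p2 <- Y[2, ] / sum(Y[2, ])   # relative abundances, site 2
hel.direct <- sqrt(sum((sqrt(p1) - sqrt(p2))^2))
chi.direct <- sqrt(sum(Y) * sum((1 / colSums(Y)) * (p1 - p2)^2))

c(two.step = D.hel[1, 2], direct = hel.direct)
c(two.step = D.chi[1, 2], direct = chi.direct)
```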

The Euclidean distance computed on untransformed presence-absence or abundance data produces non-informative and incorrect ordinations, as shown in Legendre & Legendre (2012, p. 300) and in Legendre & De Cáceres (2013). However, the Euclidean distance computed on log-transformed abundance data produces meaningful ordinations in principal coordinate analysis (PCoA). Nonetheless, it is easier to compute a PCA of log-transformed abundance data instead of a PCoA; the resulting ordination with scaling 1 will be meaningful. Messages are printed to the R console indicating the Euclidean status of the computed dissimilarity matrices. Note that for the chi-square distance, the columns that sum to zero are eliminated before calculation of the distances, thus preventing divisions by zero in the calculation of the chi-square transformation.
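
The need to eliminate empty columns before the chi-square transformation can be seen in a small base R sketch. The helper chi.transform below is hypothetical and only mirrors the chi-square formula of group 1; the data are made up.

```r
# Hypothetical helper: chi-square transformation using row and column sums
chi.transform <- function(Y) {
  sqrt(sum(Y)) * sweep(Y / rowSums(Y), 2, sqrt(colSums(Y)), "/")
}

# Three hypothetical sites by four species; the third column sums to zero
Y <- rbind(c(10, 0, 0, 1),
           c( 2, 8, 0, 0),
           c( 0, 5, 0, 2))

# With the empty column, the transformation divides 0 by 0 and yields NaN
has.nan.raw <- anyNA(chi.transform(Y))

# Dropping columns that sum to zero avoids the division by zero
Y.clean <- Y[, colSums(Y) > 0, drop = FALSE]
has.nan.clean <- anyNA(chi.transform(Y.clean))

D.chi <- dist(chi.transform(Y.clean))
```

An empty column contributes nothing to the differences between site profiles, so removing it does not discard any information about the sites.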

References

Chao, A., R. L. Chazdon, R. K. Colwell and T. J. Shen. 2006. Abundance-based similarity indices and their estimation when there are unseen species in samples. Biometrics 62: 361-371.

Legendre, P. and D. Borcard. 2018. Box-Cox-chord transformations for community composition data prior to beta diversity analysis. Ecography 41: 1820-1824.

Legendre, P. and M. De Cáceres. 2013. Beta diversity as the variance of community data: dissimilarity coefficients and partitioning. Ecology Letters 16: 951-963.

Legendre, P. and E. D. Gallagher. 2001. Ecologically meaningful transformations for ordination of species data. Oecologia 129: 271-280.

Legendre, P. and L. Legendre. 2012. Numerical Ecology. 3rd English edition. Elsevier Science BV, Amsterdam.

Examples

# dist.ldc() is provided by the adespatial package; the mite data come from vegan
if (require("vegan", quietly = TRUE) && require("adespatial", quietly = TRUE)) {
  data(mite)
  mat1 <- as.matrix(mite[1:10, 1:15])    # No column has a sum of 0
  mat2 <- as.matrix(mite[61:70, 1:15])   # 7 of the 15 columns have a sum of 0

  # Example 1: compute Hellinger distance for mat1
  D.out <- dist.ldc(mat1, "hellinger")

  # Example 2: compute chi-square distance for mat2
  # (columns that sum to 0 are eliminated before the calculation)
  D.out <- dist.ldc(mat2, "chisquare")

  # Example 3: compute percentage difference dissimilarity for mat2
  D.out <- dist.ldc(mat2, "percentdiff")
}
