A vector containing the pairwise two-sample multivariate
\(\mathcal{E}\)-statistics for comparing clusters or samples is returned.
The e-distance between clusters is computed from the original pooled data,
stacked in matrix x where each row is a multivariate observation, or
from the distance matrix x of the original data, or from a distance object
returned by dist. The first sizes[1] rows of the original data
matrix are the first sample, the next sizes[2] rows are the second
sample, etc. The permutation vector ix may be used to obtain
e-distances corresponding to a clustering solution at a given level in
the hierarchy.
The default method cluster summarizes the e-distances between
clusters in a table.
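The row layout described above can be sketched as follows. This is a minimal illustration in Python, not the package's implementation: the helper name split_samples is hypothetical, indices are 0-based (unlike R), and applying ix to reorder the rows before splitting them by sizes is an assumption about the convention.

```python
def split_samples(rows, sizes, ix=None):
    """Split pooled rows into consecutive samples of the given sizes.

    rows  : list of observations (one per row of the pooled data)
    sizes : list of sample sizes; must sum to len(rows)
    ix    : optional permutation of 0-based row indices, applied before
            splitting (assumed convention, for illustration only)
    """
    if sum(sizes) != len(rows):
        raise ValueError("sizes must sum to the number of rows")
    if ix is None:
        ix = list(range(len(rows)))
    if sorted(ix) != list(range(len(rows))):
        raise ValueError("ix must be a permutation of the row indices")

    ordered = [rows[i] for i in ix]
    samples, start = [], 0
    for n in sizes:
        # The next n rows (in the order given by ix) form one sample.
        samples.append(ordered[start:start + n])
        start += n
    return samples


pooled = [[0.0], [1.0], [2.0], [3.0], [4.0]]
first, second = split_samples(pooled, sizes=[2, 3])
```

With sizes = [2, 3], the first two rows form the first sample and the remaining three form the second; passing a different ix regroups the same rows without reordering the data itself.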
The e-distance between two clusters \(C_i, C_j\) of sizes \(n_i, n_j\),
proposed by Székely and Rizzo (2005), is defined by
$$e(C_i,C_j)=\frac{n_i n_j}{n_i+n_j}[2M_{ij}-M_{ii}-M_{jj}],$$
where
$$M_{ij}=\frac{1}{n_i n_j}\sum_{p=1}^{n_i} \sum_{q=1}^{n_j}
\|X_{ip}-X_{jq}\|^\alpha,$$
\(\|\cdot\|\) denotes the Euclidean norm, \(\alpha=\) alpha, and
\(X_{ip}\) denotes the p-th observation in the i-th cluster. The
exponent alpha should be in the interval (0,2].
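As an illustration of the definition, the statistic for a single pair of clusters can be computed directly from the formulas for \(M_{ij}\) and \(e(C_i,C_j)\). The sketch below is plain Python and is not the package's code; the function names mean_dist and e_distance are hypothetical, and the brute-force double loop is chosen for clarity rather than speed.

```python
import math


def mean_dist(a, b, alpha=1.0):
    """M_ij: mean of ||x - y||^alpha over all x in a and y in b."""
    total = sum(math.dist(x, y) ** alpha for x in a for y in b)
    return total / (len(a) * len(b))


def e_distance(a, b, alpha=1.0):
    """e(C_i, C_j) = n_i n_j / (n_i + n_j) * (2 M_ij - M_ii - M_jj)."""
    if not 0 < alpha <= 2:
        raise ValueError("alpha should be in the interval (0, 2]")
    n_i, n_j = len(a), len(b)
    coef = n_i * n_j / (n_i + n_j)
    return coef * (2 * mean_dist(a, b, alpha)
                   - mean_dist(a, a, alpha)
                   - mean_dist(b, b, alpha))


# Two small clusters on a line in the plane.
c1 = [(0.0, 0.0), (1.0, 0.0)]
c2 = [(3.0, 0.0), (4.0, 0.0)]
print(e_distance(c1, c2))  # 5.0
```

Here \(M_{12}=3\), \(M_{11}=M_{22}=0.5\), and the coefficient is \(2\cdot 2/(2+2)=1\), so the statistic is \(2\cdot 3-0.5-0.5=5\). Comparing a cluster with itself gives 0.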
The coefficient \(\frac{n_i n_j}{n_i+n_j}\)
is one-half of the harmonic mean of the sample sizes. The
discoB method is related, but it summarizes the pairwise differences
between samples differently.
The disco methods apply the coefficient
\(\frac{n_i n_j}{2N}\), where N is the total number
of observations. This weights each (i,j) statistic by sample size
relative to N. See the disco topic for more details.
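To make the difference between the two weightings concrete, the coefficients can be compared numerically. This is a small Python sketch with hypothetical function names (cluster_coef, disco_coef); it evaluates only the two coefficients stated in the text and does not reproduce either method.

```python
def cluster_coef(n_i, n_j):
    """Coefficient n_i n_j / (n_i + n_j): half the harmonic mean of the sizes."""
    return n_i * n_j / (n_i + n_j)


def disco_coef(n_i, n_j, n_total):
    """Coefficient n_i n_j / (2N), with N the total number of observations."""
    return n_i * n_j / (2 * n_total)


sizes = [10, 20, 30]   # illustrative sample sizes, not taken from any data set
n_total = sum(sizes)   # N = 60

for i in range(len(sizes)):
    for j in range(i + 1, len(sizes)):
        print(i, j,
              cluster_coef(sizes[i], sizes[j]),
              disco_coef(sizes[i], sizes[j], n_total))
```

When there are only two samples, \(N=n_i+n_j\), so the second coefficient is exactly half of the first. With more samples, N exceeds \(n_i+n_j\) and the second coefficient shrinks further relative to the first.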