This function computes similarity and dissimilarity measures between the marginal (or joint) distributions of one or more categorical variables.
The following measures are considered:
Dissimilarity index or total variation distance:
$$\Delta_{12} = \frac{1}{2} \sum_{j=1}^J \left| p_{1,j} - p_{2,j} \right|$$
where \(p_{s,j}\) are the relative frequencies (\(0 \leq p_{s,j} \leq 1\)) of category \(j\) in distribution \(s\). The dissimilarity index ranges from 0 (minimum dissimilarity) to 1. It can be interpreted as the smallest fraction of units that need to be reclassified in order to make the distributions equal. When p2 is the reference distribution (true or expected distribution under a given hypothesis), then, following Agresti's rule of thumb (Agresti 2002, pp. 329--330), values of \(\Delta_{12} \leq 0.03\) denote that the estimated distribution p1 follows the true or expected pattern quite closely.
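As an illustration of the formula above, the following is a minimal Python sketch of the dissimilarity index. The function name `dissimilarity_index` and the example frequencies are hypothetical and are not part of this function's interface.

```python
def dissimilarity_index(p1, p2):
    """Half the sum of absolute differences between two sets of
    relative frequencies defined over the same J categories."""
    if len(p1) != len(p2):
        raise ValueError("p1 and p2 must have the same number of categories")
    return 0.5 * sum(abs(a - b) for a, b in zip(p1, p2))


# Illustrative relative frequencies (made up for this example)
p1 = [0.50, 0.30, 0.20]
p2 = [0.40, 0.40, 0.20]

delta = dissimilarity_index(p1, p2)
print(round(delta, 4))
```

With these made-up values the index is 0.1: moving 10% of the units from the first to the second category of p1 would make the two distributions equal. This is above the 0.03 threshold of the rule of thumb.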
Overlap between two distributions:
$$O_{12} = \sum_{j=1}^J \min(p_{1,j},p_{2,j}) $$
It is a measure of similarity which ranges from 0 to 1, with 1 indicating that the distributions are equal. It is worth noting that \(O_{12}=1-\Delta_{12}\).
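The relation between the overlap and the dissimilarity index can be checked numerically. The sketch below is an illustration only; the helper names `overlap` and `dissimilarity_index` and the example frequencies are assumptions made for this example.

```python
def overlap(p1, p2):
    """Sum over categories of the smaller of the two relative frequencies."""
    return sum(min(a, b) for a, b in zip(p1, p2))


def dissimilarity_index(p1, p2):
    """Half the sum of absolute differences between relative frequencies."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p1, p2))


# Illustrative relative frequencies (made up for this example)
p1 = [0.50, 0.30, 0.20]
p2 = [0.40, 0.40, 0.20]

o12 = overlap(p1, p2)
delta = dissimilarity_index(p1, p2)

# The two quantities should sum to 1, up to floating-point rounding
print(round(o12, 4), round(1 - delta, 4))
```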
Bhattacharyya coefficient:
$$B_{12} = \sum_{j=1}^J \sqrt{p_{1,j} \times p_{2,j}} $$
It is a measure of similarity and ranges from 0 to 1, with 1 indicating that the distributions are equal.
Hellinger's distance:
$$d_{H,12} = \sqrt{1-B_{12}} $$
It is a dissimilarity measure ranging from 0 (distributions are equal) to 1 (max dissimilarity). It satisfies all the properties of a distance measure (\(0 \leq d_{H,12} \leq 1\); symmetry and triangle inequality).
Hellinger's distance is related to the dissimilarity index, and it is possible to show that:
$$d_{H,12}^2 \leq \Delta_{12} \leq d_{H,12}\sqrt{2} $$
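The Bhattacharyya coefficient, Hellinger's distance and the bounds above can be sketched as follows. This is an illustration under stated assumptions: the function names and the example frequencies are hypothetical, and the `max(0.0, ...)` guard is added here only to protect against tiny negative values caused by floating-point rounding.

```python
import math


def bhattacharyya(p1, p2):
    """Sum over categories of the square root of the product of frequencies."""
    return sum(math.sqrt(a * b) for a, b in zip(p1, p2))


def hellinger(p1, p2):
    """Square root of one minus the Bhattacharyya coefficient."""
    # Guard against 1 - B being slightly negative because of rounding
    return math.sqrt(max(0.0, 1.0 - bhattacharyya(p1, p2)))


def dissimilarity_index(p1, p2):
    """Half the sum of absolute differences between relative frequencies."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p1, p2))


# Illustrative relative frequencies (made up for this example)
p1 = [0.50, 0.30, 0.20]
p2 = [0.40, 0.40, 0.20]

b12 = bhattacharyya(p1, p2)
d_h = hellinger(p1, p2)
delta = dissimilarity_index(p1, p2)

# Check the bounds d_H^2 <= Delta <= d_H * sqrt(2) on this example
lower, upper = d_h ** 2, d_h * math.sqrt(2)
print(lower <= delta <= upper)
```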
Alongside these similarity/dissimilarity measures, Pearson's Chi-squared statistic is computed. Two formulas are considered. When p2 is the reference distribution (true or expected under some hypothesis, ref=TRUE):
$$ \chi^2_P = n_1 \sum_{j=1}^J \frac{\left( p_{1,j} - p_{2,j}\right)^2}{p_{2,j}} $$
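A minimal sketch of the reference-distribution formula is given below. The function name `chi2_reference`, the sample size and the frequencies are hypothetical; rejecting reference frequencies that are not strictly positive is an assumption of this sketch, since the formula divides by them.

```python
def chi2_reference(p1, p2, n1):
    """Pearson's Chi-squared of p1 (estimated on n1 units) against the
    reference distribution p2."""
    if len(p1) != len(p2):
        raise ValueError("p1 and p2 must have the same number of categories")
    if any(b <= 0 for b in p2):
        # Assumption of this sketch: the formula divides by p2, so every
        # reference frequency must be strictly positive
        raise ValueError("reference frequencies must be strictly positive")
    return n1 * sum((a - b) ** 2 / b for a, b in zip(p1, p2))


# Illustrative values (made up for this example)
p1 = [0.50, 0.30, 0.20]   # estimated on a sample of n1 units
p2 = [0.40, 0.40, 0.20]   # reference distribution
n1 = 200

chi2_ref = chi2_reference(p1, p2, n1)
print(round(chi2_ref, 4))
```

For these values the statistic is 200 * (0.01/0.4 + 0.01/0.4 + 0) = 10.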
When p2 is a distribution estimated on a second sample, then:
$$ \chi^2_P = \sum_{i=1}^2 \sum_{j=1}^J n_i \frac{\left( p_{i,j} - p_{+,j}\right)^2}{p_{+,j}} $$
where \(p_{+,j}\) is the expected frequency for category j, obtained as follows:
$$ p_{+,j} = \frac{n_1 p_{1,j} + n_2 p_{2,j}}{n_1+n_2} $$
with \(n_1\) and \(n_2\) being the sizes of the two samples.
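The two-sample formula can be sketched in the same way. The function name `chi2_two_samples`, the sample sizes and the frequencies are hypothetical; skipping categories whose pooled frequency is zero is an assumption of this sketch, not something stated above.

```python
def chi2_two_samples(p1, p2, n1, n2):
    """Pearson's Chi-squared comparing two estimated distributions, using
    the pooled frequencies as expected values."""
    if len(p1) != len(p2):
        raise ValueError("p1 and p2 must have the same number of categories")
    total = 0.0
    for a, b in zip(p1, p2):
        pooled = (n1 * a + n2 * b) / (n1 + n2)
        if pooled == 0:
            # Assumption: a category empty in both samples contributes nothing
            continue
        total += n1 * (a - pooled) ** 2 / pooled
        total += n2 * (b - pooled) ** 2 / pooled
    return total


# Illustrative values (made up for this example)
p1, n1 = [0.50, 0.30, 0.20], 200
p2, n2 = [0.40, 0.40, 0.20], 300

chi2_two = chi2_two_samples(p1, p2, n1, n2)
print(round(chi2_two, 4))
```

Here the pooled frequencies are 0.44, 0.36 and 0.20, and the statistic equals 200/33 (about 6.06), the same value obtained from the usual Chi-squared computation on the 2 x 3 table of counts (100, 60, 40) and (120, 120, 60).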
The Chi-square value can be used to test the hypothesis that two distributions are equal (\(df=J-1\)). Unfortunately, such a test would not be useful when the distributions are estimated from samples selected from a finite population using complex selection schemes (stratification, clustering, etc.). In such cases, alternative corrected Chi-square tests are available (cf. Sarndal et al., 1992, Sec. 13.5). One possibility consists in dividing the Pearson's Chi-square statistic by the generalised design effect of both surveys. Its estimation is not straightforward (the sampling design variables need to be available). Generally speaking, the generalised design effect is smaller than 1 in the presence of stratified random sampling designs, while it exceeds 1 in the presence of a two-stage cluster sampling design.
For the purposes of analysis, the function reports the value of the generalised design effect g that would lead to acceptance of the null hypothesis (equality of distributions) in the case of \(\alpha = 0.05\) (\(df = J-1\)), i.e. values of g such that
$$ \frac{\chi^2_P}{g} \leq \chi^2_{J-1,0.05} $$
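Rearranging the inequality gives \(g \geq \chi^2_P / \chi^2_{J-1,0.05}\), so the smallest such g is the ratio of the statistic to the critical value. The sketch below illustrates this with hypothetical names and a made-up Chi-squared value. To stay within the Python standard library it covers only the case \(J = 3\) (\(df = 2\)), where the Chi-squared distribution is exponential with mean 2 and the critical value has the closed form \(-2 \ln(0.05)\); for other degrees of freedom the critical value would have to come from a statistical library or a table.

```python
import math


def min_design_effect(chi2_p, critical_value):
    """Smallest g such that chi2_p / g does not exceed the critical value."""
    if critical_value <= 0:
        raise ValueError("critical_value must be positive")
    return chi2_p / critical_value


# Upper 5% critical value of the Chi-squared distribution with df = 2.
# This closed form holds ONLY for df = 2.
alpha = 0.05
crit_df2 = -2.0 * math.log(alpha)

# Illustrative Pearson Chi-squared value for J = 3 categories (made up)
chi2_p = 10.0

g_min = min_design_effect(chi2_p, crit_df2)
print(round(crit_df2, 4), round(g_min, 4))
```

In this made-up case g must be at least about 1.67 for the null hypothesis to be accepted, that is, a design effect larger than 1.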