comp.cont: Compares two distributions of the same continuous variable

Description

This function estimates the “closeness” of the distributions of the same continuous variable(s) estimated from different data sources.

Usage

comp.cont(data.A, data.B, xlab.A, xlab.B = NULL, w.A = NULL, 
          w.B = NULL, ref = FALSE)

Value

A list object with four components.

summary: A matrix with summaries of xlab.A estimated on data.A and summaries of xlab.B estimated on data.B
diff.Qs: Average of absolute and squared differences between the quantiles of xlab.A estimated on data.A and the corresponding ones of xlab.B estimated on data.B
dist.ecdf: Dissimilarity measures between the estimated empirical cumulative distribution functions.
dist.discr: Distance between the distributions after discretization of the target variable.

Arguments

data.A: A data frame or matrix containing the variable of interest xlab.A and, if available, the associated survey weights w.A.
data.B: A data frame or matrix containing the variable of interest xlab.B and, if available, the associated survey weights w.B.
xlab.A: Character string providing the name of the variable in data.A whose estimated distribution should be compared with that estimated from data.B.
xlab.B: Character string providing the name of the variable in data.B whose distribution should be compared with that estimated from data.A. If xlab.B=NULL (default), it is assumed that xlab.B=xlab.A.
w.A: Character string providing the name of the optional weighting variable in data.A. When supplied, the weights are used to estimate the distribution of xlab.A.
w.B: Character string providing the name of the optional weighting variable in data.B. When supplied, the weights are used to estimate the distribution of xlab.B.
ref: Logical. When ref = TRUE, the distribution of xlab.B estimated from data.B is considered the reference distribution (the true distribution or a reliable estimate of it).
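
As an illustration of how the arguments fit together, the following is a minimal sketch of a call on simulated data. It assumes that comp.cont is provided by the StatMatch package (the package is not named on this page, so treat this as an assumption); the data frames and the column name "x" are invented for the example, and no weights are passed.

```r
# Simulated data: the same variable "x" observed in two sources (illustrative only)
set.seed(1)
data.A <- data.frame(x = rnorm(200, mean = 0.0, sd = 1))
data.B <- data.frame(x = rnorm(300, mean = 0.2, sd = 1))

# Assumption: comp.cont lives in the StatMatch package; skip the call if it is absent
if (requireNamespace("StatMatch", quietly = TRUE)) {
  out <- StatMatch::comp.cont(data.A = data.A, data.B = data.B,
                              xlab.A = "x",   # xlab.B = NULL, so "x" is used in data.B too
                              ref = TRUE)     # treat data.B as the reference distribution
  print(names(out))
}
```

According to the Value section, the returned list should contain the components summary, diff.Qs, dist.ecdf and dist.discr.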

Author

Marcello D'Orazio mdo.statmatch@gmail.com

Details

This function calculates well-known summary measures (min, Q1, median, mean, Q3, max and sd) estimated from the available data. It also compares the quantiles estimated from data.A with those estimated from data.B and returns the average of the absolute differences and the average of the squared differences. Note that the number of estimated percentiles depends on the smaller of the two sample sizes. Only quartiles are calculated if min(n.A, n.B)<=50; quintiles are estimated if min(n.A, n.B)>50 and min(n.A, n.B)<=150; deciles are estimated if min(n.A, n.B)>150 and min(n.A, n.B)<=250; finally, quantiles for probs=seq(from = 0.05,to = 0.95,by = 0.05) are estimated when min(n.A, n.B)>250. If survey weights are available (indicated by w.A and/or w.B), they are used to estimate the quantiles by calling the function wtd.quantile in the package Hmisc.
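
The sample-size rule and the two averages can be sketched in base R as follows. This is an unweighted illustration, not the function's actual code: the helper choose_probs is hypothetical, the use of interior probabilities for quartiles, quintiles and deciles is an assumption, and quantile() is used with its default estimator.

```r
# Hypothetical helper: choose the probabilities from the smaller sample size
choose_probs <- function(n.A, n.B) {
  n.min <- min(n.A, n.B)
  if (n.min <= 50) {
    seq(0.25, 0.75, by = 0.25)   # quartiles
  } else if (n.min <= 150) {
    seq(0.20, 0.80, by = 0.20)   # quintiles
  } else if (n.min <= 250) {
    seq(0.10, 0.90, by = 0.10)   # deciles
  } else {
    seq(0.05, 0.95, by = 0.05)   # every 5th percentile
  }
}

# Simulated samples (illustrative only)
set.seed(123)
x.A <- rnorm(120, mean = 10.0, sd = 2)
x.B <- rnorm(400, mean = 10.5, sd = 2)

probs <- choose_probs(length(x.A), length(x.B))  # min(120, 400) = 120, so quintiles
q.A <- quantile(x.A, probs = probs)
q.B <- quantile(x.B, probs = probs)

# Average absolute and average squared difference between matching quantiles
diff.Qs <- c(avg.abs = mean(abs(q.A - q.B)),
             avg.sq  = mean((q.A - q.B)^2))
print(diff.Qs)
```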

The function also calculates dissimilarities between the estimated empirical cumulative distribution functions. The measures considered are the maximum value of the differences, the sum of the absolute values of the minimum and maximum differences, and the average of the absolute differences. If weights are given, they are used in the estimation of the empirical cumulative distribution function. Note that when ref=TRUE, the estimation of the density and of the empirical cumulative distribution function is guided by the data in data.B.
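
A minimal unweighted sketch of these three measures is given below. It assumes that the two empirical distribution functions are evaluated on the pooled observed values and that the "maximum value of the differences" is taken in absolute value; neither choice is confirmed by this page, and the function may use a different evaluation grid (for example one driven by data.B when ref=TRUE).

```r
# Simulated samples (illustrative only)
set.seed(123)
x.A <- rnorm(120, mean = 10.0, sd = 2)
x.B <- rnorm(400, mean = 10.5, sd = 2)

# Empirical cumulative distribution functions (unweighted)
F.A <- ecdf(x.A)
F.B <- ecdf(x.B)

# Assumption: compare the two step functions at every pooled observed value
grid <- sort(unique(c(x.A, x.B)))
d <- F.A(grid) - F.B(grid)

dist.ecdf <- c(
  max.abs    = max(abs(d)),               # largest absolute difference
  sum.minmax = abs(min(d)) + abs(max(d)), # |min difference| + |max difference|
  avg.abs    = mean(abs(d))               # average absolute difference
)
print(dist.ecdf)
```

By construction the average cannot exceed the maximum absolute difference, which in turn cannot exceed the sum of the absolute minimum and maximum.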

Finally, the total variation distance, the overlap and Hellinger's distance are calculated on the categorised version of the variable. Note that the breaks used to categorise the variable are chosen according to the Freedman-Diaconis rule (nclass); with ref=TRUE the IQR is estimated on data.B alone, whereas with ref=FALSE it is estimated by combining the two data sources. If present, the weights are used to estimate the relative frequencies of the categorised variable.

Total variation distance:

$$\Delta_{AB} = \frac{1}{2} \sum_{j=1}^J \left| p_{A,j} - p_{B,j} \right|$$

where $p_{s,j}$ are the relative frequencies ($0 \leq p_{s,j} \leq 1$) of category $j$ in data source $s$. The dissimilarity index ranges from 0 (minimum dissimilarity) to 1 (maximum dissimilarity). The total variation distance is reported together with its complement to 1, called the “overlap” between the distributions.
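
The following base R sketch illustrates the ref=FALSE case without weights: the number of classes comes from nclass.FD applied to the pooled data, and the relative frequencies are then compared. Using equally spaced breaks over the pooled range is an assumption made for the example; the function's own break construction may differ.

```r
# Simulated samples (illustrative only)
set.seed(123)
x.A <- rnorm(120, mean = 10.0, sd = 2)
x.B <- rnorm(400, mean = 10.5, sd = 2)

# Freedman-Diaconis number of classes on the pooled data (ref = FALSE case)
pooled <- c(x.A, x.B)
k <- nclass.FD(pooled)

# Assumption: equally spaced breaks spanning the pooled range
breaks <- seq(min(pooled), max(pooled), length.out = k + 1)

# Relative frequencies of the categorised variable in each source
p.A <- prop.table(table(cut(x.A, breaks, include.lowest = TRUE)))
p.B <- prop.table(table(cut(x.B, breaks, include.lowest = TRUE)))

tvd     <- 0.5 * sum(abs(p.A - p.B))  # total variation distance
overlap <- 1 - tvd                    # complement to 1
print(c(tvd = tvd, overlap = overlap))
```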

Hellinger's distance:

$$d_{H,AB} = \sqrt{ \frac{1}{2} \sum_{j=1}^J \left( \sqrt{p_{A,j}} - \sqrt{p_{B,j}} \right)^2 } $$

It is a dissimilarity measure ranging from 0 (distributions are equal) to 1 (max dissimilarity). It satisfies all the properties of a distance measure ($0 \leq d_{H,AB} \leq 1$; symmetry and triangle inequality). Hellinger's distance is related to the total variation distance, and it is possible to show that:

$$d_{H,AB}^2 \leq \Delta_{AB} \leq d_{H,AB}\sqrt{2} $$
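
The two formulas and the inequality can be checked numerically on a small example. The relative frequencies below are invented for illustration and do not come from any real data set.

```r
# Invented relative frequencies over J = 4 categories (each vector sums to 1)
p.A <- c(0.10, 0.20, 0.40, 0.30)
p.B <- c(0.25, 0.25, 0.25, 0.25)

# Total variation distance and Hellinger's distance, as defined above
tvd  <- 0.5 * sum(abs(p.A - p.B))
hell <- sqrt(0.5 * sum((sqrt(p.A) - sqrt(p.B))^2))

# The three terms of the inequality, from lower bound to upper bound
bounds <- c(lower = hell^2, tvd = tvd, upper = hell * sqrt(2))
print(bounds)
```

The printed values should be in non-decreasing order, which is what the inequality states.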

References

Bellhouse D.R. and J.E. Stafford (1999). “Density Estimation from Complex Surveys”. Statistica Sinica, 9, 407-424.