
StatMatch (version 1.4.2)

comp.cont: Empirical comparison of two estimated distributions of the same continuous variable

Description

This function measures the “closeness” of the distributions of the same continuous variable when they are estimated from different data sources.

Usage

comp.cont(data.A, data.B, xlab.A, xlab.B = NULL, w.A = NULL, w.B = NULL, ref = FALSE)

Value

A list object with four components.

summary

A matrix with summaries of xlab.A estimated on data.A and summaries of xlab.B estimated on data.B

diff.Qs

Average of absolute and squared differences between the quantiles of xlab.A estimated on data.A and the corresponding ones of xlab.B estimated on data.B

dist.ecdf

Dissimilarity measures between the estimated empirical cumulative distribution functions.

dist.discr

Distance between the distributions after discretization of the target variable.

Arguments

data.A

A data frame or matrix containing the variable of interest xlab.A and, optionally, the survey weights w.A.

data.B

A data frame or matrix containing the variable of interest xlab.B and, optionally, the associated survey weights w.B.

xlab.A

Character string providing the name of the variable in data.A whose estimated distribution should be compared with that estimated from data.B.

xlab.B

Character string providing the name of the variable in data.B whose estimated distribution should be compared with that estimated from data.A. If xlab.B=NULL (default), it is assumed that xlab.B=xlab.A.

w.A

Character string providing the name of the optional weighting variable in data.A; when supplied, the weights are used to estimate the distribution of xlab.A.

w.B

Character string providing the name of the optional weighting variable in data.B; when supplied, the weights are used to estimate the distribution of xlab.B.

ref

Logical. When ref = TRUE, the distribution of xlab.B estimated from data.B is considered the reference distribution (the true distribution or a reliable estimate of it). This affects some estimation procedures, as explained in Details.

Author

Marcello D'Orazio mdo.statmatch@gmail.com

Details

As a first output, the function returns some well-known summary measures (min, Q1, median, mean, Q3, max and sd) estimated from the available input data sources.

Second, the function compares the quantiles estimated from data.A and data.B; in particular, it returns the average of the absolute differences and the average of the squared differences. The number of estimated quantiles depends on the smaller of the two sample sizes. Only quartiles are calculated when min(n.A, n.B)<=50; quintiles are estimated when min(n.A, n.B)>50 and min(n.A, n.B)<=150; deciles are estimated when min(n.A, n.B)>150 and min(n.A, n.B)<=250; finally, quantiles for probs=seq(from = 0.05,to = 0.95,by = 0.05) are estimated when min(n.A, n.B)>250. When survey weights are available (indicated with the arguments w.A and/or w.B), they are used in estimating the quantiles by calling the function wtd.quantile in the package Hmisc.
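The sample-size rule above can be sketched in a few lines of base R. This is only an illustration of the rule as described, not the package's code: choose_probs is a hypothetical helper, the interior probability grids for quartiles, quintiles and deciles are an assumption, the data are simulated, and the unweighted quantile function is used in place of wtd.quantile.

```r
# Hypothetical helper (not part of StatMatch): probabilities implied by the
# sample-size rule described in Details. The exact grids are an assumption.
choose_probs <- function(n.A, n.B) {
  n.min <- min(n.A, n.B)
  if (n.min <= 50) {
    seq(0.25, 0.75, by = 0.25)   # quartiles
  } else if (n.min <= 150) {
    seq(0.20, 0.80, by = 0.20)   # quintiles
  } else if (n.min <= 250) {
    seq(0.10, 0.90, by = 0.10)   # deciles
  } else {
    seq(0.05, 0.95, by = 0.05)   # every 5th percentile
  }
}

# Simulated data standing in for xlab.A in data.A and xlab.B in data.B
set.seed(1)
x.A <- rnorm(120, mean = 50, sd = 10)
x.B <- rnorm(300, mean = 52, sd = 10)

probs <- choose_probs(length(x.A), length(x.B))  # min is 120, so quintiles
q.A <- quantile(x.A, probs = probs)              # unweighted quantiles
q.B <- quantile(x.B, probs = probs)

# Average absolute and average squared difference between the quantiles
diff.Qs <- c(avg.abs = mean(abs(q.A - q.B)),
             avg.sq  = mean((q.A - q.B)^2))
diff.Qs
```

With weights, the two quantile calls would be replaced by weighted estimates; the averaging step stays the same.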

The function also estimates the dissimilarities between the estimated empirical cumulative distribution functions. The measures considered are the maximum of the absolute differences, the sum of the two one-sided maximum differences (the maximum difference computed in each direction, i.e. inverting the terms of the difference) and the average of the absolute differences. When the weights are provided, they are used in estimating the empirical cumulative distribution functions. Note that when ref=TRUE the estimation of the density and of the empirical cumulative distribution function is guided by the data in data.B.
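As a rough illustration of these three measures, the following base R sketch compares two unweighted empirical cumulative distribution functions. The evaluation grid (the pooled observed values) and the names ks, kuiper and avg.abs are assumptions made for this example; the first two measures have the form of the Kolmogorov-Smirnov and Kuiper statistics, and the package may evaluate or label them differently.

```r
# Simulated data standing in for xlab.A in data.A and xlab.B in data.B
set.seed(2)
x.A <- rnorm(200, mean = 50, sd = 10)
x.B <- rnorm(300, mean = 52, sd = 12)

F.A <- ecdf(x.A)   # unweighted empirical cdf of the first sample
F.B <- ecdf(x.B)   # unweighted empirical cdf of the second sample

# Assumption: evaluate both step functions on the pooled observed values
grid <- sort(unique(c(x.A, x.B)))
d <- F.A(grid) - F.B(grid)

ks      <- max(abs(d))        # maximum absolute difference
kuiper  <- max(d) + max(-d)   # sum of the two one-sided maximum differences
avg.abs <- mean(abs(d))       # average absolute difference

c(ks = ks, kuiper = kuiper, avg.abs = avg.abs)
```

Because both functions equal 1 at the largest grid point, each one-sided maximum is non-negative, so the second measure is never smaller than the first.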

The final output consists of the total variation distance, the overlap and the Hellinger distance, calculated on the categorized (discretized) variable. The breaks used to categorize the variable are chosen according to the Freedman-Diaconis rule (nclass); when ref=TRUE the IQR is estimated solely on data.B, whereas with ref=FALSE it is estimated by pooling the two data sources. When present, the weights are used in estimating the relative frequencies of the categorized variable. For additional details on these distances, see comp.prop.
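The discretization step can be sketched as follows for the unweighted case with ref=FALSE. This is an illustration under stated assumptions rather than the package's implementation: the equal-width breaks spanning the pooled range are an assumption, the Hellinger distance is written as the square root of one minus the Bhattacharyya coefficient, and the data are simulated. nclass.FD is the Freedman-Diaconis rule available in grDevices.

```r
# Simulated data standing in for xlab.A in data.A and xlab.B in data.B
set.seed(3)
x.A <- rnorm(200, mean = 50, sd = 10)
x.B <- rnorm(300, mean = 52, sd = 12)

# ref = FALSE: pool both sources (with ref = TRUE only x.B would be used here)
pooled <- c(x.A, x.B)
k <- grDevices::nclass.FD(pooled)   # Freedman-Diaconis number of classes

# Assumption: equal-width classes covering the pooled range
breaks <- seq(min(pooled), max(pooled), length.out = k + 1)

# Relative frequencies of the categorized variable in each source
p.A <- as.numeric(prop.table(table(cut(x.A, breaks, include.lowest = TRUE))))
p.B <- as.numeric(prop.table(table(cut(x.B, breaks, include.lowest = TRUE))))

tvd       <- 0.5 * sum(abs(p.A - p.B))                  # total variation distance
overlap   <- sum(pmin(p.A, p.B))                        # overlap between the two
hellinger <- sqrt(max(0, 1 - sum(sqrt(p.A * p.B))))     # Hellinger distance

c(tvd = tvd, overlap = overlap, hellinger = hellinger)
```

With these definitions the overlap equals one minus the total variation distance, which gives a quick consistency check on the result.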

References

Bellhouse D.R. and J.E. Stafford (1999). “Density Estimation from Complex Surveys”. Statistica Sinica, 9, 407-424.

See Also

plotCont, comp.prop

Examples

data(samp.A)
data(samp.B)

comp.cont(data.A = samp.A, data.B = samp.B, xlab.A = "age")

comp.cont(data.A = samp.A, data.B = samp.B, xlab.A = "age",
          w.A = "ww", w.B = "ww")
