This function estimates the “closeness” of the distributions of the same continuous variable(s) when they are estimated from different data sources.
comp.cont(data.A, data.B, xlab.A, xlab.B = NULL, w.A = NULL, w.B = NULL, ref = FALSE)
A list object with four components:
A matrix with summaries of xlab.A estimated on data.A and summaries of xlab.B estimated on data.B.
Average of absolute and squared differences between the quantiles of xlab.A estimated on data.A and the corresponding ones of xlab.B estimated on data.B.
Dissimilarity measures between the estimated empirical cumulative distribution functions.
Distance between the distributions after discretization of the target variable.
data.A: A data frame or matrix containing the variable of interest xlab.A and, optionally, the survey weights w.A.
data.B: A data frame or matrix containing the variable of interest xlab.B and, optionally, the associated survey weights w.B.
xlab.A: Character string providing the name of the variable in data.A whose estimated distribution should be compared with that estimated from data.B.
xlab.B: Character string providing the name of the variable in data.B whose distribution should be compared with that estimated from data.A. If xlab.B=NULL (default), it is assumed that xlab.B=xlab.A.
w.A: Character string providing the name of the optional weighting variable in data.A; when supplied, it is used to estimate the distribution of xlab.A.
w.B: Character string providing the name of the optional weighting variable in data.B; when supplied, it is used to estimate the distribution of xlab.B.
ref: Logical. When ref = TRUE, the distribution of xlab.B estimated from data.B is considered the reference distribution (the true distribution or a reliable estimate of it). This affects some estimation procedures, as explained in the Details.
Marcello D'Orazio mdo.statmatch@gmail.com
As a first output, the function returns some well-known summary measures (min, Q1, median, mean, Q3, max and sd) estimated from the available input data sources.
Second, the function compares the quantiles estimated from data.A and data.B; in particular, it returns the average of the absolute differences and the average of the squared differences. The number of estimated quantiles depends on the smaller of the two sample sizes: only quartiles are calculated when min(n.A, n.B)<=50; quintiles are estimated when min(n.A, n.B)>50 and min(n.A, n.B)<=150; deciles are estimated when min(n.A, n.B)>150 and min(n.A, n.B)<=250; finally, quantiles for probs=seq(from = 0.05,to = 0.95,by = 0.05) are estimated when min(n.A, n.B)>250. When survey weights are available (indicated with the arguments w.A and/or w.B), they are used in estimating the quantiles by calling the function wtd.quantile in the package Hmisc.
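The sample-size rule and the two averages described above can be sketched in base R. This is only an illustration, not the package's internal code: `choose_probs` is a hypothetical helper, the data are simulated, the quantiles are unweighted and use the default estimator of `quantile()`, and the choice of interior probabilities for quartiles, quintiles and deciles is an assumption.

```r
# Hypothetical helper (not part of the package): picks the probabilities
# according to the smaller of the two sample sizes.
choose_probs <- function(n.A, n.B) {
  n <- min(n.A, n.B)
  if (n <= 50) {
    c(0.25, 0.50, 0.75)                      # quartiles
  } else if (n <= 150) {
    seq(from = 0.2, to = 0.8, by = 0.2)      # quintiles
  } else if (n <= 250) {
    seq(from = 0.1, to = 0.9, by = 0.1)      # deciles
  } else {
    seq(from = 0.05, to = 0.95, by = 0.05)   # every 5th percentile
  }
}

# Simulated data standing in for the variable of interest in the two sources
set.seed(1)
x.A <- rnorm(120, mean = 40, sd = 10)
x.B <- rnorm(300, mean = 42, sd = 12)

# min(120, 300) = 120, so quintiles are selected
probs <- choose_probs(length(x.A), length(x.B))
q.A <- quantile(x.A, probs = probs)
q.B <- quantile(x.B, probs = probs)

# Average absolute and average squared difference between the quantiles
avg.abs.diff <- mean(abs(q.A - q.B))
avg.sq.diff  <- mean((q.A - q.B)^2)
```

With survey weights, the unweighted `quantile()` calls would be replaced by weighted estimates, which the function obtains from wtd.quantile in Hmisc.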
The function also estimates the dissimilarities between the estimated empirical cumulative distribution functions. The measures considered are the maximum of the absolute differences, the sum of the two maximum differences obtained by inverting the terms in the difference, and the average of the absolute differences. When weights are provided, they are used in estimating the empirical cumulative distribution functions. Note that when ref=TRUE the estimation of the density and of the empirical cumulative distribution function is guided by the data in data.B.
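As a rough illustration of the three dissimilarity measures, the following base R sketch compares two unweighted empirical cumulative distribution functions. The data are simulated, and evaluating both functions at the pooled observed values is an assumption made here; the function's own evaluation points (in particular with ref=TRUE) may differ.

```r
# Simulated data standing in for the variable of interest in the two sources
set.seed(2)
x.A <- rnorm(200, mean = 40, sd = 10)
x.B <- rnorm(250, mean = 42, sd = 12)

# Unweighted empirical cumulative distribution functions
F.A <- ecdf(x.A)
F.B <- ecdf(x.B)

# Assumption: compare the two ECDFs at the pooled observed values
grid <- sort(unique(c(x.A, x.B)))
d <- F.A(grid) - F.B(grid)

max.abs <- max(abs(d))        # maximum of the absolute differences
sum.max <- max(d) + max(-d)   # sum of the maximum differences in both directions
avg.abs <- mean(abs(d))       # average of the absolute differences
```

The first quantity is the two-sample Kolmogorov-Smirnov statistic, and the second has the form of a Kuiper-type statistic.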
The final output consists of the total variation distance, the overlap and the Hellinger distance, calculated on the categorized version of the variable. The breaks used to categorize the variable are set according to the Freedman-Diaconis rule (nclass); in this case, when ref=TRUE the IQR is estimated solely on data.B, whereas with ref=FALSE it is estimated by joining the two data sources.
When present, the weights are used in estimating the relative frequencies of the categorized variable.
For additional details on these distances, see comp.prop.
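The discretization step and the three distances can be sketched as follows, for the unweighted case with ref=FALSE. This is an illustration under stated assumptions rather than the function's actual code: the data are simulated, the number of classes comes from `grDevices::nclass.FD()` applied to the pooled data, and equal-width breaks over the pooled range are an assumption.

```r
# Simulated data standing in for the variable of interest in the two sources
set.seed(3)
x.A <- rnorm(200, mean = 40, sd = 10)
x.B <- rnorm(250, mean = 42, sd = 12)

# ref = FALSE case: the Freedman-Diaconis rule is applied to the joined data
pooled <- c(x.A, x.B)
k <- grDevices::nclass.FD(pooled)

# Assumption: equal-width classes spanning the pooled range
breaks <- seq(min(pooled), max(pooled), length.out = k + 1)

# Relative frequencies of the categorized variable in each source
f.A <- table(cut(x.A, breaks = breaks, include.lowest = TRUE))
f.B <- table(cut(x.B, breaks = breaks, include.lowest = TRUE))
p.A <- as.numeric(f.A) / sum(f.A)
p.B <- as.numeric(f.B) / sum(f.B)

tvd       <- 0.5 * sum(abs(p.A - p.B))                  # total variation distance
overlap   <- 1 - tvd                                    # overlap between the distributions
hellinger <- sqrt(max(0, 1 - sum(sqrt(p.A * p.B))))     # Hellinger distance
```

With ref=TRUE the classes would instead be derived from data.B alone, and with survey weights the counts would be replaced by sums of weights within each class.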
Bellhouse D.R. and J. E. Stafford (1999). “Density Estimation from Complex Surveys”. Statistica Sinica, 9, 407-424.
plotCont, comp.prop
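# Load the package that provides comp.cont() and the example data sets
library(StatMatch)
data(samp.A)
data(samp.B)
# Unweighted comparison of the distribution of "age" in the two samples
comp.cont(data.A = samp.A, data.B = samp.B, xlab.A = "age")
# Same comparison, using the survey weights stored in the column "ww"
comp.cont(data.A = samp.A, data.B = samp.B, xlab.A = "age",
          w.A = "ww", w.B = "ww")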