
StatMatch (version 1.2.0)

Fbwidths.by.x: Computes the Frechet bounds of cells in a contingency table by considering all the possible subsets of the common variables.

Description

This function computes the bounds for the cell probabilities in the contingency table Y vs. Z, starting from the marginal tables (X vs. Y), (X vs. Z) and the joint distribution of the X variables, by considering all the possible subsets of the X variables. In this manner it is possible to identify which subset of the X variables produces the largest reduction of the uncertainty.
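As a rough illustration of what "all the possible subsets of the X variables" means, the base R sketch below enumerates every non-empty subset of three X variables and collapses the full X table onto each one. The data frame toy and the variables x1, x2, x3 are hypothetical, and this is not a description of how the package implements the search internally.

```r
# Toy data (hypothetical): three categorical X variables.
set.seed(1)
toy <- data.frame(
  x1 = factor(sample(c("a", "b"), 50, replace = TRUE), levels = c("a", "b")),
  x2 = factor(sample(c("u", "v", "w"), 50, replace = TRUE), levels = c("u", "v", "w")),
  x3 = factor(sample(c("lo", "hi"), 50, replace = TRUE), levels = c("lo", "hi"))
)
tab.x <- xtabs(~ x1 + x2 + x3, data = toy)
x.names <- names(dimnames(tab.x))

# All non-empty subsets of the X variables: 2^3 - 1 = 7 of them.
subsets <- unlist(
  lapply(seq_along(x.names), function(m) combn(x.names, m, simplify = FALSE)),
  recursive = FALSE
)

# Collapse the full table onto each subset by summing over the dropped variables.
collapsed <- lapply(subsets, function(v) margin.table(tab.x, match(v, x.names)))
names(collapsed) <- vapply(subsets, paste, character(1), collapse = "+")

# Number of cells in the X table for each subset.
sapply(collapsed, length)
```

The number of subsets doubles with each additional X variable, so the number of tables to examine grows quickly.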

Usage

Fbwidths.by.x(tab.x, tab.xy, tab.xz)

Arguments

tab.x
An R table crossing the X variables. This table must be obtained by using the function xtabs or table, e.g. tab.x <- xtabs(~x1+x2+x3, data=
tab.xy
An R table of the X variables vs. the Y variable. This table must be obtained by using the function xtabs or table, e.g. tab.xy <- xtabs(~x1+x2+x3+y, data=
tab.xz
An R table of the X variables vs. the Z variable. This table must be obtained by using the function xtabs or table, e.g. tab.xz <- xtabs(~x1+x2+x3+z, data=da

Value

  • A list with the estimated bounds for the cells in the table of Y vs. Z for each possible subset of the X variables. The final component, sum.unc, is a data.frame that summarizes the results for each subset of the X variables and reports measures of the uncertainty. In particular, the data frame reports the number of X variables ("x.vars"), the number of cells in the joint distribution of the X variables ("x.cells"), the number of those cells with frequency equal to 0 ("x.freq0"), the average width of the uncertainty intervals ("av.widths") and the estimated overall uncertainty ("ov.unc"), i.e. the estimated Delta.

Details

This function computes the Frechet bounds for the frequencies in the contingency table of Y vs. Z, starting from the conditional distributions P(Y|X) and P(Z|X) (for details see Frechet.bounds.cat), by considering all the possible subsets of the X variables. In this manner it is possible to identify the subset of the X variables, with the highest association with both Y and Z, that most reduces the uncertainty concerning the distribution of Y vs. Z.

The overall uncertainty is measured following the suggestion in Conti et al. (2012):

$$\hat{\Delta} = \sum_{i,j,k} ( p^{(up)}_{Y=j,Z=k} - p^{(low)}_{Y=j,Z=k} ) \times p_{Y=j|X=i} \times p_{Z=k|X=i} \times p_{X=i}$$

In addition, the average width of the bounds for the cells in the table of Y vs. Z is reported:

$$\bar{d} = \frac{1}{J \times K} \sum_{j,k} ( p^{(up)}_{Y=j,Z=k} - p^{(low)}_{Y=j,Z=k} )$$

where J and K are the numbers of categories of Y and Z, respectively.
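The two uncertainty measures above can be worked through by hand on a tiny case. The base R sketch below uses invented probabilities for a single binary X, a binary Y and a binary Z. It assumes that the bounds are obtained by computing the Frechet bounds within each category of X and then averaging them over the distribution of X (see Frechet.bounds.cat for the package's own definition), and it applies the formula for the overall uncertainty literally as written above. All object names are hypothetical.

```r
# Toy inputs (hypothetical numbers): one binary X, binary Y, binary Z.
p.x   <- c(x1 = 0.4, x2 = 0.6)                  # P(X = i)
p.y.x <- rbind(x1 = c(y1 = 0.8, y2 = 0.2),      # P(Y = j | X = i), rows sum to 1
               x2 = c(y1 = 0.3, y2 = 0.7))
p.z.x <- rbind(x1 = c(z1 = 0.5, z2 = 0.5),      # P(Z = k | X = i), rows sum to 1
               x2 = c(z1 = 0.9, z2 = 0.1))

J <- ncol(p.y.x)
K <- ncol(p.z.x)
low <- up <- matrix(0, J, K,
                    dimnames = list(colnames(p.y.x), colnames(p.z.x)))

# Assumption: Frechet bounds within each X category, averaged over P(X).
for (i in seq_along(p.x)) {
  s <- outer(p.y.x[i, ], p.z.x[i, ], "+")       # p(y|x) + p(z|x)
  m <- outer(p.y.x[i, ], p.z.x[i, ], pmin)      # min of p(y|x) and p(z|x)
  low <- low + p.x[[i]] * pmax(s - 1, 0)
  up  <- up  + p.x[[i]] * m
}

width <- up - low

# Average width of the J x K intervals (d-bar).
av.width <- sum(width) / (J * K)

# Overall uncertainty (Delta-hat): each width weighted by p(y|x) * p(z|x) * p(x).
ov.unc <- 0
for (i in seq_along(p.x)) {
  w <- outer(p.y.x[i, ], p.z.x[i, ]) * p.x[[i]]
  ov.unc <- ov.unc + sum(width * w)
}

low
up
c(av.width = av.width, ov.unc = ov.unc)
```

Because the weights in the overall uncertainty sum to 1, that measure is a weighted average of the interval widths, whereas the average width gives every cell the same weight.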

References

Ballin, M., D'Orazio, M., Di Zio, M., Scanu, M. and Torelli, N. (2009) Statistical Matching of Two Surveys with a Common Subset. Working Paper, 124. Dip. Scienze Economiche e Statistiche, Univ. di Trieste, Trieste.

Conti, P.L., Marella, D. and Scanu, M. (2012) Uncertainty Analysis in Statistical Matching. Journal of Official Statistics, 28, pp. 69-88.

D'Orazio, M., Di Zio, M. and Scanu, M. (2006) Statistical Matching: Theory and Practice. Wiley, Chichester.

See Also

Frechet.bounds.cat, harmonize.x

Examples

library(StatMatch) # provides Fbwidths.by.x()
data(quine, package="MASS") # loads quine from MASS
str(quine)
quine$c.Days <- cut(quine$Days, c(-1, seq(0,50,10),100))
table(quine$c.Days)


# split quine in two subsets
set.seed(4567)
lab.A <- sample(nrow(quine), 70, replace=TRUE)
quine.A <- quine[lab.A, 1:4]
quine.B <- quine[-lab.A, c(1:3,6)]

# compute the tables required by Fbwidths.by.x()
freq.x <- xtabs(~Eth+Sex+Age, data=quine.A)
freq.xy <- xtabs(~Eth+Sex+Age+Lrn, data=quine.A)
freq.xz <- xtabs(~Eth+Sex+Age+c.Days, data=quine.B)

# apply Fbwidths.by.x()
bounds.yz <- Fbwidths.by.x(tab.x=freq.x, tab.xy=freq.xy,
        tab.xz=freq.xz)

bounds.yz$sum.unc

# ordered according to "ov.unc"
bounds.yz$sum.unc[order(bounds.yz$sum.unc$ov.unc),]

# ordered according to average widths of intervals
bounds.yz$sum.unc[order(bounds.yz$sum.unc$av.width),]
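Once the summary is available, one possible way to choose a subset of the X variables is to discard the subsets whose X table contains empty cells and then take the one with the smallest overall uncertainty. The sketch below illustrates this rule on a hypothetical data frame that only mimics some of the columns of sum.unc; the numbers and row labels are invented, and the filtering rule is an assumption rather than a recommendation from the package.

```r
# Hypothetical summary table with the same column names as sum.unc;
# the values are invented for illustration only.
sum.unc <- data.frame(
  x.vars  = c(1, 1, 2, 3),
  x.cells = c(2, 4, 8, 16),
  x.freq0 = c(0, 0, 1, 5),
  ov.unc  = c(0.20, 0.18, 0.15, 0.12),
  row.names = c("s1", "s2", "s3", "s4")
)

# Keep only subsets whose X table has no empty cells (assumed rule) ...
ok <- sum.unc[sum.unc$x.freq0 == 0, ]

# ... and among them pick the one with the smallest overall uncertainty.
best <- ok[which.min(ok$ov.unc), ]
best
```

Larger subsets tend to show lower uncertainty but also sparser X tables, so x.freq0 is worth reading alongside ov.unc.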
