This function computes the bounds for the cell probabilities in the contingency table Y vs. Z starting from the marginal tables (X vs. Y), (X vs. Z) and the joint distribution of the X variables, by considering all the possible subsets of the X variables. In this way it is possible to identify which subset of the X variables produces the largest reduction in the average width of the conditional bounds.
Fbwidths.by.x(tab.x, tab.xy, tab.xz, deal.sparse="discard",
nA=NULL, nB=NULL, ...)
A list with the estimated bounds for the cells in the table of Y vs. Z for each possible subset of the X variables. The final component of the list, sum.unc, is a data.frame that summarizes the main results. In particular, it reports the number of X variables ("x.vars"), the number of cells in each of the input tables, and the number of cells with frequency equal to 0 (columns ending with freq0). It also reports the value ("av.n") of the rule used to decide whether the tables are sparse (see Details) and Cohen's effect size measured for the table crossing the considered combination of the X variables.
Finally, it provides the average width of the uncertainty intervals ("av.width"), the penalty terms g1 and g2 ("penalty1" and "penalty2", respectively), and the penalized average widths ("av.width.pen1" and "av.width.pen2", where av.width.pen1 = av.width + penalty1 and av.width.pen2 = av.width + penalty2).
tab.x: An R table crossing the X variables. This table must be obtained by using the function xtabs or table, e.g.
tab.x <- xtabs(~x1+x2+x3, data=data.all)
tab.xy: An R table of the X variables vs. the Y variable. This table must be obtained by using the function xtabs or table, e.g.
tab.xy <- xtabs(~x1+x2+x3+y, data=data.A)
Only a single categorical Y variable is allowed. One or more categorical variables can be considered as X variables (common variables). The same X variables in tab.x must be available in tab.xy. Moreover, it is assumed that the joint distribution of the X variables computed from tab.xy is equal to tab.x; a warning is produced if this is not true.
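The consistency assumption on the X variables can be checked by hand before calling the function. The following is a minimal sketch on made-up data (the data frame and its variables are hypothetical, and this is not necessarily how the function performs its own check): it marginalizes Y out of tab.xy and compares the resulting relative frequencies with those of tab.x.

```r
# Toy data, made up for illustration: two X variables and one Y variable
set.seed(1)
data.A <- data.frame(
  x1 = factor(sample(c("a", "b"), 50, replace = TRUE)),
  x2 = factor(sample(c("u", "v", "w"), 50, replace = TRUE)),
  y  = factor(sample(c("y1", "y2"), 50, replace = TRUE))
)
tab.x  <- xtabs(~ x1 + x2, data = data.A)
tab.xy <- xtabs(~ x1 + x2 + y, data = data.A)

# Keep only the dimensions of tab.xy that correspond to the X variables
x.dims <- which(names(dimnames(tab.xy)) %in% names(dimnames(tab.x)))

# Joint distribution of the X variables from each table
p.x.from.xy <- prop.table(margin.table(tab.xy, x.dims))
p.x         <- prop.table(tab.x)

# TRUE when the two estimates of the joint X distribution coincide
same.x <- isTRUE(all.equal(as.vector(p.x.from.xy), as.vector(p.x)))
same.x
```

Here both tables come from the same data, so the check succeeds; in a real application tab.x is often built from the pooled files, and the two distributions may differ.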
tab.xz: An R table of the X variables vs. the Z variable. This table must be obtained by using the function xtabs or table, e.g.
tab.xz <- xtabs(~x1+x2+x3+z, data=data.B)
Only a single categorical Z variable is allowed. One or more categorical variables can be considered as X variables (common variables). The same X variables in tab.x must be available in tab.xz. Moreover, it is assumed that the joint distribution of the X variables computed from tab.xz is equal to tab.x; a warning is produced if this is not true.
deal.sparse: Character string specifying how to estimate the cell relative frequencies when the tables are too sparse. When deal.sparse="discard" (default) no estimation is performed if tab.xy or tab.xz is too sparse. When deal.sparse="relfreq" the standard estimator (cell count divided by the sample size) is used.
Note that here sparseness is measured by the number of cells relative to the sample size; sparse tables are those where the number of cells exceeds the sample size (see Details).
nA: Integer, sample size of file A used to estimate tab.xy. If NULL, it is obtained as the sum of the frequencies in tab.xy.
nB: Integer, sample size of file B used to estimate tab.xz. If NULL, it is obtained as the sum of the frequencies in tab.xz.
...: Additional arguments that may be required when deriving an estimate of uncertainty by calling Frechet.bounds.cat.
Marcello D'Orazio mdo.statmatch@gmail.com
This function computes the Frechet bounds for the frequencies in the contingency table of Y vs. Z, starting from the conditional distributions P(Y|X) and P(Z|X) (for details see Frechet.bounds.cat), by considering all the possible subsets of the X variables. In this way it is possible to identify the subset of the X variables, with the highest association with both Y and Z, that reduces the uncertainty concerning the distribution of Y vs. Z.
The uncertainty is measured by the average of the widths of the bounds for the cells in the table Y vs. Z:
$$ \bar{d} = \frac{1}{J \times K} \sum_{j,k} ( p^{(up)}_{Y=j,Z=k} - p^{(low)}_{Y=j,Z=k} )$$
For details see Frechet.bounds.cat.
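The effect of conditioning on X can be illustrated with a small worked example. The sketch below assumes the usual form of the conditional Frechet bounds, where the bounds computed within each category of X are averaged with weights P(X); all probabilities are made up, and the code is an illustration of the formula for the average width, not the internal code of Frechet.bounds.cat.

```r
# Made-up probabilities: X has 2 categories, Y has 2 (J), Z has 3 (K)
p.x   <- c(0.4, 0.6)                                 # P(X = i)
p.y.x <- rbind(c(0.7, 0.3), c(0.2, 0.8))             # row i: P(Y = j | X = i)
p.z.x <- rbind(c(0.5, 0.3, 0.2), c(0.1, 0.3, 0.6))   # row i: P(Z = k | X = i)

J <- ncol(p.y.x)
K <- ncol(p.z.x)

# Conditional bounds: weighted average over X of the within-category bounds
low <- up <- matrix(0, J, K)
for (i in seq_along(p.x)) {
  s   <- outer(p.y.x[i, ], p.z.x[i, ], "+")   # P(y|x) + P(z|x)
  m   <- outer(p.y.x[i, ], p.z.x[i, ], pmin)  # min{P(y|x), P(z|x)}
  low <- low + p.x[i] * pmax(s - 1, 0)
  up  <- up  + p.x[i] * m
}
av.width.cond <- mean(up - low)   # average width over the J x K cells

# Unconditional bounds, using only the marginals P(Y) and P(Z)
p.y  <- colSums(p.x * p.y.x)
p.z  <- colSums(p.x * p.z.x)
low0 <- pmax(outer(p.y, p.z, "+") - 1, 0)
up0  <- outer(p.y, p.z, pmin)
av.width.unc <- mean(up0 - low0)

c(unconditional = av.width.unc, conditional = av.width.cond)
```

The conditional average width cannot exceed the unconditional one, which is why the function compares subsets of X variables in terms of this quantity.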
Because uncertainty, measured in terms of \(\bar{d}\), tends to decrease when conditioning on a larger number of X variables, two penalties are introduced to account for the additional number of cells to be estimated when adding an X variable. The first penalty, introduced in D'Orazio et al. (2017), is:
$$g_1 = \log\left( 1 + \frac{H_{D_m}}{H_{D_Q}} \right) $$
where \(H_{D_m}\) is the number of cells in the table obtained by crossing the given subset of X variables and \(H_{D_Q}\) is the number of cells in the table obtained by crossing all the available X variables. A second penalty takes into account the number of cells to estimate with respect to the sample size (D'Orazio et al., 2019). It is obtained as:
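As a small numerical illustration of the first penalty, the sketch below uses made-up numbers of categories for three hypothetical X variables and assumes that the logarithm is the natural one.

```r
# Made-up numbers of categories for three X variables (illustration only)
n.cat <- c(x1 = 2, x2 = 3, x3 = 4)

H.DQ <- prod(n.cat)                   # cells when crossing all the X variables
H.Dm <- prod(n.cat[c("x1", "x2")])    # cells when crossing the subset {x1, x2}

g1      <- log(1 + H.Dm / H.DQ)       # penalty for the subset (natural log assumed)
g1.full <- log(1 + H.DQ / H.DQ)       # penalty when all the X variables are used
c(subset = g1, all.x = g1.full)
```

The penalty grows with the number of cells of the chosen subset and reaches its maximum, log(2), when all the X variables are used.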
$$g_2 = \max \left[ \frac{1}{n_A - H_{D_m} \times J}, \frac{1}{n_B - H_{D_m} \times K} \right]$$
with \(n_A > H_{D_m} \times J\) and \(n_B > H_{D_m} \times K\). In practice, the penalty compares the number of cells to estimate with the sample size. The same criterion is also used to measure sparseness. In particular, for the purposes of this function, tables are NOT considered sparse when:
$$\min\left[ \frac{n_A}{H_{D_m} \times J}, \frac{n_B}{H_{D_m} \times K} \right] > 1 $$
This rule is applied when deciding how to proceed with estimation in the case of sparse tables (argument deal.sparse).
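The second penalty and the sparseness rule can be computed directly from the sample sizes and the table dimensions. The following sketch uses made-up values for all quantities; the object names (ratio, not.sparse) are hypothetical and chosen only for this illustration.

```r
# Made-up inputs (illustration only)
nA   <- 70   # sample size of file A
nB   <- 76   # sample size of file B
H.Dm <- 6    # cells obtained by crossing the chosen subset of X variables
J    <- 2    # categories of Y
K    <- 5    # categories of Z

# Sparseness rule: average number of observations per cell to estimate
ratio      <- min(nA / (H.Dm * J), nB / (H.Dm * K))
not.sparse <- ratio > 1

# Second penalty, defined only when both denominators are positive
g2 <- if (not.sparse) {
  max(1 / (nA - H.Dm * J), 1 / (nB - H.Dm * K))
} else {
  NA_real_
}
c(ratio = ratio, g2 = g2)
```

With these values the table X vs. Z drives both quantities, because it has more cells to estimate relative to its sample size.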
Note that sparseness can be measured in different ways. The output also includes the number of empty cells in each table (due to statistical zeros or structural zeros) and Cohen's effect size with respect to the case of a uniform distribution of frequencies across cells (the value 1/no.of.cells in every cell):
$$\omega_{eq} = \sqrt{H \sum_{h=1}^{H} (\hat{p}_h - 1/H)^2 } $$
Values of \(\omega_{eq}\) jointly with \(n/H \leq 1\) usually indicate severe sparseness.
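The effect size above is straightforward to compute from a table of counts. The helper below, omega.eq(), is a hypothetical function written for this illustration (it is not part of the package), applied to two made-up tables.

```r
# Hypothetical helper: Cohen's effect size with respect to the uniform distribution
omega.eq <- function(tab) {
  p.hat <- as.vector(tab) / sum(tab)   # estimated cell relative frequencies
  H     <- length(p.hat)               # number of cells
  sqrt(H * sum((p.hat - 1 / H)^2))
}

# Made-up tables with H = 6 cells
tab.unif <- as.table(matrix(5, nrow = 2, ncol = 3))                     # equal counts
tab.conc <- as.table(matrix(c(30, 0, 0, 0, 0, 0), nrow = 2, ncol = 3))  # one non-empty cell

w.unif <- omega.eq(tab.unif)
w.conc <- omega.eq(tab.conc)
c(uniform = w.unif, concentrated = w.conc)
```

The measure is 0 when all the cells have the same frequency and increases as the observations concentrate in fewer cells, reaching sqrt(H - 1) when a single cell contains all of them.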
D'Orazio, M., Di Zio, M. and Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester.
D'Orazio, M., Di Zio, M. and Scanu, M. (2017). "The use of uncertainty to choose matching variables in statistical matching". International Journal of Approximate Reasoning, 90, pp. 433-440.
D'Orazio, M., Di Zio, M. and Scanu, M. (2019). "Auxiliary variable selection in a statistical matching problem". In Zhang, L.-C. and Chambers, R. L. (eds.) Analysis of Integrated Data, Chapman & Hall/CRC (Forthcoming).
Frechet.bounds.cat, harmonize.x
data(quine, package="MASS") #loads quine from MASS
str(quine)
quine$c.Days <- cut(quine$Days, c(-1, seq(0,50,10),100))
table(quine$c.Days)
# split quine in two subsets
suppressWarnings(RNGversion("3.5.0"))
set.seed(4567)
lab.A <- sample(nrow(quine), 70, replace=TRUE)
quine.A <- quine[lab.A, 1:4]
quine.B <- quine[-lab.A, c(1:3,6)]
# compute the tables required by Fbwidths.by.x()
freq.xA <- xtabs(~Eth+Sex+Age, data=quine.A)
freq.xB <- xtabs(~Eth+Sex+Age, data=quine.B)
freq.xy <- xtabs(~Eth+Sex+Age+Lrn, data=quine.A)
freq.xz <- xtabs(~Eth+Sex+Age+c.Days, data=quine.B)
# apply Fbwidths.by.x()
bounds.yz <- Fbwidths.by.x(tab.x=freq.xA+freq.xB, tab.xy=freq.xy,
                           tab.xz=freq.xz)
bounds.yz$sum.unc