Estimation of cell counts in contingency tables by means of the pseudo-Bayes estimator.
pBayes(x, method="m.ind", const=NULL)
A list object with three components. The first, info (as used in the Examples), is a vector with the sample size ("n"), the number of cells in x ("no.cells"), the average cell frequency ("av.cfr"), the number of cells with frequency equal to zero ("no.0s"), the const input argument, the chosen/estimated \(K\) ("K") and the relative size of \(K\), i.e. \(K/(n+K)\) ("rel.K"). The second component is a table with the same dimensions as x containing the prior values considered for the cell frequencies. The third is a table with the same dimensions as x providing the pseudo-Bayes estimates of the cell frequencies in x.
A contingency table with observed cell counts, typically the output of table or xtabs. More generally, an R array with the counts.
The method for estimating the final cell frequencies. The following options are available:

method = "Jeffreys", adds 0.5 to each cell before estimation of the relative frequencies.

method = "minimax", adds \(\sqrt{n}/c\) to each cell before estimation of the relative frequencies, \(n\) being the sum of all the counts and \(c\) the number of cells in the table.

method = "invcat", adds \(1/c\) to each cell before estimation of the relative frequencies.

method = "user", adds a user-defined constant \(a\) (\(a>0\)) to each cell before estimation of the relative frequencies. The constant \(a\) should be passed via the argument const.

method = "m.ind", the prior guess for the unknown cell probabilities is obtained from the estimated probabilities under the mutual independence hypothesis. This option is available only for two-way or higher-dimensional contingency tables (length(dim(x)) >= 2).

method = "h.assoc", the prior guess for the unknown cell probabilities is obtained from the estimated probabilities under the homogeneous association hypothesis. This option is also available only for two-way or higher-dimensional contingency tables (length(dim(x)) >= 2).
Numeric value, a user-defined constant \(a\) (\(a>0\)) to be added to each cell before estimation of the relative frequencies when method = "user". As a general rule of thumb, it is preferable that the sum of the constant over all the cells does not exceed \(0.20 \times n\).
Marcello D'Orazio mdo.statmatch@gmail.com
This function estimates the frequencies in a contingency table using the pseudo-Bayes approach. In practice, the estimator is a weighted average of the input (observed) cell counts \(n_h\) and a suitable prior guess, \(\gamma_h\), for the cell probabilities:
$$\tilde{p}_h = \frac{n}{n+K} \hat{p}_h + \frac{K}{n+K} \gamma_h $$
\(K\) depends on the parameters of the Dirichlet prior distribution being considered (for details see Chapter 12 in Bishop et al., 1974).
It is worth noting that with a constant prior guess \(\gamma_h=1/c\) (\(h=1,2,\cdots, c\)), \(K=1\) corresponds in practice to adding \(1/c\) to each cell before estimation of the relative frequencies (method = "invcat"); \(K=c/2\) when the constant 0.5 is added to each cell (method = "Jeffreys"); finally, \(K=\sqrt{n}\) when the quantity \(\sqrt{n}/c\) is added to each cell (method = "minimax"). All these cases correspond to adding a flattening constant; the higher the value of \(K\), the more the estimates are shrunk towards \(\gamma_h=1/c\) (flattening).
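As an informal check (a base-R sketch, not the package's internal code), the equivalence between adding a flattening constant \(a\) to each of the \(c\) cells and the weighted-average form with \(K = a \times c\) and \(\gamma_h = 1/c\) can be verified on a toy vector of counts:

```r
# Toy counts; the values are made up for illustration.
x <- c(10, 0, 3, 7)          # observed cell counts
n <- sum(x)                  # sample size
c.cells <- length(x)         # number of cells c

a <- 0.5                     # Jeffreys: add 0.5 to each cell, i.e. K = c/2
p.flat <- (x + a) / (n + a * c.cells)   # relative frequencies after flattening

K <- a * c.cells
gamma <- rep(1 / c.cells, c.cells)      # constant prior guess 1/c
p.wavg <- n / (n + K) * (x / n) + K / (n + K) * gamma

all.equal(p.flat, p.wavg)    # TRUE: the two forms coincide
```

The same check works for "minimax" (a = sqrt(n)/c, hence K = sqrt(n)) and "invcat" (a = 1/c, hence K = 1).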
When method = "m.ind", the prior guess \(\gamma_h\) is estimated under the hypothesis of mutual independence between the variables crossed in the initial contingency table x, assumed to be at least a two-way table. In this case the value of \(K\) is estimated via a data-driven approach by considering
$$ \hat{K} = \frac{1 - \sum_{h} \hat{p}_h^2}{\sum_{h} \left( \hat{\gamma}_h - \hat{p}_h \right)^2 } $$
On the contrary, when method = "h.assoc", the prior guess \(\gamma_h\) is estimated under the hypothesis of homogeneous association between the variables crossed in the initial contingency table x.
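For a two-way table, the data-driven choice of \(\hat{K}\) under mutual independence can be sketched in base R as follows (a minimal illustration of the formula above, with made-up data; it does not reproduce the package internals):

```r
# Toy two-way table of counts
x <- matrix(c(20, 5, 10, 15), nrow = 2)
n <- sum(x)
p.hat <- x / n                          # observed cell proportions

# prior guess under mutual independence: product of the two margins
gamma.hat <- outer(rowSums(p.hat), colSums(p.hat))

# data-driven K from the formula above
K.hat <- (1 - sum(p.hat^2)) / sum((gamma.hat - p.hat)^2)

# pseudo-Bayes estimates of the cell probabilities
p.tilde <- n / (n + K.hat) * p.hat + K.hat / (n + K.hat) * gamma.hat
sum(p.tilde)                            # still a proper probability distribution
```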
Please note that when the input table is estimated from sample data in which a weight is assigned to each unit, the weights should be used in estimating the input table; however, it is suggested to rescale them so that they sum to \(n\), the sample size.
Bishop, Y.M.M., Fienberg, S.E., Holland, P.W. (1974). Discrete Multivariate Analysis: Theory and Practice. Cambridge, MA: MIT Press.
data(samp.A, package="StatMatch")
tab <- xtabs(~ area5 + urb + c.age + sex + edu7, data = samp.A)
out.pb <- pBayes(x=tab, method="m.ind")
out.pb$info
out.pb <- pBayes(x=tab, method="h.assoc")
out.pb$info
out.pb <- pBayes(x=tab, method="Jeffreys")
out.pb$info
# usage of weights in estimating the input table
n <- nrow(samp.A)
r.w <- samp.A$ww / sum(samp.A$ww) * n # rescale weights to sum up to n
tab.w <- xtabs(r.w ~ area5 + urb + c.age + sex + edu7, data = samp.A)
out.pbw <- pBayes(x=tab.w, method="m.ind")
out.pbw$info