prop.cint: Confidence interval for proportion based on frequency counts (corpora)

Description

This function computes a confidence interval for a population proportion from the corresponding frequency count in a sample. It uses either the Clopper-Pearson method (inverted exact binomial test) or the Wilson score method (inversion of a z-score test, with or without continuity correction).

Usage

prop.cint(k, n, method = c("binomial", "z.score"), correct = TRUE,
          conf.level = 0.95, alternative = c("two.sided", "less", "greater"))

Arguments

k

frequency of a type in the corpus (or an integer vector of frequencies)

n

number of tokens in the corpus, i.e. sample size (or an integer vector specifying the sizes of different samples)

method

a character string specifying whether to compute a Clopper-Pearson confidence interval (binomial) or a Wilson score interval (z.score)

correct

if TRUE, apply Yates' continuity correction for the z-score test (default)

conf.level

the desired confidence level (defaults to 95%)

alternative

a character string specifying the alternative hypothesis, yielding a two-sided (two.sided, default), lower one-sided (less) or upper one-sided (greater) confidence interval

Value

A data frame with two columns, labelled lower for the lower boundary and upper for the upper boundary of the confidence interval. The number of rows is determined by the length of the longest input vector (k, n and conf.level).
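As an illustration of the vectorised call and the shape of the result, the following sketch requests Clopper-Pearson intervals for two samples at once. The counts are made-up illustration values, and the call is guarded so that it only runs if the corpora package happens to be installed.

```r
# Hypothetical counts: a type occurring 19 times in 100 tokens
# and 190 times in 1000 tokens (illustration values only).
k <- c(19, 190)
n <- c(100, 1000)

if (requireNamespace("corpora", quietly = TRUE)) {
  # One row per (k, n) pair, with columns `lower` and `upper`
  ci <- corpora::prop.cint(k, n, method = "binomial", conf.level = 0.95)
  print(ci)
}
```

Both samples have the same relative frequency, so comparing the two rows shows how the interval narrows as the sample size grows.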

Details

The confidence intervals computed by this function correspond to those returned by binom.test and prop.test, respectively. However, prop.cint accepts vector arguments, allowing many confidence intervals to be computed with a single function call. In addition, it uses a fast approximation of the two-sided binomial test that can safely be applied to large samples.
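The correspondence with binom.test can be checked in base R. The sketch below is not the implementation used by prop.cint; it computes two-sided Clopper-Pearson limits for whole vectors from beta quantiles, which is the standard closed form also used by binom.test. The helper name clopper_pearson and the example counts are assumptions made for illustration.

```r
# Vectorised two-sided Clopper-Pearson interval via beta quantiles.
# `clopper_pearson` is a hypothetical helper, not part of any package.
clopper_pearson <- function(k, n, conf.level = 0.95) {
  tail <- (1 - conf.level) / 2  # probability mass cut off in each tail
  lower <- ifelse(k == 0, 0, qbeta(tail, k, n - k + 1))
  upper <- ifelse(k == n, 1, qbeta(1 - tail, k + 1, n - k))
  data.frame(lower = lower, upper = upper)
}

cp <- clopper_pearson(c(19, 190), c(100, 1000))
print(cp)

# binom.test handles one sample per call; compare with the first row
print(binom.test(19, 100)$conf.int)
```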

The confidence interval for a z-score test is computed by solving the z-score equation $$ \frac{k - np}{\sqrt{n p (1-p)}} = \alpha $$ for $p$, where $\alpha$ is the $z$-value corresponding to the chosen confidence level (e.g. $\pm 1.96$ for a two-sided test with 95% confidence). This leads to the quadratic equation $$ p^2 (n + \alpha^2) + p (-2k - \alpha^2) + \frac{k^2}{n} = 0 $$ whose two solutions correspond to the lower and upper boundary of the confidence interval.
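The quadratic equation can be solved directly with the usual root formula. The following is a minimal sketch in base R, not the code used inside prop.cint; the helper name wilson_interval and the example counts are assumptions. The variable z plays the role of $\alpha$ in the text.

```r
# Two-sided Wilson score interval without continuity correction,
# obtained from the roots of a*p^2 + b*p + c0 = 0.
# `wilson_interval` is a hypothetical helper for illustration.
wilson_interval <- function(k, n, conf.level = 0.95) {
  z <- qnorm((1 + conf.level) / 2)  # z-value, called alpha in the text
  a <- n + z^2
  b <- -2 * k - z^2
  c0 <- k^2 / n
  disc <- sqrt(b^2 - 4 * a * c0)
  data.frame(lower = (-b - disc) / (2 * a),
             upper = (-b + disc) / (2 * a))
}

w <- wilson_interval(c(19, 190), c(100, 1000))
print(w)

# Should agree with the uncorrected interval reported by prop.test
print(prop.test(19, 100, correct = FALSE)$conf.int)
```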

When Yates' continuity correction is applied, the value $k$ in the numerator of the $z$-score equation has to be replaced by $k^*$, with $k^* = k - 1/2$ for the lower boundary of the confidence interval (where $k > np$) and $k^* = k + 1/2$ for the upper boundary of the confidence interval (where $k < np$). In each case, the corresponding solution of the quadratic equation has to be chosen (i.e., the solution with $k > np$ for the lower boundary and vice versa).
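The continuity-corrected boundaries can be sketched in the same way by solving the quadratic equation once with $k^* = k - 1/2$ (taking the smaller root) and once with $k^* = k + 1/2$ (taking the larger root). This is an illustration in base R under stated assumptions, not the implementation of prop.cint: the helper name wilson_yates is hypothetical, and setting the boundary to 0 or 1 when $k^*$ falls outside the range from 0 to $n$ is a choice made here to mirror the behaviour of prop.test.

```r
# Two-sided Wilson score interval with Yates' continuity correction.
# `wilson_yates` is a hypothetical helper for illustration.
wilson_yates <- function(k, n, conf.level = 0.95) {
  z <- qnorm((1 + conf.level) / 2)  # z-value, called alpha in the text
  # Root of the quadratic with k replaced by k_star;
  # sign = -1 selects the smaller root, sign = +1 the larger one.
  root <- function(k_star, sign) {
    a <- n + z^2
    b <- -2 * k_star - z^2
    c0 <- k_star^2 / n
    (-b + sign * sqrt(b^2 - 4 * a * c0)) / (2 * a)
  }
  # Assumption: clamp to 0 or 1 when the corrected count leaves (0, n)
  lower <- ifelse(k - 0.5 <= 0, 0, root(k - 0.5, -1))
  upper <- ifelse(k + 0.5 >= n, 1, root(k + 0.5, +1))
  data.frame(lower = lower, upper = upper)
}

wy <- wilson_yates(c(19, 190), c(100, 1000))
print(wy)

# prop.test applies the continuity correction by default
print(prop.test(19, 100)$conf.int)
```

The corrected interval is always at least as wide as the uncorrected one, because the lower boundary is computed from a smaller count and the upper boundary from a larger one.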

References

http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval