# NOT RUN {
## Compare productivity measures across the full Brown corpus and its
## fiction / non-fiction subsets (one labelled row per spectrum).
print(rbind(
  AllTexts   = productivity.measures(Brown.spc),
  Fiction    = productivity.measures(BrownImag.spc),
  NonFiction = productivity.measures(BrownInform.spc)
))
## The same measures can be computed from a token vector, a
## type-frequency list, or a frequency spectrum -- results must agree.
bar.vec <- EvertLuedeling2001$bar
from.tokens <- productivity.measures(bar.vec)               # raw token vector
from.tfl    <- productivity.measures(vec2tfl(bar.vec))      # type-frequency list
from.spc    <- productivity.measures(vec2spc(bar.vec))      # frequency spectrum
print(rbind(tokens=from.tokens, tfl=from.tfl, spc=from.spc))
# }
# NOT RUN {
## sample-size dependency of productivity measures in Brown corpus
## (note that only a subset of the measures can be computed)
sample.sizes <- c(10e3, 50e3, 100e3, 200e3, 500e3, 1e6)
keep <- N(Brown.emp.vgc) %in% sample.sizes  # rows of the empirical VGC to retain
my.vgc <- vgc(
  N  = N(Brown.emp.vgc)[keep],
  V  = V(Brown.emp.vgc)[keep],
  Vm = list(Vm(Brown.emp.vgc, 1)[keep])  # hapax counts (m = 1)
)
print(my.vgc) # since we don't have a subset method for VGCs yet
productivity.measures(my.vgc)
productivity.measures(my.vgc, measures=c("TTR", "P")) # selected measures
## parametric bootstrapping to obtain sampling distribution of measures
## (much easier with ?lnre.productivity.measures)
model <- lnre("zm", spc=ItaRi.spc) # realistic LNRE model
## resample 1M-token corpora from the model, computing the productivity
## measures directly on each replicate (ESTIMATOR is a no-op here)
res <- lnre.bootstrap(
  model, 1e6,
  ESTIMATOR = identity,
  STATISTIC = productivity.measures
)
bootstrap.confint(res, method="normal")
# }
# Run the code above in your browser using DataLab