A TAB-delimited text file with column headers but no row names
(suitable for reading with read.delim). The file must contain
at least the following two columns:
N
    increasing integer vector of sample sizes \(N\)
V
    corresponding observed vocabulary sizes \(V(N)\)
    or expected vocabulary sizes \(E[V(N)]\)
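
As a minimal sketch, a table with just the two mandatory columns can be read as follows. The data values are invented for illustration, and the text argument stands in for an actual file on disk.

```r
# Invented example data: TAB-delimited, header row, no row names
lines <- c("N\tV", "1000\t412", "2000\t705", "3000\t954")

# read.delim defaults to sep = "\t" and header = TRUE
vgc <- read.delim(text = lines)

# The two mandatory columns must be present, and N must be increasing
stopifnot(all(c("N", "V") %in% names(vgc)))
stopifnot(!is.unsorted(vgc$N, strictly = TRUE))
```
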
Optionally, columns V1, …, V9 can be added to
specify the number of hapaxes (\(V_1(N)\)), dis legomena
(\(V_2(N)\)), and further spectrum elements up to \(V_9(N)\).
It is not necessary to include all 9 columns, but for any \(V_m(N)\)
in the data set, all "lower" spectrum elements \(V_{m'}(N)\) (for
\(m' < m\)) must also be present. For example, it is valid to have
columns V1 V2 V3, but not V1 V3 V5 or V2 V3 V4.
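
The rule that spectrum columns must form an unbroken run starting at V1 can be sketched as a small check on the column names. The helper valid_spectrum_columns is hypothetical and written only for illustration; it is not a function of any package.

```r
# Hypothetical helper: TRUE if the spectrum columns among `cols`
# are exactly V1, ..., Vk for some k between 0 and 9
valid_spectrum_columns <- function(cols) {
  m <- sort(as.integer(sub("^V", "", grep("^V[1-9]$", cols, value = TRUE))))
  identical(m, seq_len(length(m)))
}

valid_spectrum_columns(c("N", "V", "V1", "V2", "V3"))  # TRUE
valid_spectrum_columns(c("N", "V", "V1", "V3", "V5"))  # FALSE
valid_spectrum_columns(c("N", "V", "V2", "V3", "V4"))  # FALSE
```
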
Variances for expected vocabulary sizes and spectrum elements can be
given in further columns VV (for \(\operatorname{Var}[V(N)]\)) and
VV1, …, VV9 (for \(\operatorname{Var}[V_m(N)]\)). If variances are
given, VV is mandatory, and columns VVm must be specified
for exactly the same frequency classes m as the Vm columns above.
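
The consistency requirement for variance columns can be sketched in the same way. The helper valid_variance_columns is hypothetical; it assumes that a file without VV must not contain any VVm columns, and that a file with VV needs a VVm column for every Vm column and no others.

```r
# Hypothetical helper: checks variance columns against spectrum columns.
# Variances are optional, but if VV is present, VVm must exist for
# exactly the same frequency classes m as Vm
valid_variance_columns <- function(cols) {
  vm  <- sub("^V",  "", grep("^V[1-9]$",  cols, value = TRUE))
  vvm <- sub("^VV", "", grep("^VV[1-9]$", cols, value = TRUE))
  if (!("VV" %in% cols)) return(length(vvm) == 0)
  setequal(vm, vvm)
}

valid_variance_columns(c("N", "V", "V1", "V2", "VV", "VV1", "VV2"))  # TRUE
valid_variance_columns(c("N", "V", "V1", "V2", "VV", "VV1"))         # FALSE
valid_variance_columns(c("N", "V", "V1", "VV1"))                     # FALSE
```
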
These columns may appear in any order in the text file. All other
columns will be silently ignored.