In this dataset, types are not words, but syntactic expansions,
i.e., sequences of syntactic categories that form NPs (in TigerNP)
or PPs (in TigerPP), according to the Tiger annotation scheme for
German. Thus, for example, among the expansion types in the TigerNP
dataset, we find ART_NN and ART_ADJA_NN, whereas among the PP
expansions in TigerPP we find APPR_ART_NN and APPR_NN (APPR is the
tag for prepositions in the Tiger tagset).
The Tiger treebank contains about 900,000 tokens (50,000 sentences)
of German newspaper text from the Frankfurter Rundschau. The token
frequencies of the expansion types are taken from this corpus.
TigerNP.tfl and TigerPP.tfl are the type frequency lists.
TigerNP.spc and TigerPP.spc are frequency spectra.
TigerNP.emp.vgc and TigerPP.emp.vgc are the corresponding observed
vocabulary growth curves (tracking the development of V and V(1) in
the original order of occurrence of the expansion tokens in the
source corpus).
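
The relationship between the three kinds of objects can be illustrated with a minimal sketch. It is written in plain Python rather than with the package's own functions, and the token sequence is a hypothetical toy example invented for illustration, not data from the Tiger treebank. The names tokens, tfl, spc, vgc and growth_curve are likewise assumptions of this sketch.

```python
from collections import Counter

# Hypothetical toy sequence of NP expansion tokens in corpus order
# (invented for illustration; not taken from the Tiger treebank).
tokens = ["ART_NN", "ART_ADJA_NN", "ART_NN", "NN",
          "ART_NN", "ART_ADJA_NN", "PPER"]

# Type frequency list: expansion type -> token frequency.
tfl = Counter(tokens)

# Frequency spectrum: m -> V(m), the number of types occurring exactly m times.
spc = Counter(tfl.values())


def growth_curve(seq):
    """Return (N, V, V(1)) after each token, in order of occurrence."""
    counts = Counter()
    v = 0   # number of distinct types seen so far
    v1 = 0  # number of types seen exactly once so far
    rows = []
    for n, tok in enumerate(seq, start=1):
        counts[tok] += 1
        if counts[tok] == 1:
            # A new type: it enters the vocabulary as a hapax.
            v += 1
            v1 += 1
        elif counts[tok] == 2:
            # A former hapax has been seen a second time.
            v1 -= 1
        rows.append((n, v, v1))
    return rows


# Observed vocabulary growth curve for the toy sequence.
vgc = growth_curve(tokens)
```

In this sketch the type frequency list and the frequency spectrum summarize the whole sample, whereas the growth curve depends on the order of the tokens, which is why the observed curves are computed in the original order of occurrence.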