Input strings must be POS-disambiguated lemmas in CWB/Penn notation, i.e. in the form
<headword>_<P>, where <headword> is a dictionary headword (usually case-sensitive) and <P> is
a one-letter code specifying the simple part of speech. Standard POS codes are:
N ... nouns and proper nouns
V ... lexical and auxiliary verbs
J ... adjectives
R ... adverbs
For other parts of speech, the first character of the corresponding Penn tag may be used.
Note that these additional codes are not standardised and are only useful for distinguishing
between content words and function words.
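The input format can be illustrated with a minimal Python sketch. It is not the implementation behind this documentation: the names parse_lemma, is_content_word and CONTENT_POS are hypothetical, and the sketch assumes that the POS code is the single character after the last underscore, so that headwords containing underscores are still accepted.

```python
import re

# Assumption: <headword> is any non-empty string (it may contain underscores)
# and <P> is exactly one non-whitespace character after the LAST underscore.
_LEMMA_RE = re.compile(r"(?P<headword>.+)_(?P<pos>\S)")

# The four standard POS codes for content words listed above.
CONTENT_POS = {"N", "V", "J", "R"}


def parse_lemma(lemma):
    """Split a CWB/Penn lemma into (headword, pos); raise ValueError if malformed."""
    m = _LEMMA_RE.fullmatch(lemma)
    if m is None:
        raise ValueError(f"not in <headword>_<P> format: {lemma!r}")
    return m.group("headword"), m.group("pos")


def is_content_word(lemma):
    """True if the POS code is one of the standard codes N, V, J, R."""
    _, pos = parse_lemma(lemma)
    return pos in CONTENT_POS
```

With these definitions, parse_lemma("light_N") yields the pair ("light", "N"), while a string without a one-letter POS suffix, such as "light", is rejected with a ValueError.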
The following output formats are supported:
CWB
returns input strings without modifications, but validates that they are in CWB/Penn format
BNC
BNC-style POS-disambiguated lemmas based on the simplified CLAWS tagset.
The headword part of the lemma is unconditionally converted to lowercase.
The standard POS codes listed above are translated into SUBST (nouns), VERB (verbs),
ADJ (adjectives) and ADV (adverbs).
Other POS codes have no direct CLAWS equivalents and are mapped to UNC (unclassified),
so the transformation should only be used for nouns, verbs, adjectives and adverbs.
DM
POS-disambiguated lemmas in the format used by Distributional Memory (Baroni & Lenci 2010),
viz. <headword>-<p> with POS code in lowercase and headword in its original capitalisation.
For example, light_N will be mapped to light-n.
HW
just the undisambiguated headword
HWLC
undisambiguated headword mapped to lowercase
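
The five output formats can be summarised in a short Python sketch. This is an illustration under stated assumptions rather than the actual implementation: the function name convert_lemma and the table BNC_POS are hypothetical, and the BNC output is assumed to join headword and POS with an underscore, which the description above does not specify.

```python
# Mapping of the standard one-letter POS codes to simplified CLAWS classes.
BNC_POS = {"N": "SUBST", "V": "VERB", "J": "ADJ", "R": "ADV"}


def convert_lemma(lemma, fmt="CWB"):
    """Convert a <headword>_<P> lemma to one of CWB, BNC, DM, HW, HWLC."""
    # Split at the LAST underscore; the POS code must be a single character.
    headword, sep, pos = lemma.rpartition("_")
    if not sep or not headword or len(pos) != 1:
        raise ValueError(f"not in <headword>_<P> format: {lemma!r}")

    if fmt == "CWB":
        return lemma  # validated above, returned unchanged
    if fmt == "BNC":
        # Assumption: underscore separator; unknown POS codes become UNC.
        return headword.lower() + "_" + BNC_POS.get(pos, "UNC")
    if fmt == "DM":
        return headword + "-" + pos.lower()
    if fmt == "HW":
        return headword
    if fmt == "HWLC":
        return headword.lower()
    raise ValueError(f"unknown output format: {fmt!r}")
```

Under these assumptions, convert_lemma("light_N", "DM") gives light-n as in the example above, and a function word such as the_D is mapped to the UNC class in BNC format.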