Transform POS-disambiguated lemma strings in CWB/Penn format (see Details) into several other notation formats.
convert.lemma(lemma, format=c("CWB", "BNC", "DM", "HW", "HWLC"), hw.tolower=FALSE)
A character vector of the same length as lemma, containing the transformed lemmas.
See Details for the different output formats.
lemma
a character vector specifying one or more POS-disambiguated lemmas in CWB/Penn notation
format
the notation format to be generated (see Details)
hw.tolower
if TRUE, convert the headword part to lowercase, regardless of output format
Stephanie Evert (https://purl.org/stephanie.evert)
Input strings must be POS-disambiguated lemmas in CWB/Penn notation, i.e. in the form <headword>_<P>, where <headword> is a dictionary headword (which may be case-sensitive) and <P> is a one-letter code specifying the simple part of speech. Standard POS codes are:
N ... nouns
Z ... proper nouns
V ... lexical and auxiliary verbs
J ... adjectives
R ... adverbs
I ... prepositions (including all uses of "to")
D ... determiners
. ... punctuation
For other parts of speech, the first character of the corresponding Penn tag may be used. Note that these codes are not standardised and are only useful for distinguishing between content words and function words.
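The input notation described above can be illustrated with a small base R sketch that splits a lemma into its headword and POS code. The helper name split_lemma and the regular expression are assumptions made for illustration; they are not part of the package and may differ from the validation that convert.lemma actually performs.

```r
# Hypothetical helper (not part of wordspace): split CWB/Penn lemmas of the
# assumed form <headword>_<P> into headword and one-character POS code.
split_lemma <- function(lemma) {
  lemma <- as.character(lemma)
  ok <- grepl("^.+_.$", lemma)  # non-empty headword, underscore, one character
  if (!all(ok)) {
    stop("not in CWB/Penn format: ", paste(lemma[!ok], collapse = ", "))
  }
  data.frame(
    headword = sub("_.$", "", lemma),   # strip the trailing "_<P>"
    pos      = sub("^.+_", "", lemma),  # keep what follows the last underscore
    stringsAsFactors = FALSE
  )
}

parts <- split_lemma(c("light_N", "Paris_Z", "to_I", "._."))
print(parts)
```

Because the POS code is taken from the end of the string, headwords that themselves contain an underscore are still split at the last underscore in this sketch.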
The following output formats are supported:
CWB
returns input strings without modifications, but validates that they are in CWB/Penn format
BNC
BNC-style POS-disambiguated lemmas based on the simplified CLAWS tagset. The headword part of the lemma is unconditionally converted to lowercase. The standard POS codes listed above are translated into SUBST (nouns and proper nouns), VERB (verbs), ADJ (adjectives), ADV (adverbs), ART (determiners), PREP (prepositions), and STOP (punctuation). Other POS codes have no direct CLAWS equivalents and are mapped to UNC (unclassified), so the transformation should only be used for the categories listed above.
DM
POS-disambiguated lemmas in the format used by Distributional Memory (Baroni & Lenci 2010), viz. <headword>-<p> with POS code in lowercase and headword in its original capitalisation. For example, light_N will be mapped to light-n.
HW
just the undisambiguated headword
HWLC
undisambiguated headword mapped to lowercase (same as HW with hw.tolower=TRUE)
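The output formats listed above can be approximated in a few lines of base R, which may help when checking what a given conversion should produce. This is a minimal sketch, not the package's implementation: the names sketch_convert and bnc_map are hypothetical, the validation pattern is an assumption, and so is the use of an underscore between headword and tag in the BNC format.

```r
# Illustrative sketch only; sketch_convert and bnc_map are hypothetical names
# and not part of the wordspace package.
bnc_map <- c(N = "SUBST", Z = "SUBST", V = "VERB", J = "ADJ", R = "ADV",
             D = "ART", I = "PREP", "." = "STOP")

sketch_convert <- function(lemma, format = c("CWB", "BNC", "DM", "HW", "HWLC"),
                           hw.tolower = FALSE) {
  format <- match.arg(format)
  lemma <- as.character(lemma)
  stopifnot(all(grepl("^.+_.$", lemma)))  # assumed CWB/Penn pattern
  hw  <- sub("_.$", "", lemma)            # headword part
  pos <- sub("^.+_", "", lemma)           # one-character POS code
  # BNC and HWLC always lowercase the headword; hw.tolower forces it elsewhere
  if (hw.tolower || format %in% c("BNC", "HWLC")) hw <- tolower(hw)
  switch(format,
    CWB  = paste(hw, pos, sep = "_"),
    BNC  = {
      tag <- unname(bnc_map[pos])
      tag[is.na(tag)] <- "UNC"            # codes without a CLAWS equivalent
      paste(hw, tag, sep = "_")           # separator is an assumption
    },
    DM   = paste(hw, tolower(pos), sep = "-"),
    HW   = hw,
    HWLC = hw
  )
}

x <- c("light_N", "Paris_Z", "of_X")
print(sketch_convert(x, "BNC"))
print(sketch_convert(x, "DM"))
print(sketch_convert(x, "HW", hw.tolower = TRUE))
```

The lookup vector keeps the POS translation in one place, so unknown codes fall through to UNC without a separate branch for each tag.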
Baroni, Marco and Lenci, Alessandro (2010). Distributional Memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4), 673-712.
library(wordspace) # provides convert.lemma() and the RG65 data set
convert.lemma(RG65$word1, "CWB") # original format
convert.lemma(RG65$word1, "BNC") # BNC-style (simple CLAWS tags)
convert.lemma(RG65$word1, "DM") # as in Distributional Memory
convert.lemma(RG65$word1, "HW") # just the headword