convert.lemma: Transform CWB/Penn-Style Lemmas into Other Notation Formats (wordspace)

Description

Transform POS-disambiguated lemma strings in CWB/Penn format (see Details) into several other notation formats.

Usage

convert.lemma(lemma, format=c("CWB", "BNC", "DM", "HW", "HWLC"), hw.tolower=FALSE)

Value

A character vector of the same length as lemma, containing the transformed lemmas. See Details for the different output formats.

Arguments

lemma: a character vector specifying one or more POS-disambiguated lemmas in CWB/Penn notation
format: the notation format to be generated (see Details)
hw.tolower: if TRUE, convert the headword part to lowercase, regardless of output format

Author

Stephanie Evert (https://purl.org/stephanie.evert)

Details

Input strings must be POS-disambiguated lemmas in CWB/Penn notation, i.e. in the form


    <headword>_<P>

where <headword> is a dictionary headword (which may be case-sensitive) and <P> is a one-letter code specifying the simple part of speech. Standard POS codes are


    N ... nouns
    Z ... proper nouns
    V ... lexical and auxiliary verbs
    J ... adjectives
    R ... adverbs
    I ... prepositions (including all uses of "to")
    D ... determiners
    . ... punctuation

For other parts of speech, the first character of the corresponding Penn tag may be used. Note that these additional codes are not standardised and are only useful for distinguishing between content words and function words.
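
The input notation can be illustrated with a small base R sketch that splits lemma strings into headword and POS code. The helper split_lemma and its regular expression are assumptions made for illustration; they are not part of wordspace and may differ from the validation that convert.lemma actually performs.

```r
# Hypothetical helper (not part of wordspace): split CWB/Penn lemmas
# of the form <headword>_<P> into their two components.
split_lemma <- function(lemma) {
  lemma <- as.character(lemma)
  # Assumed pattern: non-empty headword, underscore, then a single
  # uppercase letter or "." as the POS code
  ok <- grepl("^.+_[[:upper:].]$", lemma)
  if (!all(ok)) {
    stop("not in <headword>_<P> form: ", paste(lemma[!ok], collapse = ", "))
  }
  data.frame(
    headword = sub("_.$", "", lemma),            # drop the trailing _<P>
    pos      = substring(lemma, nchar(lemma)),   # last character is the POS code
    stringsAsFactors = FALSE
  )
}

parts <- split_lemma(c("light_N", "Paris_Z", "to_I", "._."))
print(parts)
```

Because only the final two characters are treated as the POS suffix, headwords that themselves contain an underscore are still split correctly under this assumed pattern.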

The following output formats are supported:

CWB: returns input strings without modifications, but validates that they are in CWB/Penn format
BNC: BNC-style POS-disambiguated lemmas based on the simplified CLAWS tagset. The headword part of the lemma is unconditionally converted to lowercase. The standard POS codes listed above are translated into SUBST (nouns and proper nouns), VERB (verbs), ADJ (adjectives), ADV (adverbs), ART (determiners), PREP (prepositions), and STOP (punctuation). Other POS codes have no direct CLAWS equivalents and are mapped to UNC (unclassified), so the transformation should only be used for the categories listed above.
DM: POS-disambiguated lemmas in the format used by Distributional Memory (Baroni & Lenci 2010), viz. <headword>-<p> with POS code in lowercase and headword in its original capitalisation. For example, light_N will be mapped to light-n.
HW: just the undisambiguated headword
HWLC: undisambiguated headword mapped to lowercase (same as HW with hw.tolower=TRUE)
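
The POS translation for the BNC format and the string layout of the DM format can be sketched in base R as follows. The lookup table is built from the description above; the names penn_to_claws, claws_tag and to_dm are hypothetical and do not reflect the package's internal code. The sketch assumes well-formed input and performs no validation.

```r
# Illustrative lookup table, transcribed from the BNC description above
penn_to_claws <- c(N = "SUBST", Z = "SUBST", V = "VERB", J = "ADJ",
                   R = "ADV", D = "ART", I = "PREP", "." = "STOP")

# Hypothetical helper: translate one-letter POS codes into simplified
# CLAWS tags, falling back to UNC for codes without an equivalent
claws_tag <- function(pos) {
  tag <- unname(penn_to_claws[pos])
  tag[is.na(tag)] <- "UNC"
  tag
}

# Hypothetical helper: build DM-style strings <headword>-<p>,
# keeping the headword's capitalisation and lowercasing the POS code
to_dm <- function(lemma) {
  n <- nchar(lemma)
  paste(substr(lemma, 1, n - 2), tolower(substr(lemma, n, n)), sep = "-")
}

claws_tag(c("N", "Z", "I", ".", "C"))  # "SUBST" "SUBST" "PREP" "STOP" "UNC"
to_dm(c("light_N", "Paris_Z"))         # "light-n" "Paris-z"
```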

References

Baroni, Marco and Lenci, Alessandro (2010). Distributional Memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4), 673--712.

Examples



library(wordspace) # provides convert.lemma() and the RG65 data set

convert.lemma(RG65$word1, "CWB") # original format
convert.lemma(RG65$word1, "BNC") # BNC-style (simple CLAWS tags)
convert.lemma(RG65$word1, "DM")  # as in Distributional Memory
convert.lemma(RG65$word1, "HW")  # just the headword
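
The following additional calls use a small hand-made vector instead of RG65 to show the hw.tolower argument and the HWLC format. The vector lemmas is made up for illustration; the behaviour noted in the comments follows the descriptions in Details.

```r
library(wordspace)

# Made-up input in CWB/Penn notation
lemmas <- c("light_N", "Paris_Z", "run_V")

convert.lemma(lemmas, "DM")                     # light_N becomes light-n
convert.lemma(lemmas, "DM", hw.tolower = TRUE)  # headword lowercased as well
convert.lemma(lemmas, "HWLC")                   # lowercase headwords only

# HWLC is documented as equivalent to HW with hw.tolower=TRUE
same <- identical(convert.lemma(lemmas, "HWLC"),
                  convert.lemma(lemmas, "HW", hw.tolower = TRUE))
```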
