convert.lemma: Transform CWB/Penn-Style Lemmas into Other Notation Formats (wordspace)

Description

Transform POS-disambiguated lemma strings in CWB/Penn format (see Details) into several other notation formats.

Usage

convert.lemma(lemma, format=c("CWB", "BNC", "DM", "HW", "HWLC"))

Arguments

lemma

a character vector specifying one or more POS-disambiguated lemmas in CWB/Penn notation

format

the notation format to be generated (see Details)

Value

A character vector of the same length as lemma, containing the transformed lemmas. See Details for the different output formats.

Details

Input strings must be POS-disambiguated lemmas in CWB/Penn notation, i.e. in the form

    <headword>_<P>

where <headword> is a dictionary headword (usually case-sensitive) and <P> is a one-letter code specifying the simple part of speech. Standard POS codes are

    N ... nouns and proper nouns
    V ... lexical and auxiliary verbs
    J ... adjectives
    R ... adverbs

For other parts of speech, the first character of the corresponding Penn tag may be used. Note that these codes are not standardised and are only useful for distinguishing between content words and function words.
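
The input format described above can be checked with a simple regular expression. The following is a minimal sketch in base R, not the validation code used by the package. The helper name is.cwb.lemma is hypothetical, and the sketch assumes that the POS code is exactly one uppercase ASCII letter.

```r
# Hypothetical helper (not part of wordspace): tests for the form <headword>_<P>.
# Assumption: <headword> is non-empty and <P> is a single uppercase ASCII letter.
is.cwb.lemma <- function(lemma) {
  grepl("^.+_[A-Z]$", as.character(lemma))
}

# The last two strings fail: one has no POS code, the other a lowercase code
is.cwb.lemma(c("light_N", "walk_V", "light", "light_n"))
# [1]  TRUE  TRUE FALSE FALSE
```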

The following output formats are supported:

CWB: returns input strings without modifications, but validates that they are in CWB/Penn format
BNC: BNC-style POS-disambiguated lemmas based on the simplified CLAWS tagset. The headword part of the lemma is unconditionally converted to lowercase. The standard POS codes listed above are translated into SUBST (nouns), VERB (verbs), ADJ (adjectives) and ADV (adverbs). Other POS codes have no direct CLAWS equivalents and are mapped to UNC (unclassified), so the transformation should only be used for nouns, verbs, adjectives and adverbs.
DM: POS-disambiguated lemmas in the format used by Distributional Memory (Baroni & Lenci 2010), viz. <headword>-<p> with POS code in lowercase and headword in its original capitalisation. For example, light_N will be mapped to light-n.
HW: just the undisambiguated headword
HWLC: undisambiguated headword mapped to lowercase
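
The output formats listed above can be illustrated with a short base R sketch. It is not the implementation behind convert.lemma and performs no input validation. The function name sketch.convert is hypothetical, and the underscore separator in the BNC output is an assumption, since the description above does not specify it.

```r
# Illustrative sketch only; sketch.convert is a hypothetical name,
# not the wordspace implementation of convert.lemma.
sketch.convert <- function(lemma, format = c("CWB", "BNC", "DM", "HW", "HWLC")) {
  format <- match.arg(format)
  lemma <- as.character(lemma)
  # Split each lemma at the final underscore into headword and POS code
  hw  <- sub("_[^_]*$", "", lemma)
  pos <- sub("^.*_", "", lemma)
  switch(format,
    CWB  = lemma,
    BNC  = {
      # Standard codes map to simplified CLAWS tags; all other codes become UNC
      claws <- c(N = "SUBST", V = "VERB", J = "ADJ", R = "ADV")
      tag <- unname(claws[pos])
      tag[is.na(tag)] <- "UNC"
      # Assumption: headword and tag are joined by an underscore
      paste(tolower(hw), tag, sep = "_")
    },
    DM   = paste(hw, tolower(pos), sep = "-"),
    HW   = hw,
    HWLC = tolower(hw))
}

sketch.convert("light_N", "DM")    # "light-n"
sketch.convert("Paris_N", "HWLC")  # "paris"
```

Splitting at the final underscore keeps headwords that themselves contain underscores intact.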

References

Baroni, Marco and Lenci, Alessandro (2010). Distributional Memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4), 673–712.

Examples

# NOT RUN {
convert.lemma(RG65$word1, "CWB") # original format
convert.lemma(RG65$word1, "BNC") # BNC-style (simple CLAWS tags)
convert.lemma(RG65$word1, "DM")  # as in Distributional Memory
convert.lemma(RG65$word1, "HW")   # just the headword
convert.lemma(RG65$word1, "HWLC") # headword mapped to lowercase
# }