NLP (version 0.2-0)

CoNLLUTextDocument: CoNLL-U Text Documents

Description

Create text documents from CoNLL-U format files.

Usage

CoNLLUTextDocument(con, meta = list())

Arguments

con

a connection object or a character string. See scan() for details.

meta

a named or empty list of document metadata tag-value pairs.

Value

An object inheriting from "CoNLLUTextDocument" and "TextDocument".

Details

The CoNLL-U format (see http://universaldependencies.org/format.html) is a CoNLL-style format for annotated texts popularized and employed by the Universal Dependencies project (see http://universaldependencies.org/). For each “word” in the text, this provides exactly the 10 fields ID, FORM (word form or punctuation symbol), LEMMA (lemma or stem of word form), UPOSTAG (universal part-of-speech tag, see http://universaldependencies.org/u/pos/index.html), XPOSTAG (language-specific part-of-speech tag, may be unavailable), FEATS (list of morphological features), HEAD, DEPREL, DEPS, and MISC.

The lines with these fields and optional comments are read from the given connection and split into fields using scan(). The resulting fields are combined with consecutive sentence ids into a data frame representing the annotation information. This data frame, together with the given metadata, is returned as a CoNLL-U text document inheriting from classes "CoNLLUTextDocument" and "TextDocument".
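The reading step described above can be approximated in base R. The following is a minimal sketch, not the package's actual implementation: the sample lines are made up for illustration, the variable names (is_record, sent, ann) are hypothetical, and it assumes sentences are separated by exactly one blank line.

```r
# The ten CoNLL-U field names, in file order.
fields <- c("ID", "FORM", "LEMMA", "UPOSTAG", "XPOSTAG",
            "FEATS", "HEAD", "DEPREL", "DEPS", "MISC")

# Hand-written sample input (illustrative, not taken from a treebank).
lines <- c(
  "# text = Dogs bark.",
  "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_",
  "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\tSpaceAfter=No",
  "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_",
  "",
  "# text = Cats sleep.",
  "1\tCats\tcat\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_",
  "2\tsleep\tsleep\tVERB\tVBP\t_\t0\troot\t_\tSpaceAfter=No",
  "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_",
  ""
)

# Comment lines start with "#"; blank lines end a sentence.
is_comment <- startsWith(lines, "#")
is_blank <- lines == ""
is_record <- !is_comment & !is_blank

# Consecutive sentence ids: number of blank lines seen so far, plus one.
sent <- (cumsum(is_blank) + 1L)[is_record]

# Split the record lines into ten tab-separated character fields.
records <- scan(text = lines[is_record],
                what = setNames(rep(list(""), length(fields)), fields),
                sep = "\t", quote = "", quiet = TRUE)

# Combine sentence ids and fields into one annotation data frame.
ann <- data.frame(sent = sent, records, stringsAsFactors = FALSE)
print(ann[, c("sent", "ID", "FORM", "UPOSTAG")])
```

Setting quote = "" keeps quotation marks that occur as word forms from being interpreted as string delimiters by scan().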

The complete annotation information data frame can be extracted via content(). CoNLL-U v2 requires providing the complete text of each sentence (or a reconstruction thereof) in # text = comment lines. Where consistently provided, these are made available in the text attribute of the content data frame.

In addition, class "CoNLLUTextDocument" has methods for the generics as.character(), words(), sents(), tagged_words(), and tagged_sents(). These methods should be used to access the text in such text document objects.
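As an illustration of these accessors, here is a minimal sketch that writes a tiny CoNLL-U file and reads it back. It assumes package NLP is installed; the sample sentence, the temporary file, and the metadata tag "source" are made up for this example.

```r
library(NLP)

# Hand-written sample input (illustrative, not taken from a treebank).
lines <- c(
  "# text = Dogs bark.",
  "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_",
  "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\tSpaceAfter=No",
  "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_",
  ""
)
path <- tempfile(fileext = ".conllu")
writeLines(lines, path)

# Create the document from the file name, attaching some metadata.
doc <- CoNLLUTextDocument(path, meta = list(source = "toy example"))

# The full annotation data frame, and the sentence texts if provided.
ann <- content(doc)
str(ann)
attr(ann, "text")

# Accessors for the text itself.
as.character(doc)   # tokens
words(doc)          # words
sents(doc)          # words grouped by sentence
tagged_words(doc)   # words with their POS tags
tagged_sents(doc)   # tagged words grouped by sentence
```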

The CoNLL-U format allows representing both words and (multiword) tokens (see section ‘Words, Tokens and Empty Nodes’ in the format documentation), distinguished by ids being integers or integer ranges, with the words being annotated further. One can use as.character() to extract the tokens; all other viewers listed above use the words. Finally, the viewers incorporating POS tags take a which argument to specify using the universal or language-specific tags, by giving a substring of "UPOSTAG" (default) or "XPOSTAG".
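The distinction by id can be sketched in base R as follows. This only illustrates the id conventions of the CoNLL-U format and is not how the package classifies rows; the sample ids and variable names are hypothetical.

```r
# Sample ID values: a multiword token range, four words, and an empty node.
ids <- c("1-2", "1", "2", "3", "3.1", "4")

# Words have plain integer ids.
is_word <- grepl("^[0-9]+$", ids)

# Multiword tokens have integer ranges such as "1-2".
is_token_range <- grepl("^[0-9]+-[0-9]+$", ids)

# Empty nodes have decimal ids such as "3.1".
is_empty_node <- grepl("^[0-9]+\\.[0-9]+$", ids)

print(data.frame(id = ids, is_word, is_token_range, is_empty_node))
```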

See Also

TextDocument for basic information on the text document infrastructure employed by package NLP.

http://universaldependencies.org/ for access to the Universal Dependencies treebanks, which provide annotated texts in many different languages using CoNLL-U format.