Create text documents from CoNLL-U format files.
CoNLLUTextDocument(con, meta = list())
con: a connection object or a character string. See scan() for details.
meta: a named or empty list of document metadata tag-value pairs.
An object inheriting from "CoNLLUTextDocument" and "TextDocument".
The CoNLL-U format (see
http://universaldependencies.org/format.html)
is a CoNLL-style format for annotated texts popularized and employed
by the Universal Dependencies project
(see http://universaldependencies.org/).
For each “word” in the text, this provides exactly the 10 fields
ID, FORM (word form or punctuation symbol), LEMMA (lemma or stem of
word form), UPOSTAG (universal part-of-speech tag, see
http://universaldependencies.org/u/pos/index.html), XPOSTAG
(language-specific part-of-speech tag, may be unavailable), FEATS
(list of morphological features), HEAD, DEPREL, DEPS, and MISC.
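As a rough illustration of the field layout, the following base R sketch splits a single word line into its 10 tab-separated fields. The sample line is made up for this example, and the direct call to scan(text = ...) is only meant to mirror the description here; it is not taken from the package sources.

```r
# Field names in file order, as listed in the paragraph above.
field_names <- c("ID", "FORM", "LEMMA", "UPOSTAG", "XPOSTAG",
                 "FEATS", "HEAD", "DEPREL", "DEPS", "MISC")

# A made-up word line; "_" marks an unspecified field.
line <- paste(c("1", "Dogs", "dog", "NOUN", "NNS",
                "Number=Plur", "2", "nsubj", "_", "_"),
              collapse = "\t")

# Split on tabs only; quote = "" keeps quote characters literal.
fields <- scan(text = line, what = character(), sep = "\t",
               quote = "", quiet = TRUE)
names(fields) <- field_names

fields[["LEMMA"]]
```

Naming the vector makes it easy to pick out single fields such as LEMMA or HEAD without counting columns.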
The lines with these fields and optional comments are read from the
given connection and split into fields using scan(). The result is
combined with consecutive sentence ids into a data frame representing
the annotation information, which is returned together with the given
metadata as a CoNLL-U text document inheriting from classes
"CoNLLUTextDocument" and "TextDocument".
The complete annotation information data frame can be extracted via
content(). CoNLL-U v2 requires the complete text of each sentence (or
a reconstruction thereof) to be provided in # text = comment lines.
Where consistently provided, these texts are made available in the
text attribute of the content data frame.
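A minimal sketch of reading a file and extracting the annotation data frame might look as follows. It assumes the NLP package is installed; the three-word sentence, the helper word_line(), and the metadata entry are made up for illustration.

```r
library(NLP)  # assumption: the NLP package is installed

# Hypothetical helper: join the 10 fields of one word line with tabs.
word_line <- function(...) paste(c(...), collapse = "\t")

# A tiny made-up CoNLL-U v2 file: comment lines, three word lines,
# and a blank line that ends the sentence.
lines <- c(
    "# sent_id = 1",
    "# text = Dogs bark.",
    word_line("1", "Dogs", "dog", "NOUN", "NNS", "Number=Plur",
              "2", "nsubj", "_", "_"),
    word_line("2", "bark", "bark", "VERB", "VBP", "_",
              "0", "root", "_", "_"),
    word_line("3", ".", ".", "PUNCT", ".", "_",
              "2", "punct", "_", "_"),
    ""
)
path <- tempfile(fileext = ".conllu")
writeLines(lines, path)

# A character string is taken as a file name here.
doc <- CoNLLUTextDocument(path, meta = list(source = "made-up example"))

# The complete annotation information data frame.
ann <- content(doc)
str(ann)
```

Inspecting the result with str() shows which columns are available in the installed version of the package, which is safer than relying on column positions.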
In addition, there are methods for class "CoNLLUTextDocument" and the
generics as.character(), words(), sents(), tagged_words(), and
tagged_sents(), which should be used to access the text in such text
document objects.
The CoNLL-U format can represent both words and (multiword) tokens
(see section ‘Words, Tokens and Empty Nodes’ in the format
documentation), as distinguished by ids being integers or integer
ranges, with the words being annotated further. One can use
as.character() to extract the tokens; all other viewers listed above
use the words. Finally, the viewers incorporating POS tags take a
which argument to specify using the universal or language-specific
tags, by giving a substring of "UPOSTAG" (default) or "XPOSTAG".
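The viewers described above could be exercised as in the following sketch, which again assumes the NLP package is installed and uses a made-up sentence without multiword tokens. The helper word_line() is hypothetical, and no output is shown because the exact printed form may differ between package versions.

```r
library(NLP)  # assumption: the NLP package is installed

# Hypothetical helper: join the 10 fields of one word line with tabs.
word_line <- function(...) paste(c(...), collapse = "\t")

path <- tempfile(fileext = ".conllu")
writeLines(c(
    "# sent_id = 1",
    "# text = Dogs bark.",
    word_line("1", "Dogs", "dog", "NOUN", "NNS", "Number=Plur",
              "2", "nsubj", "_", "_"),
    word_line("2", "bark", "bark", "VERB", "VBP", "_",
              "0", "root", "_", "_"),
    word_line("3", ".", ".", "PUNCT", ".", "_",
              "2", "punct", "_", "_"),
    ""
), path)

doc <- CoNLLUTextDocument(path)

as.character(doc)                # tokens
w <- words(doc)                  # words
sents(doc)                       # words grouped by sentence
tagged_words(doc)                # universal tags (the default)
tagged_words(doc, which = "X")   # language-specific tags
tagged_sents(doc, which = "X")
```

Passing the short string "X" relies on the substring matching described above, so the call does not need to spell out the full field name.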
TextDocument for basic information on the text document
infrastructure employed by package NLP.
http://universaldependencies.org/ for access to the Universal Dependencies treebanks, which provide annotated texts in many different languages using CoNLL-U format.