Provide suitable “views” of the text contained in text documents.
words(x, ...)
sents(x, ...)
paras(x, ...)
tagged_words(x, ...)
tagged_sents(x, ...)
tagged_paras(x, ...)
chunked_sents(x, ...)
parsed_sents(x, ...)
parsed_paras(x, ...)
x: a text document object.
...: further arguments to be passed to or from methods.
For words(), a character vector with the word tokens in the document.
For sents(), a list of character vectors with the word tokens in each
sentence.
For paras(), a list of lists of character vectors with the word tokens
in each sentence, grouped according to the paragraphs.
For tagged_words(), a character vector with the POS tagged word tokens
in the document (i.e., the word tokens and their POS tags, separated
by /).
For tagged_sents(), a list of character vectors with the POS tagged
word tokens in each sentence.
For tagged_paras(), a list of lists of character vectors with the POS
tagged word tokens in each sentence, grouped according to the
paragraphs.
For chunked_sents(), a list of (flat) Tree objects giving the chunk
trees for each sentence in the document.
For parsed_sents(), a list of Tree objects giving the parse trees for
each sentence in the document.
For parsed_paras(), a list of lists of Tree objects giving the parse
trees for each sentence in the document, grouped according to the
paragraphs in the document.
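
As a rough illustration of these views, the sketch below builds a small
document with TaggedTextDocument() from package NLP and queries it. It
assumes that constructor's default conventions (whitespace-separated
tokens, one sentence per line, blank lines between paragraphs, tags
attached with /). The sample sentences and the temporary file are made up
for the example, and the exact class of the returned objects may differ
between NLP versions.

```r
library(NLP)

## Hypothetical sample text: two paragraphs, one sentence per line,
## tokens written as word/TAG.
tf <- tempfile()
writeLines(c("The/DT cat/NN sleeps/VBZ ./.",
             "It/PRP purrs/VBZ ./.",
             "",
             "Dogs/NNS bark/VBP ./."),
           tf)
doc <- TaggedTextDocument(tf)

w  <- words(doc)         # word tokens of the whole document
s  <- sents(doc)         # word tokens, one element per sentence
p  <- paras(doc)         # sentences grouped by paragraph
tw <- tagged_words(doc)  # word tokens together with their POS tags
ts <- tagged_sents(doc)  # tagged tokens, one element per sentence
tp <- tagged_paras(doc)  # tagged sentences grouped by paragraph

unlink(tf)               # remove the temporary file
```

The chunk and parse tree views are not shown because a document built
this way carries no chunk or parse information.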
Methods for extracting POS tagged word tokens (i.e., for generics
tagged_words(), tagged_sents() and tagged_paras()) can optionally
provide a mechanism for mapping the POS tags via a map argument. This
can be a function, a named character vector (with names and elements
the tags to map from and to, respectively), or a named list of such
named character vectors, with names corresponding to POS tagsets (see
Universal_POS_tags_map for an example). If a list is given, the map
used is the element whose name matches the POS tagset used (this
information is typically determined from the text document metadata;
see the help pages for text document extension classes implementing
this mechanism for details).
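
A minimal sketch of the first two forms of the map argument follows,
assuming that the tagged_words() method for documents created by
TaggedTextDocument() implements this mechanism. The sample sentence and
the coarse tag names in to_coarse are illustrative only and are not taken
from any tagset shipped with the package.

```r
library(NLP)

## Hypothetical one-sentence document, tokens written as word/TAG.
tf <- tempfile()
writeLines("The/DT cat/NN sleeps/VBZ ./.", tf)
doc <- TaggedTextDocument(tf)

## Form 1: a named character vector. Names are the tags to map from,
## elements are the tags to map to. Every tag in the sample is covered.
to_coarse <- c(DT = "DET", NN = "NOUN", VBZ = "VERB", "." = ".")
mapped_by_vector <- tagged_words(doc, map = to_coarse)

## Form 2: a function that receives the tags and returns the new tags.
mapped_by_function <- tagged_words(doc, map = function(tags) tolower(tags))

unlink(tf)  # remove the temporary file
```

The list form additionally needs the document to record which tagset it
uses, so it is left out of this sketch.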
See TextDocument for basic information on the text document
infrastructure employed by package NLP.