Provide suitable “views” of the text contained in text documents.
words(x, ...)
sents(x, ...)
paras(x, ...)
tagged_words(x, ...)
tagged_sents(x, ...)
tagged_paras(x, ...)
chunked_sents(x, ...)
parsed_sents(x, ...)
parsed_paras(x, ...)
x: a text document object.
...: further arguments to be passed to or from methods.
For words(), a character vector with the word tokens in the document.
For sents(), a list of character vectors with the word tokens in the
sentences.
For paras(), a list of lists of character vectors with the word tokens
in the sentences, grouped according to the paragraphs.
For tagged_words(), a character vector with the POS tagged word tokens
in the document (i.e., the word tokens and their POS tags, separated
by /).
For tagged_sents(), a list of character vectors with the POS tagged
word tokens in the sentences.
For tagged_paras(), a list of lists of character vectors with the POS
tagged word tokens in the sentences, grouped according to the
paragraphs.
For chunked_sents(), a list of (flat) Tree objects giving the chunk
trees for the sentences in the document.
For parsed_sents(), a list of Tree objects giving the parse trees for
the sentences in the document.
For parsed_paras(), a list of lists of Tree objects giving the parse
trees for the sentences in the document, grouped according to the
paragraphs in the document.
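As an illustration, the following sketch applies some of the viewers to a pre-annotated sample document. It assumes that package NLP is installed and that it ships this document as texts/stanford.rds; the file location is an assumption about the installed package and is not stated above. The code is skipped when NLP is unavailable, and only views whose annotations are expected to be present in that document are used.

```r
## Sketch only: assumes package NLP is installed and ships the sample
## document 'texts/stanford.rds' (an assumption, not stated above).
if (requireNamespace("NLP", quietly = TRUE)) {
    library("NLP")
    d <- readRDS(system.file("texts", "stanford.rds", package = "NLP"))

    w  <- words(d)          # word tokens of the whole document
    s  <- sents(d)          # one element per sentence
    tw <- tagged_words(d)   # word tokens together with their POS tags
    ps <- parsed_sents(d)   # one Tree object per sentence

    print(w)
    print(s)
    print(tw)
    print(ps)
}
```

Views that need annotations the document does not carry (for example, paragraph or chunk information) may fail, so it seems prudent to check which annotations are available before calling paras() or chunked_sents().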
Methods for extracting POS tagged word tokens (i.e., for generics
tagged_words(), tagged_sents() and tagged_paras()) can optionally
provide a mechanism for mapping the POS tags via a map argument. This
can be a function, a named character vector (whose names are the tags
to map from and whose elements are the tags to map to), or a named
list of such named character vectors, with names corresponding to POS
tagsets (see Universal_POS_tags_map for an example). If a list, the
map used will be the element with name matching the POS tagset used
(this information is typically determined from the text document
metadata; see the help pages for text document extension classes
implementing this mechanism for details).
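The three forms of map described above can be sketched in base R as follows. The helper apply_tag_map(), its tagset argument, the tagset name "en-ptb" and the example tag entries are all hypothetical and serve illustration only; this is not the code used by package NLP, and keeping unmapped tags unchanged is an assumption of the sketch.

```r
## Hypothetical helper (not part of NLP): resolve a 'map' argument and
## apply it to a character vector of POS tags.
apply_tag_map <- function(tags, map, tagset = NULL) {
    ## A list of maps is indexed by the name of the tagset in use.
    if (is.list(map)) {
        if (is.null(tagset) || is.null(map[[tagset]]))
            stop("no map available for the given tagset")
        map <- map[[tagset]]
    }
    ## A function is applied to the tags directly.
    if (is.function(map))
        return(map(tags))
    ## Otherwise a named character vector: names are the tags to map
    ## from, elements are the tags to map to.
    mapped <- unname(map[tags])
    ## Assumption of this sketch: tags without an entry stay unchanged.
    ifelse(is.na(mapped), tags, mapped)
}

## Illustrative entries only, not a complete tagset mapping.
penn_to_universal <- c(NNP = "NOUN", VBZ = "VERB", DT = "DET", JJ = "ADJ")
tags <- c("NNP", "VBZ", "DT", "JJ", "XYZ")

apply_tag_map(tags, penn_to_universal)                  # named vector
apply_tag_map(tags, tolower)                            # function
apply_tag_map(tags, list("en-ptb" = penn_to_universal), # list of maps
              tagset = "en-ptb")
```

In the sketch the tagset is passed explicitly; the actual methods typically determine it from the text document metadata instead.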
In addition to methods for the text document classes provided by
package NLP itself (see TextDocument), package NLP also provides word
tokens and POS tagged word tokens for the results of udpipe_annotate()
from package udpipe, spacy_parse() from package spacyr, and
cnlp_annotate() from package cleanNLP.
See TextDocument for basic information on the text document
infrastructure employed by package NLP.