Provide suitable “views” of the text contained in text documents.
words(x, ...)
sents(x, ...)
paras(x, ...)
tagged_words(x, ...)
tagged_sents(x, ...)
tagged_paras(x, ...)
chunked_sents(x, ...)
parsed_sents(x, ...)
parsed_paras(x, ...)
otoks(x, ...)
For words(), a character vector with the word tokens in the document.
For sents(), a list of character vectors with the word tokens in the sentences.
For paras(), a list of lists of character vectors with the word tokens in the sentences, grouped according to the paragraphs.
For tagged_words(), a character vector with the POS tagged word tokens in the document (i.e., the word tokens and their POS tags, separated by /).
For tagged_sents(), a list of character vectors with the POS tagged word tokens in the sentences.
For tagged_paras(), a list of lists of character vectors with the POS tagged word tokens in the sentences, grouped according to the paragraphs.
For chunked_sents(), a list of (flat) Tree objects giving the chunk trees for the sentences in the document.
For parsed_sents(), a list of Tree objects giving the parse trees for the sentences in the document.
For parsed_paras(), a list of lists of Tree objects giving the parse trees for the sentences in the document, grouped according to the paragraphs in the document.
For otoks(), a character vector with the orthographic word tokens in the document.
x: a text document object.
...: further arguments to be passed to or from methods.
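A minimal sketch of what these views return, using the CoNLL-U sample document shipped with package NLP (the same document used in the example at the end of this page); the exact tokens and tags of course depend on that file.
library("NLP")
d <- CoNLLUTextDocument(system.file("texts", "spanish.conllu",
                                    package = "NLP"))
## Flat character vector of word tokens:
words(d)
## List of character vectors with the word tokens, one per sentence:
sents(d)
## Word tokens and their POS tags, separated by "/":
tagged_words(d)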
Methods for extracting POS tagged word tokens (i.e., for generics tagged_words(), tagged_sents() and tagged_paras()) can optionally provide a mechanism for mapping the POS tags via a map argument. This can give a function, a named character vector (with names and elements the tags to map from and to, respectively), or a named list of such named character vectors, with names corresponding to POS tagsets (see Universal_POS_tags_map for an example). If a list, the map used will be the element with name matching the POS tagset used (this information is typically determined from the text document metadata; see the help pages for the text document extension classes implementing this mechanism for details).
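As an illustration, here is a sketch of the possible forms of a map; whether a given text document class actually honors the map argument depends on its methods, so the final call is shown only schematically for a hypothetical document d.
## A map as a named character vector: names are the tags to map from,
## elements the tags to map to.
map <- c(NN = "NOUN", NNS = "NOUN", VB = "VERB", VBD = "VERB")
## Universal_POS_tags_map shipped with NLP is a named list of such
## vectors, with names corresponding to POS tagsets:
names(Universal_POS_tags_map)
str(Universal_POS_tags_map[[1]])
## Schematic use with a document d whose tagged_words() method supports
## the map mechanism:
## tagged_words(d, map = Universal_POS_tags_map)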
Text document classes may provide support for representing both (syntactic) words (for which annotations can be provided) and orthographic (word) tokens, e.g., in Spanish dámelo = da me lo. For these, words() gives the syntactic word tokens and otoks() the orthographic word tokens (see the example at the end of this page). This is currently supported for CoNLL-U text documents (see https://universaldependencies.org/format.html for more information) and annotated plain text documents (via word features as used, for example, for some Stanford CoreNLP annotator pipelines provided by package StanfordCoreNLP, available from the repository at https://datacube.wu.ac.at).
In addition to methods for the text document classes provided by package NLP itself (see TextDocument), package NLP also provides word tokens and POS tagged word tokens for the results of udpipe_annotate() from package udpipe, spacy_parse() from package spacyr, and cnlp_annotate() from package cleanNLP.
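For example, word and POS tagged word tokens could be obtained from a udpipe annotation along the following lines (shown as a sketch only: it assumes that package udpipe and a previously downloaded UDPipe model are available, and the model file name is purely hypothetical).
## library("udpipe")
## m <- udpipe_load_model("english-ewt-ud-2.5-191206.udpipe")  # hypothetical file
## a <- udpipe_annotate(m, x = "This is a sentence. And another one.")
## words(a)
## tagged_words(a)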
See TextDocument for basic information on the text document infrastructure employed by package NLP.
## Example from https://universaldependencies.org/format.html:
d <- CoNLLUTextDocument(system.file("texts", "spanish.conllu",
package = "NLP"))
content(d)
## To extract the syntactic words:
words(d)
## To extract the orthographic word tokens:
otoks(d)