NLP (version 0.3-0)

viewers: Text Document Viewers

Description

Provide suitable “views” of the text contained in text documents.

Usage

words(x, ...)
sents(x, ...)
paras(x, ...)
tagged_words(x, ...)
tagged_sents(x, ...)
tagged_paras(x, ...)
chunked_sents(x, ...)
parsed_sents(x, ...)
parsed_paras(x, ...)

Value

For words(), a character vector with the word tokens in the document.

For sents(), a list of character vectors with the word tokens in the sentences.

For paras(), a list of lists of character vectors with the word tokens in the sentences, grouped according to the paragraphs.

For tagged_words(), a character vector with the POS tagged word tokens in the document (i.e., the word tokens and their POS tags, separated by /).

For tagged_sents(), a list of character vectors with the POS tagged word tokens in the sentences.

For tagged_paras(), a list of lists of character vectors with the POS tagged word tokens in the sentences, grouped according to the paragraphs.

For chunked_sents(), a list of (flat) Tree objects giving the chunk trees for the sentences in the document.

For parsed_sents(), a list of Tree objects giving the parse trees for the sentences in the document.

For parsed_paras(), a list of lists of Tree objects giving the parse trees for the sentences in the document, grouped according to the paragraphs in the document.

For otoks(), a character vector with the orthographic word tokens in the document.
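The nesting of these return values is easiest to see on a tiny document. The following is a minimal sketch, assuming the constructors String(), Annotation() and AnnotatedPlainTextDocument() from package NLP and a hand-annotated two-sentence text; the text and the character spans are made up for illustration and were counted by hand.

```r
library(NLP)

## A short text; the spans below were counted by hand for this string.
s <- String("  First sentence.  Second sentence.  ")

## Sentence annotations (ids 1-2) and word annotations (ids 3-6),
## each given as id, type, start and end positions.
a_sents <- Annotation(1:2, rep.int("sentence", 2L),
                      c(3L, 20L), c(17L, 35L))
a_words <- Annotation(3:6, rep.int("word", 4L),
                      c(3L, 9L, 20L, 27L), c(7L, 16L, 25L, 34L))

## Combine text and annotations into an annotated plain text document.
d <- AnnotatedPlainTextDocument(s, c(a_sents, a_words))

w  <- words(d)  # flat view: one character vector for the whole document
st <- sents(d)  # nested view: one character vector per sentence

str(w)
str(st)
```

paras() adds one more level of nesting in the same way, provided the document carries paragraph information.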

Arguments

x

a text document object.

...

further arguments to be passed to or from methods.

Details

Methods for extracting POS tagged word tokens (i.e., for generics tagged_words(), tagged_sents() and tagged_paras()) can optionally provide a mechanism for mapping the POS tags via a map argument. The map can be a function, a named character vector (with names and elements the tags to map from and to, respectively), or a named list of such named character vectors, with names corresponding to POS tagsets (see Universal_POS_tags_map for an example). If a list is given, the map used will be the element whose name matches the POS tagset used (this information is typically determined from the text document metadata; see the help pages for text document extension classes implementing this mechanism for details).
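The three accepted forms of map can be illustrated in base R. The sketch below is not the code used by package NLP: apply_tag_map() is a hypothetical helper written for this illustration, and the tags, the mapping and the tagset key "en-ptb" are assumed example values.

```r
## Hypothetical helper (not part of package NLP): resolve a `map`
## specification and apply it to a character vector of POS tags.
apply_tag_map <- function(tags, map, tagset = NULL) {
    ## Form 1: a function is applied to the tags directly.
    if (is.function(map))
        return(map(tags))
    ## Form 3: a named list of mappings, keyed by POS tagset name.
    if (is.list(map)) {
        if (is.null(tagset) || is.null(map[[tagset]]))
            stop("no mapping available for the given tagset")
        map <- map[[tagset]]
    }
    ## Form 2: a named character vector; names are the tags to map
    ## from, elements are the tags to map to.
    unname(map[tags])
}

tags <- c("DT", "NN", "VBZ")

## A named character vector (illustrative values only).
ptb_to_coarse <- c(DT = "DET", NN = "NOUN", VBZ = "VERB")
by_vector <- apply_tag_map(tags, ptb_to_coarse)

## A function.
by_function <- apply_tag_map(tags, tolower)

## A named list of named character vectors, selected by tagset name.
maps <- list("en-ptb" = ptb_to_coarse)
by_list <- apply_tag_map(tags, maps, tagset = "en-ptb")

print(by_vector)
print(by_function)
print(by_list)
```

With a document class whose methods support the mechanism, the same kinds of objects would be passed as tagged_words(x, map = ...); in that case the tagset is usually taken from the document metadata, not supplied by hand.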

Text document classes may provide support for representing both (syntactic) words (for which annotations can be provided) and orthographic (word) tokens, e.g., in Spanish dámelo = da me lo. For these, words() gives the syntactic word tokens, and otoks() the orthographic word tokens. This is currently supported for CoNLL-U text documents (see https://universaldependencies.org/format.html for more information) and annotated plain text documents (via word features as used, for example, for some Stanford CoreNLP annotator pipelines provided by package StanfordCoreNLP available from the repository at https://datacube.wu.ac.at).

In addition to methods for the text document classes provided by package NLP itself (see TextDocument), package NLP also provides methods giving word tokens and POS tagged word tokens for the results of udpipe_annotate() from package udpipe, spacy_parse() from package spacyr, and cnlp_annotate() from package cleanNLP.

See Also

TextDocument for basic information on the text document infrastructure employed by package NLP.

Examples

## Example: the Spanish CoNLL-U sample file shipped with package NLP.
d <- CoNLLUTextDocument(system.file("texts", "spanish.conllu",
                                    package = "NLP"))
content(d)
## To extract the syntactic words:
words(d)
## To extract the orthographic word tokens:
otoks(d)