cleanNLP (version 2.3.0)

cnlp_get_token: Access tokens from an annotation object

Description

This function grabs the table of tokens from an annotation object. There is exactly one row for each token found in the raw text. Tokens include words as well as punctuation marks. If include_root is set to TRUE, a token called ROOT is also added to each sentence; it is particularly useful when interacting with the table of dependencies.

Usage

cnlp_get_token(annotation, include_root = FALSE, combine = FALSE,
  remove_na = combine, spaces = FALSE)

Arguments

annotation

an annotation object

include_root

boolean. Should the sentence root be included? Set to FALSE by default.

combine

boolean. Should other tables (dependencies, sentences, and entities) be merged with the tokens? Set to FALSE by default.

remove_na

boolean. Should columns with only missing values be removed? This is mostly useful when working with the combine option, and by default is equal to whatever combine is set to.

spaces

boolean. Should a column be included that gives the number of spaces that follow each word? Useful for reconstructing the original text. Set to FALSE by default.
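
As a rough illustration of how such a column could be used to rebuild text, the sketch below works on a small hand-built data frame rather than on real annotator output. The column name `spaces` and the toy values are assumptions; check the column names in the table returned by your own call.

```r
# Toy token table; the `spaces` column name and its values are assumed
# for illustration and are not taken from real cnlp_get_token() output.
tokens <- data.frame(
  word   = c("Hello", ",", "world", "!"),
  spaces = c(0L, 1L, 0L, 0L),
  stringsAsFactors = FALSE
)

# Append the stated number of spaces to each word, then concatenate.
padding <- strrep(" ", tokens$spaces)
text <- paste0(tokens$word, padding, collapse = "")
print(text)
```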

Value

Returns an object of class c("tbl_df", "tbl", "data.frame") containing one row for every token in the corpus. When include_root is TRUE, the root of each sentence is included as its own token.

The returned data frame includes at a minimum the following columns. If remove_na is TRUE, only the first four columns are guaranteed to be in the output; the remaining ones depend on which annotators were run:

  • "id" - integer. Id of the source document.

  • "sid" - integer. Sentence id, starting from 0.

  • "tid" - integer. Token id, with the root of the sentence starting at 0.

  • "word" - character. Raw word in the input text.

  • "lemma" - character. Lemmatized form of the token.

  • "upos" - character. Universal part of speech code.

  • "pos" - character. Language-specific part of speech code; uses the Penn Treebank codes.

  • "cid" - integer. Character offset at the start of the word in the original document.
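
To make the effect of remove_na on this column list concrete, here is a minimal base R sketch on a hand-built table in which the lemma and part of speech columns are entirely NA, as if those annotators had not been run. It assumes remove_na drops columns that contain only missing values; it mimics that behavior and is not the package's implementation, and the toy values are made up.

```r
# Toy token table: lemma and upos are all NA, as if those annotators were not run.
tokens <- data.frame(
  id    = c(1L, 1L, 1L),
  sid   = c(0L, 0L, 0L),
  tid   = c(1L, 2L, 3L),
  word  = c("Dogs", "bark", "."),
  lemma = NA_character_,
  upos  = NA_character_,
  stringsAsFactors = FALSE
)

# Keep only the columns that have at least one non-missing value.
keep <- colSums(!is.na(tokens)) > 0
trimmed <- tokens[, keep, drop = FALSE]
print(names(trimmed))
```

In this toy case only the first four columns survive, which matches the guarantee described above.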

References

Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.

Kristina Toutanova and Christopher D. Manning. 2000. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger. In: Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), pp. 63-70.

Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In: Proceedings of HLT-NAACL 2003, pp. 252-259.

Examples

# NOT RUN {
data(obama)

# find average sentence length from each address
library(dplyr)
cnlp_get_token(obama) %>%
  group_by(id, sid) %>%
  summarize(sentence_length = max(tid)) %>%
  summarize(avg_sentence_length = mean(sentence_length))
# }
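
The example above relies on max(tid) equalling the number of tokens in a sentence, which holds if token ids within a sentence count up from 1 (with 0 reserved for the root). The base R sketch below checks that reasoning on a hand-built table; the values are invented and cleanNLP is not required to run it.

```r
# Toy token table: two sentences in document 1, one in document 2 (values invented).
tokens <- data.frame(
  id  = c(1L, 1L, 1L, 1L, 1L, 2L, 2L),
  sid = c(0L, 0L, 0L, 1L, 1L, 0L, 0L),
  tid = c(1L, 2L, 3L, 1L, 2L, 1L, 2L)
)

# Sentence length: largest token id within each (document, sentence) pair.
sentence_length <- aggregate(tid ~ id + sid, data = tokens, FUN = max)

# Average sentence length per document.
avg_length <- aggregate(tid ~ id, data = sentence_length, FUN = mean)
names(avg_length)[2] <- "avg_sentence_length"
print(avg_length)
```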
