This function grabs the table of tokens from an annotation object. There
is exactly one row for each token found in the raw text; tokens include
words as well as punctuation marks. If include_root is set to TRUE, a
token called ROOT is also added to each sentence; this is particularly
useful when interacting with the table of dependencies.
cnlp_get_token(annotation, include_root = FALSE, combine = FALSE,
remove_na = combine, spaces = FALSE)
annotation - an annotation object.
include_root - boolean. Should the sentence root be included? Set to FALSE by default.
combine - boolean. Should the other tables (dependencies, sentences, and entities) be merged with the tokens? Set to FALSE by default.
remove_na - boolean. Should columns containing only missing (NA) values be removed? This is mostly useful when working with the combine option, and by default is equal to whatever combine is set to.
spaces - boolean. Should a column be included that gives the number of spaces that come after each word? Useful for reconstructing the original text.
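As a rough sketch of how the spaces option can be used to rebuild the original text, the tokens of each sentence can be pasted back together with their trailing spaces. This is only a sketch: the name of the added column is assumed here to be spaces, so check the returned table for the actual name.
require(dplyr)
data(obama)
# assumes spaces = TRUE adds a numeric column named "spaces" holding the
# number of spaces that follow each token (column name assumed, not documented here)
tokens <- cnlp_get_token(obama, spaces = TRUE)
tokens %>%
  group_by(id, sid) %>%
  summarize(text = paste0(word, strrep(" ", spaces), collapse = ""))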
Returns an object of class c("tbl_df", "tbl", "data.frame")
containing one row for every token in the corpus. When include_root is
TRUE, the root of each sentence is included as its own token.
The returned data frame includes at a minimum the following columns;
if remove_na has been selected, only the first four columns are
guaranteed to be in the output, with the rest depending on which
annotators were run:
"id" - integer. Id of the source document.
"sid" - integer. Sentence id, starting from 0.
"tid" - integer. Token id, with the root of the sentence starting at 0.
"word" - character. Raw word in the input text.
"lemma" - character. Lemmatized form the token.
"upos" - character. Universal part of speech code.
"pos" - character. Language-specific part of speech code; uses the Penn Treebank codes.
"cid" - integer. Character offset at the start of the word in the original document.
Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.
Toutanova, Kristina and Christopher D. Manning. 2000. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger. In: Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), pp. 63-70.
Toutanova, Kristina, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In: Proceedings of HLT-NAACL 2003, pp. 252-259.
# NOT RUN {
data(obama)
# find average sentence length from each address
require(dplyr)
cnlp_get_token(obama) %>%
  group_by(id, sid) %>%
  summarize(sentence_length = max(tid)) %>%
  summarize(avg_sentence_length = mean(sentence_length))
# }