tidytext used to offer twin versions of each verb suffixed with an underscore, like dplyr and the main tidyverse packages. These versions had standard evaluation (SE) semantics; rather than taking arguments by code, like NSE verbs, they took arguments by value. Their purpose was to make it possible to program with tidytext. However, tidytext now uses tidy evaluation semantics. NSE verbs still capture their arguments, but you can now unquote parts of these arguments. This offers full programmability with NSE verbs. Thus, the underscored versions are now superfluous.
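To see why the underscored versions are superfluous, here is a minimal sketch of programming with the NSE verb directly. The wrapper function `tokenize_column` and the sample data frame are hypothetical; the sketch assumes tidytext and rlang are installed.

```r
library(tidytext)
library(rlang)

# A hypothetical wrapper: tokenize a column chosen by the caller at runtime.
# Instead of unnest_tokens_(), the column names arrive as strings, are
# converted to symbols with sym(), and are unquoted with !! inside the
# NSE verb.
tokenize_column <- function(tbl, output, input) {
  unnest_tokens(tbl, !!sym(output), !!sym(input))
}

df <- data.frame(text = "tidy evaluation replaces SE verbs")
tokenize_column(df, "word", "text")
```

The same pattern works for the other verbs: any argument a user would type bare can be built up as a symbol or expression and unquoted.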
bind_tf_idf_(tbl, term, document, n)
cast_sparse_(data, row, column, value)
cast_tdm_(data, term, document, value, weighting = tm::weightTf, ...)
cast_dtm_(data, document, term, value, weighting = tm::weightTf, ...)
cast_dfm_(data, document, term, value, ...)
unnest_tokens_(tbl, output, input, token = "words", format = c("text",
"man", "latex", "html", "xml"), to_lower = TRUE, drop = TRUE,
collapse = NULL, ...)
tbl: A data frame.

term, document, n: Strings giving names of term, document, and count columns.

data: A tbl.

row: Column name to use as row names in sparse matrix, as string or symbol.

column: Column name to use as column names in sparse matrix, as string or symbol.

value: Column name to use as sparse matrix values (default 1), as string or symbol.

weighting: The weighting function for the DTM/TDM (default is term frequency, effectively unweighted).

...: Extra arguments to pass on to sparseMatrix.

term, document, value: Names of columns.

token: Unit for tokenizing, or a custom tokenizing function. Built-in options are "words" (default), "characters", "character_shingles", "ngrams", "skip_ngrams", "sentences", "lines", "paragraphs", "regex", "tweets" (tokenization by word that preserves usernames, hashtags, and URLs), and "ptb" (Penn Treebank). If a function, it should take a character vector and return a list of character vectors of the same length.

format: Either "text", "man", "latex", "html", or "xml". If not "text", this uses the hunspell tokenizer and can tokenize only by "word".

to_lower: Whether to convert tokens to lowercase. If tokens include URLs (such as with token = "tweets"), such converted URLs may no longer be correct.

drop: Whether the original input column should be dropped. Ignored if the original input and new output columns have the same name.

collapse: Whether to combine text with newlines first in case tokens (such as sentences or paragraphs) span multiple lines. If NULL, collapses when the token method is "ngrams", "skip_ngrams", "sentences", "lines", "paragraphs", or "regex".

output, input: Strings giving names of output and input columns.
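The token argument above accepts both built-in tokenizer names and a custom function. A short sketch, assuming tidytext is installed; the data frame and column names are hypothetical:

```r
library(tidytext)

df <- data.frame(txt = "one two three four")

# Default word tokenization: one row per word
unnest_tokens(df, word, txt)

# Bigrams: token = "ngrams" with n = 2; with collapse = NULL the text is
# combined with newlines first, since n-grams can span lines
unnest_tokens(df, bigram, txt, token = "ngrams", n = 2)

# A custom tokenizing function: takes a character vector and returns a
# list of character vectors of the same length
unnest_tokens(df, chunk, txt, token = function(x) strsplit(x, " +"))
```

Because drop = TRUE by default, the original txt column is removed from each result.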
Unquoting triggers immediate evaluation of its operand and inlines the result within the captured expression. This result can be a value or an expression to be evaluated later with the rest of the argument.
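A minimal illustration of that behavior with rlang (the variable names are arbitrary):

```r
library(rlang)

x <- 10

# !!x is evaluated immediately and its value inlined into the
# captured expression; y remains a symbol to be evaluated later
q <- expr(y + !!x)
q

# Supply y later and evaluate the rest of the expression
eval(q, list(y = 5))
```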