tidytext (version 0.2.2)

bind_tf_idf_: Deprecated SE version of functions

Description

Like dplyr and the other main tidyverse packages, tidytext used to offer a twin version of each verb, suffixed with an underscore. These versions had standard evaluation (SE) semantics: rather than capturing their arguments as code, as non-standard evaluation (NSE) verbs do, they took their arguments by value. Their purpose was to make it possible to program with tidytext. However, tidytext now uses tidy evaluation semantics. NSE verbs still capture their arguments, but you can now unquote parts of those arguments, which makes the NSE verbs fully programmable. The underscored versions are therefore superfluous.
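As a rough illustration of the migration described above, the sketch below calls the NSE verb `bind_tf_idf()` with column names held in strings, which is the situation `bind_tf_idf_()` was written for. The toy data and the variable names `term_col`, `doc_col`, and `n_col` are assumptions made for this example, and `sym()` and `!!` come from rlang.

```r
library(dplyr)
library(tidytext)
library(rlang)

# Toy term counts; the data and column names are illustrative only
counts <- tibble(
  doc  = c("a", "a", "b", "b"),
  word = c("cat", "dog", "cat", "fish"),
  n    = c(2, 1, 1, 3)
)

# Column names stored as strings, as they would have been passed to bind_tf_idf_()
term_col <- "word"
doc_col  <- "doc"
n_col    <- "n"

# Turn each string into a symbol with sym() and unquote it with !!
result <- bind_tf_idf(counts, !!sym(term_col), !!sym(doc_col), !!sym(n_col))
```

The result keeps the original columns and adds `tf`, `idf`, and `tf_idf`, just as if the bare column names had been typed directly.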

Usage

bind_tf_idf_(tbl, term, document, n)

cast_sparse_(data, row, column, value)

cast_tdm_(data, term, document, value, weighting = tm::weightTf, ...)

cast_dtm_(data, document, term, value, weighting = tm::weightTf, ...)

cast_dfm_(data, document, term, value, ...)

unnest_tokens_(tbl, output, input, token = "words", format = c("text", "man", "latex", "html", "xml"), to_lower = TRUE, drop = TRUE, collapse = NULL, ...)

Arguments

tbl

A data frame.

term, document, n

Strings giving names of term, document, and count columns.

data

A tbl

row

Column name to use as row names in sparse matrix, as string or symbol

column

Column name to use as column names in sparse matrix, as string or symbol

value

Column name to use as sparse matrix values (default 1), as string or symbol

weighting

The weighting function for the DTM/TDM (default is term-frequency, effectively unweighted)

...

Extra arguments to pass on to sparseMatrix

output, input

Names of the output and input columns.

token

Unit for tokenizing, or a custom tokenizing function. Built-in options are "words" (default), "characters", "character_shingles", "ngrams", "skip_ngrams", "sentences", "lines", "paragraphs", "regex", "tweets" (tokenization by word that preserves usernames, hashtags, and URLs), and "ptb" (Penn Treebank). If a function, it should take a character vector and return a list of character vectors of the same length.

format

Either "text", "man", "latex", "html", or "xml". If not "text", this uses the hunspell tokenizer and can tokenize only by "word".

to_lower

Whether to convert tokens to lowercase. If tokens include URLs (such as with token = "tweets"), the lowercased URLs may no longer be correct.

drop

Whether original input column should get dropped. Ignored if the original input and new output column have the same name.

collapse

Whether to combine text with newlines first in case tokens (such as sentences or paragraphs) span multiple lines. If NULL, collapses when token method is "ngrams", "skip_ngrams", "sentences", "lines", "paragraphs", or "regex".

output_col, input_col

Strings giving names of output and input columns.
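For code that still holds the output and input column names as strings, the following sketch shows one way to pass them to the NSE verb `unnest_tokens()` instead of `unnest_tokens_()`. The toy data frame and the values of `output_col` and `input_col` are assumptions for illustration only.

```r
library(dplyr)
library(tidytext)
library(rlang)

# Toy documents; the data are illustrative only
docs <- tibble(
  id  = 1:2,
  txt = c("The quick brown fox", "jumps over the lazy dog")
)

# Output and input column names stored as strings
output_col <- "word"
input_col  <- "txt"

# Convert the strings to symbols and unquote them in the NSE verb
tokens <- unnest_tokens(docs, !!sym(output_col), !!sym(input_col))
```

With the defaults (`token = "words"`, `to_lower = TRUE`, `drop = TRUE`), `tokens` has one lowercase word per row and the `txt` column is dropped.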

Details

Unquoting triggers immediate evaluation of its operand and inlines the result within the captured expression. This result can be a value or an expression to be evaluated later with the rest of the argument.
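The two cases mentioned above (unquoting a value versus unquoting an expression) can be sketched with rlang's `expr()`, `sym()`, and `eval_tidy()`. The variable names `threshold` and `col` and the toy data are assumptions made for this example.

```r
library(rlang)

# Unquoting a value: `threshold` is evaluated immediately and its
# value is inlined as a constant in the captured expression
threshold <- 10
e1 <- expr(n > !!threshold)

# Unquoting an expression: the symbol `n` is inlined and is only
# evaluated later, together with the rest of the expression
col <- sym("n")
e2 <- expr(sum(!!col))

# Evaluate the completed expression against some data
total <- eval_tidy(e2, data = list(n = c(1, 2, 3)))
```

Here `e1` is the call `n > 10` and `e2` is the call `sum(n)`; nothing involving `n` is computed until `eval_tidy()` runs.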