unnest_tokens

Output column to be created as string or symbol.

output

Input column that gets split as string or symbol.
The output/input arguments are passed by expression and support
<a rd-options="rlang" href="/link/quasiquotation?package=tidytext&version=0.2.6&to=rlang" data-mini-rdoc="rlang::quasiquotation">quasiquotation</a>; you can unquote strings and symbols.

input
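
The unquoting mentioned above can be sketched as follows. This is a minimal illustration, not taken from the package documentation: the data frame `d` and the variables `out_col` and `in_col` are hypothetical names introduced here, and `rlang::sym()` is used to build the symbol.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-row example data
d <- tibble(txt = "Tidy text is one token per row")

out_col <- "word"             # output column name held as a string
in_col  <- rlang::sym("txt")  # input column name held as a symbol

# Unquote both with !! so the stored names are used as column names
tokens <- d %>% unnest_tokens(!!out_col, !!in_col)
tokens
```

This pattern is mainly useful when column names are only known at run time, for example inside a wrapper function.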

Unit for tokenizing, or a custom tokenizing function. Built-in
options are "words" (default), "characters", "character_shingles", "ngrams",
"skip_ngrams", "sentences", "lines", "paragraphs", "regex", "tweets"
(tokenization by word that preserves usernames, hashtags, and URLs), and
"ptb" (Penn Treebank). If a function, it should take a character vector and
return a list of character vectors of the same length.

token
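
A custom tokenizing function only has to honor the contract described above: character vector in, list of character vectors of the same length out. The sketch below assumes a hypothetical helper, `split_on_space`, built on base R's `strsplit()`; it is an illustration, not a tokenizer shipped with the package.

```r
library(dplyr)
library(tidytext)

# Hypothetical example data
d <- tibble(txt = "Hello, world! Hello again.")

# Hypothetical custom tokenizer: splits on single spaces only.
# strsplit() returns a list with one character vector per input element.
split_on_space <- function(x) strsplit(x, " ", fixed = TRUE)

tokens <- d %>% unnest_tokens(word, txt, token = split_on_space)
tokens
```

Because this tokenizer splits only on spaces, punctuation stays attached to the tokens, unlike the built-in "words" option.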

Either "text", "man", "latex", "html", or "xml". If not "text",
this uses the hunspell tokenizer and can tokenize only by "word".

format

Whether to convert tokens to lowercase. If tokens include
URLs (such as with <code>token = "tweets"</code>), the lowercased URLs may no
longer be correct.

to_lower

Whether original input column should get dropped. Ignored
if the original input and new output column have the same name.

drop
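
As a rough illustration of the two flags above, the sketch below turns off both lowercasing and dropping. The data frame `d` and its contents are hypothetical.

```r
library(dplyr)
library(tidytext)

# Hypothetical example data
d <- tibble(txt = "Tidy Text Mining")

# Keep the original casing and keep the input column alongside the tokens
kept <- d %>% unnest_tokens(word, txt, to_lower = FALSE, drop = FALSE)
kept
```

With `drop = FALSE`, the full input string is repeated on every token row, which can be handy for checking tokens against their source text.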

Whether to combine text with newlines first in case tokens
(such as sentences or paragraphs) span multiple lines. If NULL, collapses
when token method is "ngrams", "skip_ngrams", "sentences", "lines",
"paragraphs", or "regex".

collapse

Extra arguments passed on to <a rd-options="tokenizers" href="/link/tokenizers?package=tidytext&version=0.2.6&to=tokenizers" data-mini-rdoc="tokenizers::tokenizers">tokenizers</a>, such
as <code>strip_punct</code> for "words" and "tweets", <code>n</code> and <code>k</code> for
"ngrams" and "skip_ngrams", <code>strip_url</code> for "tweets", and
<code>pattern</code> for "regex".
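
The extra arguments can be sketched with two of the options named above: `n` for "ngrams" and `strip_punct` for "words". The example data are hypothetical, and a single-row input is used so that the result does not depend on how `collapse` combines rows.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-row example data
d <- tibble(txt = "the quick brown fox jumps")

# n is passed through to the n-gram tokenizer: here, bigrams
bigrams <- d %>% unnest_tokens(bigram, txt, token = "ngrams", n = 2)

# strip_punct is passed through to the word tokenizer
p <- tibble(txt = "Hello, world!")
no_punct   <- p %>% unnest_tokens(word, txt)
with_punct <- p %>% unnest_tokens(word, txt, strip_punct = FALSE)
```

With `strip_punct = FALSE`, punctuation marks are kept as tokens, so `with_punct` has more rows than `no_punct`.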

Split a column into tokens using the tokenizers package, reshaping the table
to one token per row. This function supports non-standard evaluation
through the tidyeval framework.
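
A minimal sketch of the basic call, assuming a small hypothetical data frame `d` with a `line` identifier and a `txt` column of raw text:

```r
library(dplyr)
library(tidytext)

# Hypothetical example data: one row per line of text
d <- tibble(
  line = 1:2,
  txt  = c("The quick brown fox", "jumps over the lazy dog")
)

# Split `txt` into words; the new column is named `word`.
# Column names are given unquoted (non-standard evaluation).
words <- d %>% unnest_tokens(word, txt)
words
```

With the defaults, tokens are lowercased and the input column `txt` is dropped, while other columns such as `line` are carried along on each token row.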

Using tidy data principles can make many text mining tasks easier,
more effective, and consistent with tools already in wide use. Much of the
infrastructure needed for text mining with tidy data frames already exists
in packages like 'dplyr', 'broom', 'tidyr', and 'ggplot2'. In this package,
we provide functions and supporting data sets to allow conversion of text
to and from tidy formats, and to switch seamlessly between tidy tools and
existing text mining packages.

Julia Silge

tidytext

Text Mining using 'dplyr', 'ggplot2', and Other Tidy Tools

Gabriela De Queiroz

Colin Fay

Emil Hvitfeldt

Os Keyes

Kanishka Misra

Tim Mastny

Jeff Erickson

David Robinson
