These functions are wrappers around unnest_tokens(token = "sentences"), unnest_tokens(token = "lines"), and unnest_tokens(token = "paragraphs").
unnest_sentences(
  tbl,
  output,
  input,
  strip_punct = FALSE,
  format = c("text", "man", "latex", "html", "xml"),
  to_lower = TRUE,
  drop = TRUE,
  collapse = NULL,
  ...
)

unnest_lines(
  tbl,
  output,
  input,
  format = c("text", "man", "latex", "html", "xml"),
  to_lower = TRUE,
  drop = TRUE,
  collapse = NULL,
  ...
)

unnest_paragraphs(
  tbl,
  output,
  input,
  paragraph_break = "\n\n",
  format = c("text", "man", "latex", "html", "xml"),
  to_lower = TRUE,
  drop = TRUE,
  collapse = NULL,
  ...
)
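Since these are wrappers, each call is interchangeable with the corresponding unnest_tokens() call. A minimal sketch, assuming the dplyr, tidytext, and janeaustenr packages are available:

library(dplyr)
library(tidytext)
library(janeaustenr)

d <- tibble(txt = prideprejudice)

# these two calls tokenize the text into the same sentences
d %>% unnest_sentences(sentence, txt)
d %>% unnest_tokens(sentence, txt, token = "sentences")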
tbl: A data frame.

output: Output column to be created, as a string or symbol.

input: Input column that gets split, as a string or symbol. The output and input arguments are passed by expression and support quasiquotation; you can unquote strings and symbols.

strip_punct: Should punctuation be stripped?

format: Either "text", "man", "latex", "html", or "xml". If not "text", this uses the hunspell tokenizer and can only tokenize by "word".

to_lower: Whether to convert tokens to lowercase. If tokens include URLs (such as with token = "tweets"), such converted URLs may no longer be correct.

drop: Whether the original input column should be dropped. Ignored if the original input and new output column have the same name.

collapse: Whether to combine text with newlines first, in case tokens (such as sentences or paragraphs) span multiple lines. If NULL, collapses when the token method is "ngrams", "skip_ngrams", "sentences", "lines", "paragraphs", or "regex".

...: Extra arguments passed on to the tokenizers package.

paragraph_break: A string identifying the boundary between two paragraphs.
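To illustrate how paragraph_break and collapse interact, here is a small sketch (the data and column names are invented for illustration): with the default collapse = NULL, unnest_paragraphs() first joins the rows with newlines, so a blank row produces the "\n\n" boundary that paragraph_break looks for by default.

library(dplyr)
library(tidytext)

# two paragraphs whose lines are split across several rows
d <- tibble(txt = c("First paragraph, first line.",
                    "First paragraph, second line.",
                    "",
                    "Second paragraph."))

# rows are combined with newlines before tokenizing, so the empty
# row becomes the default "\n\n" boundary, yielding two paragraphs
d %>% unnest_paragraphs(paragraph, txt)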
library(dplyr)
library(janeaustenr)
library(tidytext)

d <- tibble(txt = prideprejudice)
d %>%
  unnest_sentences(word, txt)