unnest_ngrams

unnest_skip_ngrams

Output column to be created as string or symbol.

output

Input column that gets split as string or symbol.
The output/input arguments are passed by expression and support
<a rd-options="rlang" href="/link/quasiquotation?package=tidytext&version=0.2.6&to=rlang" data-mini-rdoc="rlang::quasiquotation">quasiquotation</a>; you can unquote strings and symbols.

input

The number of words in the n-gram. This must be an integer greater
than or equal to 1.

The minimum number of words in the n-gram. This must be an integer
greater than or equal to 1, and less than or equal to <code>n</code>.

n_min
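
As a rough illustration of how <code>n</code> and <code>n_min</code> interact, the sketch below requests every n-gram from one to three words. The tibble <code>d</code> and the column names <code>txt</code> and <code>ngram</code> are made up for this example, and the code assumes the tibble and tidytext packages are installed.

```r
library(tibble)
library(tidytext)

# Hypothetical one-row input; `txt` is an illustrative column name.
d <- tibble(txt = "a b c")

# n_min = 1 and n = 3 request unigrams, bigrams, and trigrams together.
grams <- unnest_ngrams(d, ngram, txt, n = 3, n_min = 1)

# One row per n-gram; the input column is dropped by default.
grams$ngram
```

Leaving <code>n_min</code> at its default (equal to <code>n</code>) would return only the three-word n-gram here.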

The separator between words in an n-gram.

ngram_delim

Either "text", "man", "latex", "html", or "xml". If not "text",
this uses the hunspell tokenizer, and can tokenize only by "word".

format

Whether to convert tokens to lowercase. If tokens include
URLs (such as with <code>token = "tweets"</code>), the lowercased URLs may no
longer be correct.

to_lower

Whether original input column should get dropped. Ignored
if the original input and new output column have the same name.

drop

Whether to combine text with newlines first in case tokens
(such as sentences or paragraphs) span multiple lines. If NULL, collapses
when token method is "ngrams", "skip_ngrams", "sentences", "lines",
"paragraphs", or "regex".

collapse

Extra arguments passed on to <a rd-options="tokenizers" href="/link/tokenizers?package=tidytext&version=0.2.6&to=tokenizers" data-mini-rdoc="tokenizers::tokenizers">tokenizers</a>

For the skip n-gram tokenizer, the maximum skip distance between
words. The function will compute all skip n-grams between <code>0</code> and
<code>k</code>.

These functions are wrappers around <code>unnest_tokens(token = "ngrams")</code>
and <code>unnest_tokens(token = "skip_ngrams")</code>.
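
A minimal sketch of that relationship, assuming the tibble and tidytext packages are installed: the wrapper call and the equivalent <code>unnest_tokens()</code> call should yield the same bigrams. The tibble <code>d</code> and the column names <code>txt</code>, <code>bigram</code>, and <code>gram</code> are illustrative only.

```r
library(tibble)
library(tidytext)

# Hypothetical one-row input; `txt` is an illustrative column name.
d <- tibble(txt = "the quick brown fox")

# Bigrams through the wrapper and through unnest_tokens() directly.
via_wrapper <- unnest_ngrams(d, bigram, txt, n = 2)
via_tokens  <- unnest_tokens(d, bigram, txt, token = "ngrams", n = 2)

# Compare the resulting token columns.
identical(via_wrapper$bigram, via_tokens$bigram)

# Skip n-grams of up to two words, allowing at most one skipped word.
skips <- unnest_skip_ngrams(d, gram, txt, n = 2, k = 1)
skips$gram
```

The wrappers mainly save typing the <code>token</code> argument; other arguments such as <code>n</code>, <code>n_min</code>, and <code>k</code> are given the same way in either form.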

Using tidy data principles can make many text mining tasks easier,
more effective, and consistent with tools already in wide use. Much of the
infrastructure needed for text mining with tidy data frames already exists
in packages like 'dplyr', 'broom', 'tidyr', and 'ggplot2'. In this package,
we provide functions and supporting data sets to allow conversion of text
to and from tidy formats, and to switch seamlessly between tidy tools and
existing text mining packages.

Julia Silge

tidytext

Text Mining using 'dplyr', 'ggplot2', and Other Tidy Tools

Gabriela De Queiroz

Colin Fay

Emil Hvitfeldt

Os Keyes

Kanishka Misra

Tim Mastny

Jeff Erickson

David Robinson

unnest_ngrams function

unnest_ngrams: Wrapper around unnest_tokens for n-grams
