unnest_ngrams: Wrapper around unnest_tokens for n-grams
Description
These functions are wrappers around unnest_tokens(token = "ngrams")
and unnest_tokens(token = "skip_ngrams").
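As a minimal sketch of what "wrapper" means here, the snippet below compares the wrapper with the corresponding unnest_tokens() call. The data frame d and the column names txt and bigram are made up for illustration and are not part of the package.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-row data frame, used only for illustration
d <- tibble(txt = "the quick brown fox jumps over the lazy dog")

# Bigrams via the wrapper
via_wrapper <- d %>%
  unnest_ngrams(bigram, txt, n = 2)

# The same request spelled out through unnest_tokens()
via_tokens <- d %>%
  unnest_tokens(bigram, txt, token = "ngrams", n = 2)

# Both calls are expected to produce the same bigram column
same_bigrams <- identical(via_wrapper$bigram, via_tokens$bigram)
```

The wrapper mainly saves typing the token argument and makes the n-gram-specific arguments visible in the function signature.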
Usage
unnest_ngrams(
  tbl,
  output,
  input,
  n = 3L,
  n_min = n,
  ngram_delim = " ",
  format = c("text", "man", "latex", "html", "xml"),
  to_lower = TRUE,
  drop = TRUE,
  collapse = NULL,
  ...
)
unnest_skip_ngrams(
  tbl,
  output,
  input,
  n_min = 1,
  n = 3,
  k = 1,
  format = c("text", "man", "latex", "html", "xml"),
  to_lower = TRUE,
  drop = TRUE,
  collapse = NULL,
  ...
)
Arguments
tbl
A data frame.
output
Output column to be created as string or symbol.
input
Input column that gets split as string or symbol.
The output/input arguments are passed by expression and support
quasiquotation; you can unquote strings and symbols.
n
The number of words in the n-gram. This must be an integer greater
than or equal to 1.
n_min
The minimum number of words in the n-gram. This must be an integer
greater than or equal to 1, and less than or equal to n.
ngram_delim
The separator between words in an n-gram.
k
For the skip n-gram tokenizer, the maximum skip distance between
words. The function will compute all skip n-grams between 0 and k.
format
Either "text", "man", "latex", "html", or "xml". If not "text",
this uses the hunspell tokenizer, and can tokenize only by "word".
to_lower
Whether to convert tokens to lowercase. If tokens include
URLs (such as with token = "tweets"), the lowercased URLs may no
longer be correct.
drop
Whether original input column should get dropped. Ignored
if the original input and new output column have the same name.
collapse
Whether to combine text with newlines first in case tokens
(such as sentences or paragraphs) span multiple lines. If NULL, collapses
when token method is "ngrams", "skip_ngrams", "sentences", "lines",
"paragraphs", or "regex".
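The effect of n_min, ngram_delim, and drop described above can be sketched on a small input. The data frame d and the column names txt and ngram are hypothetical and chosen only for illustration.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-row data frame, used only for illustration
d <- tibble(txt = "The quick brown fox")

# n = 2 with n_min = 1 returns unigrams and bigrams together:
# four single words plus three word pairs
mixed <- d %>%
  unnest_ngrams(ngram, txt, n = 2, n_min = 1)

# ngram_delim changes the separator placed between the words of each n-gram
underscored <- d %>%
  unnest_ngrams(ngram, txt, n = 2, ngram_delim = "_")

# drop = FALSE keeps the original txt column next to the new ngram column
kept <- d %>%
  unnest_ngrams(ngram, txt, n = 2, drop = FALSE)
```

Because to_lower defaults to TRUE, the tokens in all three results are lowercased even though the input starts with a capital letter.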
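Examples
library(dplyr)
library(tidytext)
library(janeaustenr)

d <- tibble(txt = prideprejudice)

# Bigrams
d %>%
  unnest_ngrams(word, txt, n = 2)

# Skip n-grams of up to three words, skipping at most one word between tokens
d %>%
  unnest_skip_ngrams(word, txt, n = 3, k = 1)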
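To make the difference between ordinary and skip n-grams easier to see than on a full novel, here is a small sketch on a four-word input. The data frame toy and the column names txt and gram are hypothetical; n_min is set to 2 so that only two-word results are compared.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-row data frame, used only for illustration
toy <- tibble(txt = "the quick brown fox")

# Ordinary bigrams: adjacent word pairs only
plain <- toy %>%
  unnest_ngrams(gram, txt, n = 2)

# Skip bigrams with k = 1: pairs may also jump over one word,
# so "the brown" is expected alongside the adjacent pairs
skipped <- toy %>%
  unnest_skip_ngrams(gram, txt, n = 2, n_min = 2, k = 1)

# Every ordinary bigram should also appear among the skip bigrams
all_plain_present <- all(plain$gram %in% skipped$gram)
```

Larger values of k and n increase the number of combinations quickly, so it can help to try the settings on a small sample first.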