tokens_ngrams: Create n-grams and skip-grams from tokens

Description

Create a set of n-grams (tokens in sequence) from already tokenized text objects, with an optional skip argument to form skip-grams. Both the n-gram length and the skip lengths take vectors of arguments to form multiple lengths or skips in one pass. Implemented in C++ for efficiency.

Usage

tokens_ngrams(
  x,
  n = 2L,
  skip = 0L,
  concatenator = concat(x),
  verbose = quanteda_options("verbose")
)
char_ngrams(x, n = 2L, skip = 0L, concatenator = "_")
tokens_skipgrams(
  x,
  n,
  skip,
  concatenator = concat(x),
  verbose = quanteda_options("verbose")
)

Value

A tokens object consisting of a list of character vectors of n-grams, one list element per text, or a character vector if called on a simple character vector.

Arguments

x: a tokens object, or a character vector, or a list of characters
n: integer vector specifying the number of elements to be concatenated in each n-gram. Each element of this vector defines one value of \(n\) for the \(n\)-gram(s) that are produced.
skip: integer vector specifying the adjacency skip size for tokens forming the n-grams; the default, 0, uses only immediately neighbouring words. For skip-grams, skip can be a vector of integers, since the "classic" approach to forming skip-grams is to set skip = \(k\), where \(k\) is the distance for which \(k\) or fewer skips are used to construct the \(n\)-gram. Thus a "4-skip-n-gram", defined as skip = 0:4, produces results that include 4 skips, 3 skips, 2 skips, 1 skip, and 0 skips (where 0 skips are ordinary n-grams formed from adjacent words). See Guthrie et al. (2006).
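concatenator: character for combining words; defaults to the concatenator of x (concat(x)) in tokens_ngrams() and tokens_skipgrams(), and to the _ (underscore) character in char_ngrams()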
verbose: if TRUE print the number of tokens and documents before and after the function is applied. The number of tokens does not include paddings.
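
The way n and skip combine can be illustrated with a small base R sketch. This is not the package's C++ implementation: ngrams_sketch() is a hypothetical helper written for this illustration, and it assumes that each step between consecutive tokens of an n-gram may use any value in skip. The order of the results is not guaranteed to match what tokens_ngrams() returns.

```r
# Hypothetical helper (not part of quanteda): builds n-grams from a single
# character vector of tokens, allowing any value in `skip` at each step.
ngrams_sketch <- function(tokens, n = 2L, skip = 0L, concatenator = "_") {
  # Recursively extend a partial n-gram held as token positions `idx`
  extend <- function(idx, size) {
    if (length(idx) == size) {
      return(paste(tokens[idx], collapse = concatenator))
    }
    res <- character(0)
    for (s in skip) {
      nxt <- idx[length(idx)] + s + 1L   # skip `s` tokens, take the next one
      if (nxt <= length(tokens)) {
        res <- c(res, extend(c(idx, nxt), size))
      }
    }
    res
  }
  out <- character(0)
  for (size in n) {                      # one pass per requested n
    for (start in seq_along(tokens)) {   # every token can start an n-gram
      out <- c(out, extend(start, size))
    }
  }
  out
}

toks <- c("a", "b", "c", "d", "e")
ngrams_sketch(toks, n = 2, skip = 0)
# "a_b" "b_c" "c_d" "d_e"
ngrams_sketch(toks, n = 2, skip = 1)
# "a_c" "b_d" "c_e"
```

Passing a vector for n, such as n = 2:3, simply appends the results for each length in turn.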

Details

Normally, these functions will be called through tokens(x, ngrams = , ...), but they are provided in case a user wants to perform lower-level n-gram construction on tokenized texts.

tokens_skipgrams() is a wrapper to tokens_ngrams() that requires arguments to be supplied for both n and skip. For \(k\)-skip skip-grams, set skip to 0:\(k\), in order to conform to the definition of skip-grams found in Guthrie et al (2006): A \(k\) skip-gram is an n-gram which is a superset of all n-grams and each \((k-i)\) skip-gram until \((k-i)==0\) (which includes 0 skip-grams).
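
For bigrams, the superset property in this definition can be checked with a few lines of base R. The helper skip_bigrams_sketch() below is hypothetical and written only for this illustration; it does not call quanteda, and it covers only the n = 2 case, where "k or fewer skips" means the second token sits between 1 and k + 1 positions to the right of the first.

```r
# Hypothetical helper (not a quanteda function): k-skip bigrams of one
# character vector of tokens.
skip_bigrams_sketch <- function(tokens, k, concatenator = " ") {
  out <- character(0)
  for (i in seq_along(tokens)) {
    # candidate partners: 1 to k + 1 positions to the right of token i
    j <- i + seq_len(k + 1L)
    j <- j[j <= length(tokens)]
    out <- c(out, paste(tokens[rep(i, length(j))], tokens[j], sep = concatenator))
  }
  out
}

toks <- c("insurgents", "killed", "in", "ongoing", "fighting")
plain <- skip_bigrams_sketch(toks, k = 0)   # ordinary adjacent bigrams
skip1 <- skip_bigrams_sketch(toks, k = 1)
skip2 <- skip_bigrams_sketch(toks, k = 2)

length(skip2)          # 9
all(plain %in% skip2)  # TRUE: 0-skip bigrams are included
all(skip1 %in% skip2)  # TRUE: 1-skip bigrams are included
```

For n greater than 2 the bookkeeping is more involved, so this sketch should not be read as a description of how tokens_skipgrams() handles longer n-grams.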

References

Guthrie, David, Ben Allison, Wei Liu, Louise Guthrie, and Yorick Wilks. 2006. "A Closer Look at Skip-Gram Modelling." https://aclanthology.org/L06-1210/

Examples

library(quanteda)

# ngrams
tokens_ngrams(tokens(c("a b c d e", "c d e f g")), n = 2:3)

toks <- tokens(c(text1 = "the quick brown fox jumped over the lazy dog"))
tokens_ngrams(toks, n = 1:3)
tokens_ngrams(toks, n = c(2, 4), concatenator = " ")
tokens_ngrams(toks, n = c(2, 4), skip = 1, concatenator = " ")

# skipgrams
toks <- tokens("insurgents killed in ongoing fighting")
tokens_skipgrams(toks, n = 2, skip = 0:1, concatenator = " ")
tokens_skipgrams(toks, n = 2, skip = 0:2, concatenator = " ")
tokens_skipgrams(toks, n = 3, skip = 0:2, concatenator = " ")
