quanteda (version 0.8.4-2)

ngrams: Create ngrams

Description

Create a set of ngrams (words in sequence) from tokenized text(s)

Usage

ngrams(x, ...)

## S3 method for class 'character'
ngrams(x, n = 2, window = 1, concatenator = "_", ...)

## S3 method for class 'tokenizedTexts'
ngrams(x, n = 2, window = 1, concatenator = "_", ...)

skipgrams(x, ...)

## S3 method for class 'character'
skipgrams(x, n = 2, k = 1, concatenator = "_", ...)

## S3 method for class 'tokenizedTexts'
skipgrams(x, n = 2, k = 1, concatenator = "_", ...)

Arguments

x
a tokenizedTexts object or a character vector of tokens
...
additional arguments passed to mclapply, which applies ngrams.character() to each element of the tokenizedTexts list object
n
integer vector specifying the number of elements to be concatenated in each ngram
window
integer vector specifying the adjacency width for tokens forming the ngrams, default is 1 for only immediately neighbouring words
concatenator
character for combining words, default is _ (underscore) character
k
for skip-grams only, k is the

Value

  • a tokenizedTexts object consisting of a list of character vectors of ngrams, one list element per text, or a character vector if called on a simple character vector

Details

Normally, ngrams will be called through tokenize, but these functions are also exported in case a user wants to perform lower-level ngram construction on tokenized texts.

skipgrams is a wrapper around ngrams that simply passes through a window value of 1:(k+1), conforming to the definition of skip-grams found in Guthrie et al. (2006): a $k$-skip-gram is an ngram set which is a superset of all ngrams and of each $(k-i)$-skip-gram until $(k-i) = 0$ (which includes 0-skip-grams).
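The relationship between k and window can be illustrated with a minimal base R sketch. It is not quanteda's implementation: window_ngrams() and skip_sketch() are hypothetical helpers, and the sketch assumes that window means a uniform spacing between successive tokens of an ngram. Under that assumption the bigram case reproduces the 2-skip-bigram example of Guthrie et al. (2006), but for n greater than 2 a uniform spacing does not enumerate mixed skip patterns, so results may differ from the package.

```r
# Hypothetical helper, not part of quanteda: build ngrams whose successive
# tokens are `window` positions apart (window = 1 means adjacent tokens).
window_ngrams <- function(tokens, n = 2, window = 1, concatenator = "_") {
    out <- character(0)
    for (w in window) {
        for (size in n) {
            span <- (size - 1) * w            # distance from first to last token
            if (length(tokens) <= span) next  # too few tokens for this combination
            starts <- seq_len(length(tokens) - span)
            offsets <- (seq_len(size) - 1) * w
            grams <- vapply(
                starts,
                function(i) paste(tokens[i + offsets], collapse = concatenator),
                character(1)
            )
            out <- c(out, grams)
        }
    }
    out
}

# Hypothetical wrapper mirroring the description above (an assumption, not the
# package source): k skips are translated into window = 1:(k + 1).
skip_sketch <- function(tokens, n = 2, k = 1, concatenator = "_") {
    window_ngrams(tokens, n = n, window = 1:(k + 1), concatenator = concatenator)
}

toks <- c("insurgents", "killed", "in", "ongoing", "fighting")
skip_sketch(toks, n = 2, k = 2)
```

With k = 2 the sketch collects bigrams at spacings 1, 2 and 3, so the adjacent bigrams (the 0-skip case) are always a subset of the result.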

References

Guthrie, D., B. Allison, W. Liu, and L. Guthrie. 2006. "A Closer Look at Skip-Gram Modelling." http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf

Examples

ngrams(LETTERS, n = 2, window = 2)
ngrams(LETTERS, n = 3, window = 2)
ngrams(LETTERS, n = 3, window = 3)

tokens <- tokenize("the quick brown fox jumped over the lazy dog.",
                   removePunct = TRUE, simplify = TRUE)
ngrams(tokens, n = 1:3)
ngrams(tokens, n = c(2,4), window = 1:2, concatenator = "")

# skipgrams
tokens <- tokenize(toLower("Insurgents killed in ongoing fighting."),
                   removePunct = TRUE, simplify = TRUE)
skipgrams(tokens, n = 2, k = 2, concatenator = "")
skipgrams(tokens, n = 3, k = 2, concatenator = "")
