ngrams: Create ngrams

Description

Create a set of ngrams (words in sequence) from tokenized text(s)

Usage

ngrams(x, ...)
## S3 method for class 'character':
ngrams(x, n = 2, window = 1, concatenator = "_", ...)
## S3 method for class 'tokenizedTexts':
ngrams(x, n = 2, window = 1, concatenator = "_",
  ...)
skipgrams(x, ...)
## S3 method for class 'character':
skipgrams(x, n = 2, k = 1, concatenator = "_", ...)
## S3 method for class 'tokenizedTexts':
skipgrams(x, n = 2, k = 1, concatenator = "_",
  ...)

Arguments

x

a tokenizedTexts object or a character vector of tokens

...

additional arguments passed to mclapply, which applies ngrams.character() to each element of the tokenizedTexts list object

n

integer vector specifying the number of elements to be concatenated in each ngram

window

integer vector specifying the adjacency width for tokens forming the ngrams, default is 1 for only immediately neighbouring words

concatenator

character for combining words, default is _ (underscore) character

k

for skip-grams only, k is the

Value

a tokenizedTexts object consisting of a list of character vectors of ngrams, one list element per text, or a character vector if called on a simple character vector

Details

Normally, ngrams will be called through tokenize, but these functions are also exported in case a user wants to perform lower-level ngram construction on tokenized texts.

skipgrams is a wrapper to ngrams that simply passes through a window value of 1:(k+1). This conforms to the definition of skip-grams found in Guthrie et al. (2006): a $k$-skip-gram is an ngram which is a superset of all ngrams and each $(k-i)$-skip-gram until $(k-i)==0$ (which includes 0-skip-grams).
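The relationship between window and k described above can be sketched in base R. This is an illustrative sketch only, not the package's implementation: sketch_ngrams and sketch_skipgrams are hypothetical helper names, and the sketch assumes that a window value w joins tokens spaced exactly w positions apart (a uniform gap). For n = 2 this reproduces the bigram case of the Guthrie et al. definition; for n > 2 the package's actual output may differ.

```r
# Hypothetical helper (not part of the package): build ngrams from tokens
# spaced `window` positions apart, assuming a uniform gap between tokens.
sketch_ngrams <- function(tokens, n = 2, window = 1, concatenator = "_") {
  out <- character(0)
  for (w in window) {
    span <- (n - 1) * w                  # distance from first to last token
    if (length(tokens) <= span) next     # too few tokens for this gap
    starts <- seq_len(length(tokens) - span)
    grams <- vapply(starts, function(i) {
      paste(tokens[i + (0:(n - 1)) * w], collapse = concatenator)
    }, character(1))
    out <- c(out, grams)
  }
  out
}

# Hypothetical wrapper mirroring the description above:
# a k-skip-gram request becomes a window of 1:(k + 1).
sketch_skipgrams <- function(tokens, n = 2, k = 1, concatenator = "_") {
  sketch_ngrams(tokens, n = n, window = 1:(k + 1), concatenator = concatenator)
}

toks <- c("insurgents", "killed", "in", "ongoing", "fighting")
res <- sketch_skipgrams(toks, n = 2, k = 2, concatenator = " ")
print(res)
```

With five tokens, n = 2 and k = 2, the sketch yields nine bigrams: four adjacent pairs, three pairs that skip one token, and two pairs that skip two tokens.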

References

Guthrie, D., B. Allison, W. Liu, and L. Guthrie. 2006. "A Closer Look at Skip-Gram Modelling." http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf

Examples

ngrams(LETTERS, n = 2, window = 2)
ngrams(LETTERS, n = 3, window = 2)
ngrams(LETTERS, n = 3, window = 3)

tokens <- tokenize("the quick brown fox jumped over the lazy dog.",
                   removePunct = TRUE, simplify = TRUE)
ngrams(tokens, n = 1:3)
ngrams(tokens, n = c(2,4), window = 1:2, concatenator = "")

# skipgrams
tokens <- tokenize(toLower("Insurgents killed in ongoing fighting."),
                   removePunct = TRUE, simplify = TRUE)
skipgrams(tokens, n = 2, k = 2, concatenator = "")
skipgrams(tokens, n = 3, k = 2, concatenator = "")