quanteda (version 0.9.8.5)

findSequences: find sequences of tokens

Description

This function automatically identifies sequences of tokens. The algorithm is based on Blaheta and Johnson's “Unsupervised Learning of Multi-Word Verbs”.

Usage

findSequences(x, tokens, count_min, smooth = 0.001, nested = TRUE)

Arguments

x
a tokenizedTexts object
tokens
types of tokens to consider for sequences
count_min
minimum frequency of sequences
smooth
smoothing factor
nested
whether to collect nested sub-sequences
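A minimal sketch of a call, assuming the quanteda 0.9.x API shown on this page (the example corpus and the two-sentence input here are made up for illustration; later quanteda releases removed findSequences() in favour of textstat_collocations()):

```r
library(quanteda)  # version 0.9.x assumed

# toy input: two sentences sharing a repeated capitalized sequence
toks <- tokenize(c("New York is in New York State.",
                   "The New York Times is based in New York."),
                 removePunct = TRUE)

# candidate types: all capitalized word types in the tokens
types <- unique(unlist(toks))
types_upper <- types[grepl("^[A-Z]", types)]

# sequences of capitalized tokens occurring at least twice
seqs <- findSequences(toks, types_upper, count_min = 2)
seqs
```

With this input, a sequence such as "New York" should be among the results, since it occurs more than `count_min` times.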

Examples

# split the corpus into sentences, then tokenize each sentence into words
sents <- tokenize(inaugCorpus, what = "sentence", simplify = TRUE)
tokens <- tokenize(sents, removePunct = TRUE)
# remove stopwords, leaving pads so sequences cannot span removed tokens
tokens <- selectFeatures(tokens, stopwords(), 'remove', padding = TRUE)
types <- unique(unlist(tokens))

# Extracting multi-part nouns
types_upper <- types[stringi::stri_detect_regex(types, "^([A-Z][a-z\\-]{2,})")]
seqs <- findSequences(tokens, types_upper, count_min=2)
head(seqs, 30)

# Types can be any words; here, lower-case types that are not stopwords
types_lower <- types[stringi::stri_detect_regex(types, "^([a-z]+)$") & !types %in% stopwords()]
seqs2 <- findSequences(tokens, types_lower, count_min=3)
head(seqs2, 20)
