findSequences: find sequences of tokens

Description

This function automatically identifies sequences of tokens. The algorithm is based on Blaheta and Johnson's “Unsupervised Learning of Multi-Word Verbs”.

Usage

findSequences(x, tokens, count_min, smooth = 0.001, nested = TRUE)

Arguments

x

tokenizedTexts object

tokens

types of tokens in sequences

count_min

minimum frequency of sequences

smooth

smoothing factor

nested

collect nested sub-sequences

Examples

sents <- tokenize(inaugCorpus, what = "sentence", simplify = TRUE)
tokens <- tokenize(sents, removePunct = TRUE)
types <- unique(unlist(tokens))

# Extracting multi-part nouns
types_upper <- types[stringi::stri_detect_regex(types, "^([A-Z][a-z\\-]{2,})")]
seqs <- findSequences(tokens, types_upper, count_min=2)
head(seqs, 20)

# Types can be any words
types_lower <- types[stringi::stri_detect_regex(types, "^([a-z]+)$") & !types %in% stopwords()]
seqs2 <- findSequences(tokens, types_lower, count_min=10)
head(seqs2, 20)
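
To make the roles of the tokens and count_min arguments more concrete, the following base R sketch counts contiguous runs of tokens whose types belong to a given set and keeps only the runs that occur at least count_min times. It is an illustration only: count_runs and the toy data are hypothetical and not part of the package, and the sketch uses plain frequency counts rather than the scoring of the algorithm cited in the Description.

```r
# Hypothetical helper, not part of the package: collect maximal contiguous
# runs (length >= 2) of tokens found in `types`, then keep the runs that
# occur at least `count_min` times.
count_runs <- function(x, types, count_min = 1) {
  runs <- unlist(lapply(x, function(sent) {
    keep <- sent %in% types
    r <- rle(keep)                     # run-length encoding of matches
    ends <- cumsum(r$lengths)          # last position of each run
    starts <- ends - r$lengths + 1     # first position of each run
    idx <- which(r$values & r$lengths >= 2)
    vapply(idx, function(i) paste(sent[starts[i]:ends[i]], collapse = " "),
           character(1))
  }))
  counts <- table(runs)
  counts[counts >= count_min]
}

# Toy input: a list of tokenized sentences (assumed data for illustration)
toy <- list(
  c("the", "United", "States", "of", "America"),
  c("United", "States", "citizens"),
  c("in", "New", "York")
)
caps <- c("United", "States", "America", "New", "York")

res <- count_runs(toy, caps, count_min = 2)
print(res)
```

With count_min = 2, "New York" is dropped because it appears only once in the toy data, while "United States" is kept.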
