quanteda (version 0.9.9-3)

sequences: find variable-length collocations with filtering

Description

This function automatically identifies contiguous collocations consisting of variable-length term sequences whose frequency is unlikely to have occurred by chance. The algorithm is based on Blaheta and Johnson's (2001) "Unsupervised Learning of Multi-Word Verbs".
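As a rough illustration of what "unlikely to have occurred by chance" means here, the sketch below scores a single candidate bigram with Dunning's log-likelihood ratio (G²) in base R. The counts are invented, and this is a generic association statistic, not the exact statistic that sequences() computes (which follows Blaheta and Johnson's approach); it only demonstrates the underlying idea of comparing observed co-occurrence counts against chance expectation.

```r
# Invented 2x2 contingency table for a candidate bigram (w1, w2):
# rows = w1 present/absent, cols = w2 present/absent, counts of token pairs
O <- matrix(c( 30,   70,     # w1 followed by w2 / w1 followed by other
              120, 9780),    # other followed by w2 / neither
            nrow = 2, byrow = TRUE)

# Expected counts under independence (chance co-occurrence)
E <- outer(rowSums(O), colSums(O)) / sum(O)

# Dunning's log-likelihood ratio: large values mean the observed
# co-occurrence frequency is unlikely under the independence model
g2 <- 2 * sum(O * log(O / E))
```

With these counts the bigram occurs 30 times against an expectation of 1.5, so G² is large and the pair would qualify as a collocation under any reasonable threshold.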

Usage

sequences(x, features, valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE, count_min = 2, nested = TRUE)

Arguments

x
a tokens object
features
a pattern for selecting the features to be located in sequences, interpreted according to valuetype
valuetype
how to interpret keyword expressions: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.
case_insensitive
if TRUE, ignore case when matching
count_min
minimum frequency of sequences
nested
if TRUE, collect sequences that are nested within longer sequences
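The valuetype argument controls how the features pattern is interpreted. The base-R sketch below illustrates the three matching styles on invented feature strings (quanteda does this matching internally; glob2rx() from the utils package shows the regular expression a glob pattern corresponds to):

```r
# Invented feature types for illustration
feats <- c("President", "Presidents", "presidential")

# "glob": wildcard-style pattern; glob2rx() shows its regex equivalent
glob_as_regex <- utils::glob2rx("Presiden*")   # "^Presiden"
m_glob <- grep(glob_as_regex, feats, value = TRUE)

# "regex": the pattern is used as a regular expression directly
m_regex <- grep("^Presiden", feats, value = TRUE)

# "fixed": exact matching of the whole feature string
m_fixed <- feats[feats == "President"]
```

Here m_glob and m_regex both select "President" and "Presidents", while m_fixed selects only the exact type "President".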

References

Blaheta, D., & Johnson, M. (2001). Unsupervised learning of multi-word verbs. Presented at the ACL/EACL Workshop on the Computational Extraction, Analysis and Exploitation of Collocations.

Examples

toks <- tokens(corpus_segment(data_corpus_inaugural, what = "sentence"))
toks <- tokens_select(toks, stopwords("english"), "remove", padding = TRUE)

# extract multi-part proper nouns (capitalized terms)
seqs <- sequences(toks, "^([A-Z][a-z\\-]{2,})", valuetype = "regex",
                  case_insensitive = FALSE)
head(seqs, 10)

# types can be any (lower-cased) words
seqs2 <- sequences(toks, "^([a-z]+)$", valuetype = "regex",
                   case_insensitive = FALSE, count_min = 10)
head(seqs2, 10)
