quanteda (version 0.9.9-50)

sequences: find variable-length collocations with filtering

Description

This function automatically identifies contiguous collocations consisting of variable-length term sequences whose frequency is unlikely to have occurred by chance. The algorithm is based on Blaheta and Johnson's (2001) "Unsupervised Learning of Multi-Word Verbs".
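
For instance, a minimal call might look like the following (a sketch; the text and pattern are purely illustrative):

library("quanteda")
txt <- "The United States Congress convened in the United States Congress building."
# find capitalized multi-word sequences occurring at least twice (the default min_count)
sequences(tokens(txt), "^[A-Z]", valuetype = "regex", case_insensitive = FALSE)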

Usage

sequences(x, features = "*", valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE, min_count = 2, max_size = 5, nested = TRUE,
  ordered = FALSE)

is.sequences(x)

Arguments

x
a tokens object
features
a pattern for selecting the features to be located in sequences; how the pattern is interpreted is determined by valuetype
valuetype
how to interpret keyword expressions: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.
case_insensitive
ignore case when matching, if TRUE
min_count
minimum frequency of sequences for which parameters are estimated
max_size
maximum length of sequences to be collected
nested
if TRUE, collect all subsequences of a longer sequence as separate entities. For example, in a sequence of capitalized words "United States Congress", "States Congress" is counted as a subsequence, but "United States" is not, because it is immediately followed by "Congress". (See the sketch after this list.)
ordered
if TRUE, use the Blaheta-Johnson method, which distinguishes the order of words and tends to promote rare sequences.
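
The effect of nested can be illustrated with a toy example (a hedged sketch; the toy text is illustrative, and the printed output depends on this version):

library("quanteda")
toks <- tokens(c("United States Congress", "United States Congress"))
# nested = TRUE also collects the subsequence "States Congress"
sequences(toks, "^[A-Z]", valuetype = "regex", case_insensitive = FALSE, nested = TRUE)
# nested = FALSE keeps only the maximal sequence "United States Congress"
sequences(toks, "^[A-Z]", valuetype = "regex", case_insensitive = FALSE, nested = FALSE)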

Value

sequences returns an object of class sequences: the discovered sequences, together with their frequencies and estimated association statistics.

is.sequences returns TRUE if the object is of class sequences, FALSE otherwise.
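
A quick sketch of the class check (the toy tokens object is illustrative):

library("quanteda")
toks <- tokens(c("United States Congress", "United States Congress"))
seqs <- sequences(toks, "^[A-Z]", valuetype = "regex", case_insensitive = FALSE)
is.sequences(seqs)  # TRUE
is.sequences(toks)  # FALSE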

References

Blaheta, D., & Johnson, M. (2001). Unsupervised learning of multi-word verbs. Presented at the ACL/EACL 2001 Workshop on the Computational Extraction, Analysis and Exploitation of Collocations. http://web.science.mq.edu.au/~mjohnson/papers/2001/dpb-colloc01.pdf

Examples

library("quanteda")

# tokenize the inaugural address corpus at the sentence level
toks <- tokens(corpus_segment(data_corpus_inaugural, what = "sentence"))
# remove stopwords, leaving pads so that sequences cannot span removed tokens
toks <- tokens_select(toks, stopwords("english"), "remove", padding = TRUE)

# extracting multi-part proper nouns (capitalized terms)
seqs <- sequences(toks, "^([A-Z][a-z\\-]{2,})", valuetype="regex", case_insensitive = FALSE)
head(seqs, 10)

# compound the discovered sequences; this is more efficient when applied to
# the same tokens object from which the sequences were discovered
toks_comp <- tokens_compound(toks, seqs)
toks_comp_ir <- tokens_compound(tokens(data_corpus_irishbudget2010), seqs)
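
# a hedged check (assuming the default "_" concatenator of tokens_compound):
# compounded sequences now appear as single tokens such as "United_States"
tokens_select(toks_comp, "*_*")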

# the matched features can be any word types, not only capitalized terms
seqs2 <- sequences(toks, "^([a-z]+)$", valuetype="regex", case_insensitive = FALSE, 
                   min_count = 2, ordered = TRUE)
                   
head(seqs2, 10)

# convert to tokens object
as.tokens(seqs2)
