Identify and score multi-word expressions, or adjacent fixed-length collocations, from text.
textstat_collocations(x, method = "lambda", size = 2, min_count = 2,
  smoothing = 0.5, tolower = TRUE, ...)

is.collocations(x)
x: a character, corpus, or tokens object whose collocations will be scored. The tokens object should include punctuation, and if any words have been removed, these should have been removed with padding = TRUE. While identifying collocations for tokens objects is supported, you will get better results with character or corpus objects, due to the relatively imperfect detection of sentence boundaries from texts that have already been tokenized.

method: association measure for detecting collocations. Currently this is limited to "lambda". See Details.

size: integer; the length of the collocations to be scored.

min_count: numeric; minimum frequency of collocations that will be scored.

smoothing: numeric; a smoothing parameter added to the observed counts (default is 0.5).

tolower: logical; if TRUE, form collocations as lower-cased combinations.
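As a quick sketch of the size and tolower arguments (the exact rows returned will depend on min_count and on the tokenizer):

txt <- "The United States flag flies where the United States claims soil."
# default: collocations are formed as lower-cased combinations
textstat_collocations(txt, size = 2, min_count = 2)
# preserve case, e.g. when mining capitalized proper nouns
textstat_collocations(txt, size = 2, min_count = 2, tolower = FALSE)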
textstat_collocations returns a data.frame of collocations and their scores and statistics. This consists of the collocations, their counts, length, and the lambda and z statistics described in Details. When size is a vector, count_nested counts the lower-order collocations that occur within a higher-order collocation (but this does not affect the statistics).
is.collocations returns TRUE if the object is of class collocations, FALSE otherwise.
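A brief sketch of inspecting the returned object; the column names in the comments follow the description above:

colls <- textstat_collocations(data_corpus_inaugural[1:2], size = 2:3)
# one row per scored collocation; count_nested is filled in because
# size is a vector here
names(colls)
head(colls[order(colls$z, decreasing = TRUE), ], 5)  # strongest by Wald z
is.collocations(colls)                               # TRUE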
Documents are grouped for the purposes of scoring, but collocations will not span sentences. If x is a tokens object and some tokens have been removed, this should be done using tokens_remove(x, pattern, padding = TRUE), so that counts will still be accurate but the pads will prevent those collocations from being scored.
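To see why the padding matters, compare the two removal strategies on a toy example (a sketch using only the functions named above):

toks <- tokens("the United States of America")
# with padding, removed tokens leave empty placeholders, so "States" and
# "America" stay non-adjacent and cannot form a spurious collocation
tokens_remove(toks, stopwords("english"), padding = TRUE)
# without padding, "States" and "America" become adjacent even though
# they never occurred next to each other in the text
tokens_remove(toks, stopwords("english"), padding = FALSE)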
The lambda computed for a size = K-word target multi-word expression is the coefficient of the K-way interaction parameter in the saturated log-linear model fitted to the counts of the terms forming the set of eligible multi-word expressions. This is the same as the "lambda" computed in Blaheta and Johnson (2001), where all multi-word expressions are considered (rather than just verbs, as in that paper). The z is the Wald z-statistic, computed as the quotient of lambda and its standard error, as described below.

In detail:

Consider a K-word target expression x, and let z be any K-word expression. Define a comparison function c(x, z) = (j_1, ..., j_K) = (z_1 != x_1, ..., z_K != x_K), so that the kth element of c is 1 if the kth word of z differs from the kth word of x, and 0 otherwise. Let c_i, for i = 1, ..., 2^K = M, be the possible values of c(x, z), with c_M = (0, 0, ..., 0), and let n_i be the number of K-word expressions in the corpus whose comparison with x equals c_i, plus the smoothing constant. The n_i are then the cell counts of a 2^K contingency table indexed by the comparison patterns c_i.

The statistic lambda is the K-way interaction parameter of the saturated log-linear model fitted to the n_i:

\lambda = \sum_{i=1}^{M} (-1)^{K - b_i} \log n_i

where b_i is the number of elements of c_i equal to 0 (so the target pattern c_M always enters with a positive sign).

The Wald test z-statistic is then

z = \lambda \Big/ \left( \sum_{i=1}^{M} n_i^{-1} \right)^{1/2}
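As a concrete check of these formulas for the bigram case (K = 2, M = 4), the following sketch computes lambda and z directly from the four cell counts; the function name and the counts are illustrative only:

# Comparison patterns for K = 2:
#   c = (0,0): the target bigram itself
#   c = (0,1): first word matches, second differs
#   c = (1,0): first word differs, second matches
#   c = (1,1): both words differ
lambda_bigram <- function(n00, n01, n10, n11, smoothing = 0.5) {
  n <- c(n00, n01, n10, n11) + smoothing                    # smoothed counts
  lambda <- log(n[1]) - log(n[2]) - log(n[3]) + log(n[4])   # (-1)^(K - b_i) signs
  z <- lambda / sqrt(sum(1 / n))                            # Wald z-statistic
  c(lambda = lambda, z = z)
}
# illustrative counts: the target bigram is observed 12 times
lambda_bigram(n00 = 12, n01 = 20, n10 = 18, n11 = 950)

For size = 2 this reduces to the (smoothed) log odds ratio of the 2 x 2 bigram table, divided by its standard error to obtain z.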
Blaheta, D., & Johnson, M. (2001). Unsupervised learning of multi-word verbs. Presented at the ACL/EACL Workshop on the Computational Extraction, Analysis and Exploitation of Collocations.
# NOT RUN {
txts <- data_corpus_inaugural[1:2]
head(cols <- textstat_collocations(txts, size = 2, min_count = 2), 10)
head(cols <- textstat_collocations(txts, size = 3, min_count = 2), 10)
# extracting multi-part proper nouns (capitalized terms)
toks2 <- tokens(data_corpus_inaugural)
toks2 <- tokens_remove(toks2, stopwords("english"), padding = TRUE)
toks2 <- tokens_select(toks2, "^([A-Z][a-z\\-]{2,})", valuetype = "regex",
                       case_insensitive = FALSE, padding = TRUE)
seqs <- textstat_collocations(toks2, size = 3, tolower = FALSE)
head(seqs, 10)
# vectorized size
txt <- c(". . . . a b c . . a b c . . . c d e",
"a b . . a b . . a b . . a b . a b",
"b c d . . b c . b c . . . b c")
textstat_collocations(txt, size = 2:3)
# }