Replace multi-token sequences with a multi-word, or "compound", token. The
resulting compound tokens will represent a phrase or multi-word expression,
concatenated with concatenator (by default, the "_" character) to form a
single "token". This ensures that the sequences will be processed
subsequently as single tokens, for instance in constructing a dfm.
tokens_compound(x, sequences, concatenator = "_", valuetype = c("glob",
"regex", "fixed"), case_insensitive = TRUE, join = FALSE)
x: an input tokens object
sequences: the input sequence, one of:
    character vector, whose elements will be split on whitespace;
    list of characters, consisting of a list of token patterns, separated by whitespace;
    tokens object;
    dictionary object;
    collocations object.
concatenator: the concatenation character that will connect the words
making up the multi-word sequences. The default "_" is highly
recommended since it will not be removed during normal cleaning and
tokenization (while nearly all other punctuation characters, at least those
in the Unicode punctuation class [P], will be removed).
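As a small illustration of the concatenator argument, the sketch below compounds the same sequence once with the default "_" and once with "+". It assumes the quanteda package is installed; the object names (toks, seq_tax, default_cw, plus_cw) and the input sentence are made up for this example. The sequence is passed positionally because recent quanteda releases name the second argument pattern rather than sequences.

```r
library(quanteda)

# hypothetical one-sentence input, tokenized with punctuation removed
toks <- tokens("The new law included a capital gains tax.", remove_punct = TRUE)

# one three-word sequence, given as a list of character vectors
seq_tax <- list(c("capital", "gains", "tax"))

# compound with the default concatenator "_"
default_cw <- tokens_compound(toks, seq_tax)

# compound with a custom concatenator "+"
plus_cw <- tokens_compound(toks, seq_tax, concatenator = "+")

print(default_cw)
print(plus_cw)
```

If the compounded tokens are later re-tokenized or cleaned with punctuation removal, a concatenator such as "+" may be stripped, which is the reason given above for preferring "_".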
valuetype: how to interpret keyword expressions: "glob" for "glob"-style
wildcard expressions; "regex" for regular expressions; or "fixed" for exact
matching. See valuetype for details.
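The three valuetype settings can be compared on one short text. This is a sketch rather than an authoritative reference: it assumes quanteda is installed, the sentence and object names (toks, fixed_cw, glob_cw, regex_cw) are invented for illustration, and the sequences are passed positionally as lists of character vectors.

```r
library(quanteda)

# hypothetical input containing the plural "taxes"
toks <- tokens("An income tax and inheritance taxes.", remove_punct = TRUE)

# "fixed": each element must equal a token exactly, so "tax" does not match "taxes"
fixed_cw <- tokens_compound(toks, list(c("inheritance", "tax")),
                            valuetype = "fixed")

# "glob": "*" stands for any run of characters, so "tax*" also matches "taxes"
glob_cw <- tokens_compound(toks, list(c("inheritance", "tax*")),
                           valuetype = "glob")

# "regex": each element is a regular expression; "^tax" matches tokens
# that start with "tax"
regex_cw <- tokens_compound(toks, list(c("inheritance", "^tax")),
                            valuetype = "regex")

print(fixed_cw)
print(glob_cw)
print(regex_cw)
```

Because regular expressions match anywhere in a token unless anchored, anchoring with "^" or "$" is usually worth the extra characters when valuetype = "regex".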
case_insensitive: logical; if TRUE, ignore case when matching
join: logical; if TRUE, join overlapping compounds
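The effect of join is easiest to see when two sequences share a token. The following sketch assumes quanteda is installed; the toy text and the names toks, overlapping, separate_cw, and joined_cw are hypothetical. Both settings are passed explicitly, since the default may differ between quanteda versions.

```r
library(quanteda)

# toy input in which the sequences "a b" and "b c" overlap on the token "b"
toks <- tokens("a b c d")
overlapping <- list(c("a", "b"), c("b", "c"))

# join = FALSE: overlapping matches are compounded separately
separate_cw <- tokens_compound(toks, overlapping, join = FALSE)

# join = TRUE: overlapping matches are merged into a single longer compound
joined_cw <- tokens_compound(toks, overlapping, join = TRUE)

print(separate_cw)
print(joined_cw)
```

Printing both results side by side is a quick way to check which behavior a given installation applies before compounding a full corpus.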
a tokens object in which the token sequences matching the patterns
in sequences have been replaced by compound "tokens" joined by the concatenator
# NOT RUN {
library(quanteda)
mytexts <- c("The new law included a capital gains tax, and an inheritance tax.",
"New York City has raised taxes: an income tax and inheritance taxes.")
mytoks <- tokens(mytexts, remove_punct = TRUE)
# for lists of sequence elements
myseqs <- list(c("tax"), c("income", "tax"), c("capital", "gains", "tax"), c("inheritance", "tax"))
(cw <- tokens_compound(mytoks, myseqs))
dfm(cw)
# when used as a dictionary for dfm creation
mydict <- dictionary(list(tax=c("tax", "income tax", "capital gains tax", "inheritance tax")))
(cw2 <- tokens_compound(mytoks, mydict))
# to pick up "taxes" in the second text, set valuetype = "regex"
(cw3 <- tokens_compound(mytoks, mydict, valuetype = "regex"))
# dictionaries w/glob matches
myDict <- dictionary(list(negative = c("bad* word*", "negative", "awful text"),
positive = c("good stuff", "like? th??")))
toks <- tokens(c(txt1 = "I liked this, when we can use bad words, in awful text.",
txt2 = "Some damn good stuff, like the text, she likes that too."))
tokens_compound(toks, myDict)
# with collocations
#cols <- textstat_collocations("capital gains taxes are worse than inheritance taxes",
# size = 2, min_count = 1)
#toks <- tokens("The new law included capital gains taxes and inheritance taxes.")
#tokens_compound(toks, cols)
# }