phrasetotoken: convert phrases into single tokens

Description

Replaces multi-word phrases in one or more texts with compound versions in which the words are joined by concatenator (by default, the "_" character), so that each phrase forms a single token. Removing the whitespace delimiter prevents the phrase from being split apart during subsequent tokenization.
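
The idea can be sketched in base R, without quanteda. The helper join_phrases below is hypothetical and is not the package's implementation; it assumes phrases made only of letters and single spaces, matches them case-insensitively, and keeps the capitalisation found in the text.

```r
# Hypothetical helper, for illustration only (not quanteda's implementation).
# Assumes phrases contain only letters and single spaces, with at most 9 words
# each, because the replacement uses the backreferences \1 to \9.
join_phrases <- function(texts, phrases, concatenator = "_") {
  # Handle longer phrases first so they are not pre-empted by shorter ones
  phrases <- phrases[order(nchar(phrases), decreasing = TRUE)]
  for (p in phrases) {
    words <- strsplit(p, " ", fixed = TRUE)[[1]]
    # e.g. "\\b(capital) +(gains) +(tax)\\b"
    pattern <- paste0("\\b", paste0("(", words, ")", collapse = " +"), "\\b")
    # e.g. "\\1_\\2_\\3", which keeps the capitalisation of each matched word
    replacement <- paste0("\\", seq_along(words), collapse = concatenator)
    texts <- gsub(pattern, replacement, texts, ignore.case = TRUE, perl = TRUE)
  }
  texts
}

txt <- "A capital gains tax and an Income Tax."
joined <- join_phrases(txt, c("income tax", "capital gains tax"))
joined

# Splitting on whitespace now keeps each phrase together as one token
tokens <- strsplit(joined, "[[:space:]]+")[[1]]
tokens
```

Because the whitespace inside each phrase is gone, a whitespace-based tokenizer has no place to split it.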

Usage

phrasetotoken(object, phrases, ...)
"phrasetotoken"(object, phrases, ...)
"phrasetotoken"(object, phrases, ...)
"phrasetotoken"(object, phrases, ...)
"phrasetotoken"(object, phrases, concatenator = "_", valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE, ...)

Arguments

object

source texts, a character or character vector

phrases

a dictionary object that contains some phrases, defined as multiple words delimited by whitespace, up to 9 words long; or a quanteda collocation object created by collocations

...

additional arguments passed through to core "character,character" method

concatenator

the concatenation character that will connect the words making up the multi-word phrases. The default "_" is highly recommended, since it will not be removed during normal cleaning and tokenization (while nearly all other punctuation characters, at least those in the Unicode punctuation class [P], will be removed).

valuetype

how to interpret word matching patterns: "glob" for "glob"-style wildcarding; "regex" for regular expressions; "fixed" for words as is

case_insensitive

if TRUE, ignore case when matching
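
The difference between the three valuetype settings, and the effect of case_insensitive, can be illustrated on single words with base R. This is only an analogy under stated assumptions: it uses glob2rx() from utils and grepl(), treats "fixed" as an exact whole-word comparison, and does not call quanteda.

```r
words <- c("tax", "taxes", "Taxation", "syntax")

# "glob": "*" stands for any run of characters; glob2rx() (from utils)
# translates the wildcard pattern into an anchored regular expression
glob_hits <- grepl(glob2rx("tax*"), words)

# "regex": the pattern is used as a regular expression as written
regex_hits <- grepl("^tax(es)?$", words)

# "fixed": the pattern must equal the word exactly (assumed whole-word match)
fixed_hits <- words == "tax"

# case_insensitive = TRUE corresponds roughly to ignore.case = TRUE
glob_hits_ci <- grepl(glob2rx("tax*"), words, ignore.case = TRUE)

data.frame(words, glob_hits, regex_hits, fixed_hits, glob_hits_ci)
```

In this sketch only the case-insensitive glob picks up "Taxation", and none of the patterns match "syntax", because each is anchored at the start of the word.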

Value

character or character vector of texts with phrases replaced by compound "words" joined by the concatenator

Examples

mytexts <- c("The new law included a capital gains tax, and an inheritance tax.",
             "New York City has raised taxes: an income tax and a sales tax.")
mydict <- dictionary(list(tax=c("tax", "income tax", "capital gains tax", "inheritance tax")))
(cw <- phrasetotoken(mytexts, mydict))
dfm(cw, verbose=FALSE)

# when used as a dictionary for dfm creation
mydfm2 <- dfm(cw, dictionary = lapply(mydict, function(x) gsub(" ", "_", x)))
mydfm2
# to pick up "taxes" in the second text, set valuetype = "regex"
mydfm3 <- dfm(cw, dictionary = lapply(mydict, phrasetotoken, mydict),
              valuetype = "regex")
mydfm3
## one more token counted for "tax" than before

# using a dictionary to pre-process multi-word expressions
myDict <- dictionary(list(negative = c("bad* word*", "negative", "awful text"),
                          positive = c("good stuff", "like? th??")))
txt <- c("I liked this, when we can use bad words, in awful text.",
         "Some damn good stuff, like the text, she likes that too.")
phrasetotoken(txt, myDict)

# on simple text
phrasetotoken("This is a simpler version of multi word expressions.", "multi word expression*")
