wfm: Word Frequency Matrix

Description

wfm - Generate a word frequency matrix by grouping variable(s).

wfdf - Generate a word frequency data frame by grouping variable(s).

wfm.expanded - Expand a word frequency matrix to have multiple rows for each word.

wf.combine - Combine words (rows) of a word frequency data frame (wfdf) together.

Usage

wfm(text.var = NULL, grouping.var = NULL, wfdf = NULL,
    output = "raw", stopwords = NULL, char2space = "~~",
    ...)

wfdf(text.var, grouping.var = NULL, stopwords = NULL,
    margins = FALSE, output = "raw", digits = 2,
    char2space = "~~", ...)

wfm.expanded(text.var, grouping.var = NULL, ...)

wf.combine(wf.obj, word.lists, matrix = FALSE)

Arguments

text.var

The text variable.

grouping.var

The grouping variables. Default NULL generates one word list for all text. Also takes a single grouping variable or a list of 1 or more grouping variables.

wfdf

A word frequency data frame given instead of a raw text.var and optional grouping.var. When supplied, wfm converts the word frequency data frame (wfdf) to a word frequency matrix (wfm). Default is NULL.

output

Output type: "raw" (the default, as shown in Usage), "proportion" or "percent".

stopwords

A vector of stop words to remove.

char2space

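A vector of characters to be turned into spaces. The default, "~~", lets a double tilde hold the words of a phrase together while the text is split, as in the examples. If char.keep is NULL, char2space will activate this argument.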

...

Other arguments supplied to strip.

digits

An integer indicating the number of decimal places (round) or significant digits (signif) to be used. Negative values are allowed.

margins

logical. If TRUE, adds margin totals for the grouping variable(s) and for each word.

word.lists

A list of character vectors of words to pass to wf.combine.

matrix

logical. If TRUE returns the output as a wfm rather than a wfdf object.

wf.obj

A wfm or wfdf object.
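
The effect of char2space described above can be imitated in base R. This is only a rough sketch of the idea, not qdap's internal code; the text string and the variable names are made up for illustration.

```r
# Hypothetical one-line text; "~~" marks words that should stay together
txt <- "I~~am talking to Mary~~Smith"

# Splitting on spaces leaves each joined phrase as a single token
tokens <- strsplit(txt, " ", fixed = TRUE)[[1]]

# Turning the marker back into a space restores the readable phrase
tokens <- gsub("~~", " ", tokens, fixed = TRUE)
tokens
```

Here "I am" and "Mary Smith" each end up as one token, which is why they would be counted as a single "word".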

Value

wfm - returns a word frequency of the class matrix.

wfdf - returns a word frequency of the class data.frame with a words column and optional margin sums.

wfm.expanded - returns a matrix similar to a word frequency matrix (wfm) but the rows are expanded to represent the maximum usages of the word and cells are dummy coded to indicate that number of uses.

wf.combine - returns a word frequency matrix (wfm) or dataframe (wfdf) with counts for the combined word.lists merged and remaining terms (else).
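
To make the shape of these return values concrete, the following base R sketch builds a small words-by-groups count matrix and then merges a word list into one row, with everything else summed into an "else" row. It assumes toy data and uses only base functions (strsplit, table, colSums); it approximates what the descriptions above say and is not how qdap computes its results.

```r
# Hypothetical toy data (not the DATA set used in the examples)
txt <- c("you read the book", "I read it", "you liar")
grp <- c("sam", "greg", "sam")

# Split each text element into lowercase words
words <- strsplit(tolower(txt), "\\s+")

# Pair every word with the group of the row it came from
long <- data.frame(
    word  = unlist(words),
    group = rep(grp, lengths(words))
)

# Cross-tabulate: rows are words, columns are groups, cells are raw counts
m <- unclass(table(long$word, long$group))
m

# Merge one word list into a single row and sum the rest into "else",
# similar in spirit to the wf.combine return value described above
yous <- c("you", "your")
combined <- rbind(
    yous   = colSums(m[rownames(m) %in% yous, , drop = FALSE]),
    `else` = colSums(m[!rownames(m) %in% yous, , drop = FALSE])
)
combined
```

The column totals of combined equal the column totals of m, since every word falls into exactly one of the two rows.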

Examples

#word frequency matrix (wfm) example:
with(DATA, wfm(state, list(sex, adult)))[1:15, ]
with(DATA, wfm(state, person))[1:15, ]

#insert double tilde ("~~") to keep phrases together (e.g., first and last name)
alts <- c(" fun", "I ")
state2 <- mgsub(alts, gsub("\\s", "~~", alts), DATA$state)
with(DATA, wfm(state2, list(sex, adult)))[1:18, ]

#word frequency dataframe (wfdf) example:
with(DATA, wfdf(state, list(sex, adult)))[1:15, ]
with(DATA, wfdf(state, person))[1:15, ]

#insert double tilde ("~~") to keep phrases together (e.g., first and last name)
alts <- c(" fun", "I ")
state2 <- mgsub(alts, gsub("\\s", "~~", alts), DATA$state)
with(DATA, wfdf(state2, list(sex, adult)))[1:18, ]

#wfm.expanded example:
z <- wfm(DATA$state, DATA$person)
wfm.expanded(z)[30:45, ] #two "you"s

#wf.combine examples:
#===================
#raw no margins (will work)
x <- wfm(DATA$state, DATA$person)

#raw with margin (will work)
y <- wfdf(DATA$state, DATA$person, margins = TRUE)

WL1 <- c(y[, 1])
WL2 <- list(c("read", "the", "a"), c("you", "your", "you're"))
WL3 <- list(bob = c("read", "the", "a"), yous = c("you", "your", "you're"))
WL4 <- list(bob = c("read", "the", "a"), yous = c("a", "you", "your", "you're"))
WL5 <- list(yous = c("you", "your", "you're"))
WL6 <- list(c("you", "your", "you're"))  #no name so will be called words 1
WL7 <- c("you", "your", "you're")

z <- wfm(DATA$state, DATA$person, output = "proportion")
## wf.combine(z, WL2) #Won't work: z is not a raw frequency matrix
wf.combine(x, WL2) #Works (raw and no margins)
wf.combine(y, WL2) #Works (raw with margins)
wf.combine(y, c("you", "your", "you're"))
wf.combine(y, WL1)
wf.combine(y, WL3)
## wf.combine(y, WL4) #Error
wf.combine(y, WL5)
wf.combine(y, WL6)
wf.combine(y, WL7)

worlis <- c("you", "it", "it's", "no", "not", "we")
y <- wfdf(DATA$state, list(DATA$sex, DATA$adult), margins = TRUE)
z <- wf.combine(y, worlis, matrix = TRUE)

chisq.test(z)
chisq.test(wfm(wfdf = y))
