dictionary_dtm: Making DTM/TDM for Groups of Words

Description

A dictionary contains several groups of words. Sometimes we are interested not in the frequency of a single word, but in the total frequency of all words that belong to the same group. Given a dictionary, this function sums the frequencies of each group of words, so you do not need to do it manually.

Usage

dictionary_dtm(
  x,
  dictionary,
  type = "dtm",
  simple_sum = FALSE,
  return_dictionary = FALSE,
  checks = TRUE
)

Arguments

x

an object of class DocumentTermMatrix or TermDocumentMatrix created by corp_or_dtm, tm::DocumentTermMatrix or tm::TermDocumentMatrix. It can also be a numeric matrix, in which case you have to specify its type (see type below).

dictionary

a dictionary telling the function how the words are grouped. It can be a list, matrix, data.frame or character vector. See Details for how to set this argument.

type

if x is a matrix, you have to tell the function whether it represents a document term matrix or a term document matrix: a character string starting with "D" or "d" for a document term matrix, or with "T" or "t" for a term document matrix. The default is "dtm".

simple_sum

if FALSE (default), a DTM/TDM is returned. If TRUE, you will not see the frequency of each group in each text. Instead, a numeric vector is returned, each element of which is the total frequency of the corresponding group of words in the corpus as a whole.

return_dictionary

if TRUE, a modified dictionary is returned, which only contains words that do exist in the DTM/TDM. The default is FALSE.

checks

The default is TRUE. The function then checks whether x and dictionary are valid. If dictionary is not a list of character vectors, the function will try to convert it. Do not set this to FALSE unless you are sure that your input is valid.

Value

if return_dictionary = FALSE, an object of class DocumentTermMatrix or TermDocumentMatrix is returned; if TRUE, a list is returned, whose 1st element is the DTM/TDM and whose 2nd element is a named list of words. However, if simple_sum = TRUE, the DTM/TDM in both cases is replaced by a numeric vector.
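
The relationship between the grouped DTM/TDM and the vector described for simple_sum can be pictured with a small base R sketch. It uses plain matrices rather than tm objects, and the object names (grouped_dtm, grouped_tdm) are made up for illustration; it is not the package's own code, only a sketch assuming the vector holds one corpus-wide total per group, as described above.

```r
# Toy grouped result in DTM orientation:
# documents in rows, groups of words in columns.
grouped_dtm <- matrix(
  c(1, 2,
    3, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc1", "doc2"), c("consume", "food"))
)

# The same counts in TDM orientation: groups in rows, documents in columns.
grouped_tdm <- t(grouped_dtm)

# Corpus-wide total per group: sum over documents in either orientation.
totals_from_dtm <- colSums(grouped_dtm)
totals_from_tdm <- rowSums(grouped_tdm)
```

Both orientations give one named number per group, which is why a plain matrix input needs the type argument: the function has to know which dimension holds the documents.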

Details

The argument dictionary can be set in different ways:

(1) list: each element represents a group of words. Each element should be a character vector; if it is not, the function will try to convert it. The element must have length > 0 and contain at least 1 non-NA word.
(2) matrix or data.frame: each entry should be character; if it is not, the function will try to convert it. At least one of the entries must not be NA. Each column (not row) represents a group of words.
(3) character vector: it represents a single group of words.
(4) Note: you do not need to worry about the same word appearing twice in a group, because the function counts it only once. Nor do you need to worry about words in a group that do not exist in the DTM/TDM, because the function simply ignores them. If none of the words of a group exists, the group still appears in the final result, with all of its frequencies equal to 0. By setting return_dictionary = TRUE, you can see which words do exist.
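
The rules above can be illustrated with a minimal base R sketch on a plain numeric matrix. The helper sum_groups and the toy data are hypothetical and are not part of the package; the sketch only mirrors the documented behaviour (conversion to character, duplicates counted once, NA and non-existent words ignored, empty groups kept as zeros) and says nothing about how dictionary_dtm is actually implemented.

```r
# Toy document-term matrix: documents in rows, terms in columns.
m <- matrix(
  c(1, 0, 2,
    0, 3, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc1", "doc2"), c("drink", "eat", "cake"))
)

# Hypothetical helper (an assumption for illustration only).
sum_groups <- function(m, dictionary) {
  out <- vapply(dictionary, function(words) {
    words <- unique(as.character(words))    # convert factors, drop duplicates
    words <- words[!is.na(words)]           # drop NA entries
    found <- intersect(words, colnames(m))  # keep only words present in m
    rowSums(m[, found, drop = FALSE])       # all zeros if nothing was found
  }, FUN.VALUE = numeric(nrow(m)))
  rownames(out) <- rownames(m)
  out
}

dict <- list(
  consume = c("drink", "eat", NA, "drink"),  # NA and a duplicate
  food    = factor(c("cake", "pizza")),      # factor, one non-existent word
  absent  = c("tiger", "dream")              # no word exists in m
)

grouped <- sum_groups(m, dict)
```

In this toy case the absent group is kept as a column of zeros rather than dropped, matching the note in (4).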

Examples


# NOT RUN {
x <- c(
  "Hello, what do you want to drink and eat?", 
  "drink a bottle of milk", 
  "drink a cup of coffee", 
  "drink some water", 
  "eat a cake", 
  "eat a piece of pizza"
)
dtm <- corp_or_dtm(x, from = "v", type = "dtm")
# Use `=` (not `<-`) inside list() so that the groups are named.
D1 <- list(
  aa = c("drink", "eat"),
  bb = c("cake", "pizza"),
  cc = c("cup", "bottle")
)
y1 <- dictionary_dtm(dtm, D1, return_dictionary = TRUE)
#
# NA, duplicated words, non-existent words, 
# non-character elements do not affect the
# result.
D2 <- list(
  has_na = c("drink", "eat", NA),
  this_is_factor = factor(c("cake", "pizza")),
  this_is_duplicated = c("cup", "bottle", "cup", "bottle"),
  do_not_exist = c("tiger", "dream")
)
y2 <- dictionary_dtm(dtm, D2, return_dictionary = TRUE)
#
# You can read a data.frame dictionary
# from a csv file.
# Each column represents a group.
D3 <- data.frame(
  aa = c("drink", "eat", NA, NA),
  bb = c("cake", "pizza", NA, NA),
  cc = c("cup", "bottle", NA, NA),
  dd = c("do", "to", "of", "and")
)
y3 <- dictionary_dtm(dtm, D3, simple_sum = TRUE)
#
# If it is a matrix:
mt <- t(as.matrix(dtm))
y4 <- dictionary_dtm(mt, D3, type = "t", return_dictionary = TRUE)
# }
