as.TermDocumentMatrix: Generate TermDocumentMatrix / DocumentTermMatrix.

Description

Methods to generate objects of class TermDocumentMatrix or DocumentTermMatrix, as defined in the tm package. Document-term matrices have many text mining applications. For instance, the topicmodels package requires a DocumentTermMatrix as input.

Usage

as.TermDocumentMatrix(x, ...)
as.DocumentTermMatrix(x, ...)
# S4 method for character
as.TermDocumentMatrix(x, p_attribute, s_attribute, verbose = TRUE, ...)
# S4 method for corpus
as.DocumentTermMatrix(
  x,
  p_attribute,
  s_attribute,
  stoplist = NULL,
  binarize = FALSE,
  verbose = TRUE,
  ...
)
# S4 method for character
as.DocumentTermMatrix(x, p_attribute, s_attribute, verbose = TRUE, ...)
# S4 method for bundle
as.TermDocumentMatrix(x, col, p_attribute = NULL, verbose = TRUE, ...)
# S4 method for bundle
as.DocumentTermMatrix(x, col = NULL, p_attribute = NULL, verbose = TRUE, ...)
# S4 method for partition_bundle
as.DocumentTermMatrix(x, p_attribute = NULL, col = NULL, verbose = TRUE, ...)
# S4 method for partition_bundle
as.TermDocumentMatrix(x, p_attribute = NULL, col = NULL, verbose = TRUE, ...)
# S4 method for subcorpus_bundle
as.TermDocumentMatrix(x, p_attribute = NULL, verbose = TRUE, ...)
# S4 method for subcorpus_bundle
as.DocumentTermMatrix(x, p_attribute = NULL, verbose = TRUE, ...)
# S4 method for context
as.DocumentTermMatrix(x, p_attribute, verbose = TRUE, ...)
# S4 method for context
as.TermDocumentMatrix(x, p_attribute, verbose = TRUE, ...)

Value

A TermDocumentMatrix or a DocumentTermMatrix object. Both classes are defined in the tm package and inherit from the simple_triplet_matrix class defined in the slam package.
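To make the return value more concrete, here is a minimal base R sketch of the triplet layout (row index i, column index j, value v) that a simple_triplet_matrix uses, and of how a term-document layout relates to a document-term layout by transposition. The toy counts and the helper to_dense() are made up for illustration; the sketch does not call tm or slam and is not how the methods documented here are implemented.

```r
# Toy counts: one row per (term, document) pair with a non-zero count.
# All values are invented for illustration.
counts <- data.frame(
  term = c("oil", "price", "oil", "crude"),
  doc  = c("d1", "d1", "d2", "d2"),
  n    = c(3L, 1L, 2L, 4L),
  stringsAsFactors = FALSE
)

terms <- sort(unique(counts$term))
docs  <- sort(unique(counts$doc))

# Triplet representation: only non-zero cells are stored.
# Terms are in rows and documents in columns (term-document layout).
tdm_triplet <- list(
  i = match(counts$term, terms),
  j = match(counts$doc, docs),
  v = counts$n,
  nrow = length(terms),
  ncol = length(docs),
  dimnames = list(Terms = terms, Docs = docs)
)

# Hypothetical helper: expand the triplets into a dense matrix for inspection.
to_dense <- function(m) {
  out <- matrix(0L, nrow = m$nrow, ncol = m$ncol, dimnames = m$dimnames)
  out[cbind(m$i, m$j)] <- m$v
  out
}

tdm_dense <- to_dense(tdm_triplet)  # terms x documents
dtm_dense <- t(tdm_dense)           # documents x terms
```

The two layouts hold the same counts. Which one is needed depends on the downstream tool.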

Arguments

x: A character vector indicating a corpus, or an object of class bundle, or inheriting from class bundle (e.g. partition_bundle).
...: Definitions of s-attributes used for subsetting the corpus; compare the partition method.
p_attribute: The p-attribute that counting is based on.
s_attribute: An s-attribute that defines the content of the columns (or rows).
verbose: A logical value, whether to output progress messages.
stoplist: A character vector of tokens to exclude from the matrix, as a memory-efficient way to exclude irrelevant terms early on.
binarize: A logical value. If TRUE, report the occurrence of a term, not its absolute count.
col: The column of the data.table in slot stat (if x is a bundle) to use for assembling the matrix.
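The effect of the stoplist and binarize arguments can be illustrated on a small dense matrix in base R. This is only a conceptual sketch with invented counts. It assumes that binarize maps every non-zero count to 1, and it does not reproduce how the methods apply these arguments internally.

```r
# Toy document-term counts (documents in rows, terms in columns).
counts <- matrix(
  c(3L, 0L, 1L,
    2L, 5L, 0L),
  nrow = 2, byrow = TRUE,
  dimnames = list(Docs = c("d1", "d2"), Terms = c("oil", "the", "price"))
)

# stoplist: drop the columns of unwanted terms.
stoplist <- c("the", "of")
kept <- counts[, !colnames(counts) %in% stoplist, drop = FALSE]

# binarize: report only whether a term occurs (1) or not (0).
binary <- (kept > 0L) * 1L
```

Here kept retains the columns "oil" and "price", and binary records only whether each term occurs in each document.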

Author

Andreas Blaette

Details

If x refers to a corpus (i.e. is a length-one character vector), a TermDocumentMatrix or DocumentTermMatrix will be generated for subsets of the corpus based on the s_attribute provided. Counts are performed for the p_attribute. Further parameters (passed in as ...) are interpreted as s-attributes that define a subset of the corpus, which is then split according to s_attribute. If the struc values for s_attribute are not unique, the necessary aggregation is performed, which slows things down somewhat.
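The aggregation step mentioned above can be pictured as follows. When the same s-attribute value labels more than one region of the corpus, the per-region counts for a term have to be summed to obtain a single cell per document and term. The base R sketch below uses invented data and illustrates the idea only; it is not the code the method runs.

```r
# Per-region counts. The value "A" labels two separate regions,
# so "oil" is counted twice for document "A".
region_counts <- data.frame(
  doc  = c("A", "A", "B", "A", "A"),
  term = c("oil", "price", "oil", "oil", "crude"),
  n    = c(3L, 1L, 2L, 2L, 4L),
  stringsAsFactors = FALSE
)

# Sum counts so that each (doc, term) pair occurs exactly once.
aggregated <- aggregate(n ~ doc + term, data = region_counts, FUN = sum)
```

After aggregation, each combination of document and term appears in exactly one row, which is what a sparse matrix cell requires.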

If x is a bundle or a class inheriting from it, the counts or whatever measure is present in the stat slots (in the column indicated by col) will be turned into the values of the sparse matrix that is generated. A special case is the generation of the sparse matrix based on a partition_bundle that does not yet include counts. In this case, a p_attribute needs to be provided. Then counting will be performed, too.

If x is a partition_bundle and argument col is not NULL, a TermDocumentMatrix is generated based on the column indicated by col of the data.table with counts in the stat slots of the objects in the bundle. If col is NULL, the p-attribute indicated by p_attribute is decoded, and a count is performed to obtain the values of the resulting TermDocumentMatrix. The same procedure applies to get a DocumentTermMatrix.

If x is a subcorpus_bundle, the p-attribute provided by argument p_attribute is decoded, and a count is performed to obtain the resulting TermDocumentMatrix or DocumentTermMatrix.

Examples


# examples not run by default to save time on CRAN test machines
# \donttest{
library(polmineR)
use(pkg = "RcppCWB", corpus = "REUTERS")
 
# enriching partition_bundle explicitly 
tdm <- corpus("REUTERS") %>% 
  partition_bundle(s_attribute = "id") %>% 
  enrich(p_attribute = "word") %>%
  as.TermDocumentMatrix(col = "count")
   
# leave the counting to the as.TermDocumentMatrix-method
tdm <- partition_bundle("REUTERS", s_attribute = "id") %>% 
  as.TermDocumentMatrix(p_attribute = "word", verbose = FALSE)
  
# obtain TermDocumentMatrix directly (fastest option)
tdm <- as.TermDocumentMatrix(
  "REUTERS",
  p_attribute = "word",
  s_attribute = "id",
  verbose = FALSE
)

# workflow using split()
tdm <- corpus("REUTERS") %>%
  split(s_attribute = "id") %>%
  as.TermDocumentMatrix(p_attribute = "word")
# }
