document_term_matrix: Create a document/term matrix from a data.frame with 1 row per document/term

Description

Create a document/term matrix from a data.frame with 1 row per document/term as returned by document_term_frequencies

Usage

document_term_matrix(x, vocabulary, weight = "freq", ...)
# S3 method for data.frame
document_term_matrix(x, vocabulary, weight = "freq", ...)
# S3 method for DocumentTermMatrix
document_term_matrix(x, ...)
# S3 method for TermDocumentMatrix
document_term_matrix(x, ...)
# S3 method for simple_triplet_matrix
document_term_matrix(x, ...)

Arguments

a data.frame with columns doc_id, term and freq indicating how many times a term occurred in that specific document. This is what document_term_frequencies returns. This data.frame will be reshaped to a matrix with 1 row per doc_id, the terms will be put in the columns and the freq in the matrix cells. Note that the column name to use for freq can be set in the weight argument.

vocabulary

a character vector of terms which should be present in the document term matrix even if they did not occur in x

weight

a column of x indicating what to put in the matrix cells. Defaults to 'freq' indicating to use column freq from x to put into the matrix cells

...

further arguments currently not used

Value

an sparse object of class dgCMatrix with in the rows the documents and in the columns the terms containing the frequencies provided in x extended with terms which were not in x but were provided in vocabulary. The rownames of this resulting object contain the doc_id from x

Methods (by class)

data.frame: Construct a document term matrix from a data.frame with columns doc_id, term, freq
DocumentTermMatrix: Convert an object of class DocumentTermMatrix from the tm package to a sparseMatrix
TermDocumentMatrix: Convert an object of class TermDocumentMatrix from the tm package to a sparseMatrix with the documents in the rows and the terms in the columns
simple_triplet_matrix: Convert an object of class simple_triplet_matrix from the slam package to a sparseMatrix

Examples

Run this code

# NOT RUN {
x <- data.frame(doc_id = c(1, 1, 2, 3, 4), 
 term = c("A", "C", "Z", "X", "G"), 
 freq = c(1, 5, 7, 10, 0))
document_term_matrix(x)
document_term_matrix(x, vocabulary = LETTERS)

## Example on larger dataset
data(brussels_reviews_anno)
x <- document_term_frequencies(brussels_reviews_anno[, c("doc_id", "lemma")])
dtm <- document_term_matrix(x)
dim(dtm)
x <- document_term_frequencies(brussels_reviews_anno[, c("doc_id", "lemma")])
x <- document_term_frequencies_statistics(x)
dtm <- document_term_matrix(x)
dtm <- document_term_matrix(x, weight = "freq")
dtm <- document_term_matrix(x, weight = "tf_idf")
dtm <- document_term_matrix(x, weight = "bm25")

## example showing the vocubulary argument
## allowing you to making sure terms which are not in the data are provided in the resulting dtm
allterms <- unique(x$term)
dtm <- document_term_matrix(head(x, 1000), vocabulary = allterms)

##
## Example adding bigrams/trigrams to the document term matrix
## Mark that this can also be done using ?dtm_cbind
##
library(data.table)
x <- as.data.table(brussels_reviews_anno)
x <- x[, token_bigram := txt_nextgram(token, n = 2), by = list(doc_id, sentence_id)]
x <- x[, token_trigram := txt_nextgram(token, n = 3), by = list(doc_id, sentence_id)]
x <- document_term_frequencies(x = x, 
                               document = "doc_id", 
                               term = c("token", "token_bigram", "token_trigram"))
dtm <- document_term_matrix(x)
# }