Given an annotation object, this function returns the term frequency-inverse document frequency (tf-idf) matrix from the extracted lemmas. A data frame with a document id column and a token column can also be given, which allows the user to preprocess and filter the desired tokens to include.
cnlp_utils_tfidf(object, type = c("tfidf", "tf", "idf", "vocab", "all"),
tf_weight = c("lognorm", "binary", "raw", "dnorm"),
idf_weight = c("idf", "smooth", "prob"), min_df = 0.1,
max_df = 0.9, max_features = 10000, doc_var = c("doc_id", "id"),
token_var = "lemma", vocabulary = NULL, doc_set = NULL)

cnlp_utils_tf(object, type = "tf", tf_weight = "raw", ...)
object: either an annotation object or a data frame with columns equal to the inputs given to doc_var and token_var
type: the desired return type. The options tfidf, tf, and idf return a list with the desired matrix, the document ids, and the vocabulary set. The option all returns a list with all three matrices as well as the ids and vocabulary. For consistency, vocab also returns a list, but it contains only the ids and vocabulary set.
tf_weight: the weighting scheme for the term frequency matrix. The selection lognorm takes one plus the log of the raw frequency (or zero if the raw frequency is zero), binary encodes a zero-one matrix indicating simply whether the token exists at all in the document, raw returns the raw counts, and dnorm uses double normalization.
idf_weight: the weighting scheme for the inverse document frequency matrix. The selection idf gives the logarithm of the simple inverse frequency, smooth gives the logarithm of one plus the simple inverse frequency, and prob gives the log odds of the token occurring in a randomly selected document.
min_df: the minimum proportion of documents a token must appear in to be included in the vocabulary
max_df: the maximum proportion of documents a token may appear in to be included in the vocabulary
max_features: the maximum number of tokens in the vocabulary
doc_var: character vector. The name of the column in object that contains the document ids, unless object is an annotation object, in which case it is the column of the token matrix to use as the document id.
token_var: character vector. The name of the column in object that contains the tokens, unless object is an annotation object, in which case it is the column of the token matrix to use as the tokens (generally either lemma or word).
vocabulary: character vector. The vocabulary set to use in constructing the matrices. Will be computed within the function if set to NULL. When supplied, the options min_df, max_df, and max_features are ignored.
doc_set: optional character vector of document ids. Useful to create empty rows in the output matrix for documents without data in the input. Most users will want to keep this equal to NULL, the default, to have the function compute the document set automatically.
...: other arguments passed to the base method
a sparse matrix with dimnames or, if type = "all", a list with elements:
tf: the term frequency matrix
idf: the inverse document frequency matrix
tfidf: the product of the tf and idf matrices
vocab: a character vector giving the vocabulary used in the function, corresponding to the columns of the matrices
id: a vector of the document ids, corresponding to the rows of the matrices
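The weighting schemes described above can be sketched numerically. The following is a minimal base R illustration on a toy count matrix, not the package's internal code; the variable names (tf_raw, df, and so on) are ours, and the formulas follow the descriptions of the tf_weight and idf_weight arguments.

```r
# Toy document-term count matrix: 3 documents (rows), 3 tokens (columns)
tf_raw <- matrix(c(3, 0, 1,
                   0, 2, 1,
                   1, 1, 0), nrow = 3, byrow = TRUE)

# tf_weight options applied to the raw counts
tf_lognorm <- ifelse(tf_raw > 0, 1 + log(tf_raw), 0)  # "lognorm"
tf_binary  <- (tf_raw > 0) * 1                        # "binary"; "raw" is tf_raw itself

# idf_weight options; n = number of documents, df = document frequency
n  <- nrow(tf_raw)
df <- colSums(tf_raw > 0)
idf_plain  <- log(n / df)         # "idf": log of the simple inverse frequency
idf_smooth <- log(1 + n / df)     # "smooth": log of one plus the inverse frequency
idf_prob   <- log((n - df) / df)  # "prob": log odds of occurring in a random document

# The tf-idf matrix scales each column of the tf matrix by its idf weight
tfidf <- sweep(tf_lognorm, 2, idf_plain, `*`)
```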
# NOT RUN {
require(dplyr)
data(obama)
# Top words in the first Obama S.O.T.U., using all tokens
tfidf <- cnlp_utils_tfidf(obama)
vids <- order(tfidf[1,], decreasing = TRUE)[1:10]
colnames(tfidf)[vids]
# Top words, only using non-proper nouns
tfidf <- cnlp_get_token(obama) %>%
filter(pos %in% c("NN", "NNS")) %>%
cnlp_utils_tfidf()
vids <- order(tfidf[1,], decreasing = TRUE)[1:10]
colnames(tfidf)[vids]
# }