Given an annotation object, this function returns the term frequency-inverse document frequency (tf-idf) matrix from the extracted lemmas. A data frame with a document id column and a token column can also be given, which allows the user to preprocess and filter the desired tokens to include.
cnlp_utils_tfidf(object, type = c("tfidf", "tf", "idf", "vocab", "all"),
tf_weight = c("lognorm", "binary", "raw", "dnorm"),
idf_weight = c("idf", "smooth", "prob"), min_df = 0.1,
max_df = 0.9, max_features = 10000, doc_var = c("doc_id", "id"),
token_var = "lemma", vocabulary = NULL, doc_set = NULL)

cnlp_utils_tf(object, type = "tf", tf_weight = "raw", ...)
object: either an annotation object or a data frame with columns equal to the inputs given to doc_var and token_var
type: the desired return type. The options tfidf, tf, and idf return a list with the desired matrix, the document ids, and the vocabulary set. The option all returns a list with all three matrices as well as the ids and vocabulary. For consistency, vocab also returns a list, but it contains only the ids and vocabulary set.
tf_weight: the weighting scheme for the term frequency matrix. The selection lognorm takes one plus the log of the raw frequency (or zero if the raw frequency is zero), binary encodes a zero-one matrix indicating simply whether the token exists at all in the document, raw returns the raw counts, and dnorm uses double normalization.
idf_weight: the weighting scheme for the inverse document frequency matrix. The selection idf gives the logarithm of the simple inverse frequency, smooth gives the logarithm of one plus the simple inverse frequency, and prob gives the log odds of the token occurring in a randomly selected document.
min_df: the minimum proportion of documents a token must appear in to be included in the vocabulary
max_df: the maximum proportion of documents a token may appear in to be included in the vocabulary
max_features: the maximum number of tokens in the vocabulary
doc_var: character vector. The name of the column in object that contains the document ids, unless object is an annotation object, in which case it is the column of the token matrix to use as the document id.
token_var: character vector. The name of the column in object that contains the tokens, unless object is an annotation object, in which case it is the column of the token matrix to use as the tokens (generally either lemma or word).
vocabulary: character vector. The vocabulary set to use in constructing the matrices. Will be computed within the function if set to NULL. When supplied, the options min_df, max_df, and max_features are ignored.
doc_set: optional character vector of document ids. Useful to create empty rows in the output matrix for documents without data in the input. Most users will want to keep this equal to NULL, the default, to have the function compute the document set automatically.
...: other arguments passed to the base method
a sparse matrix with dimnames or, if type = "all", a list with elements:
tf: the term frequency matrix
idf: the inverse document frequency matrix
tfidf: the product of the tf and idf matrices
vocab: a character vector giving the vocabulary used in the function, corresponding to the columns of the matrices
id: a vector of the document ids, corresponding to the rows of the matrices
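The weighting schemes described above can be sketched numerically. The following is a minimal base R illustration on a toy count matrix, not the package's internal code; the variable names (tf_raw, df, and so on) are ours, and the formulas follow the descriptions of the tf_weight and idf_weight arguments.

```r
# Toy document-term count matrix: 3 documents (rows), 3 tokens (columns)
tf_raw <- matrix(c(3, 0, 1,
                   0, 2, 1,
                   1, 1, 0), nrow = 3, byrow = TRUE)

# tf_weight options applied to the raw counts
tf_lognorm <- ifelse(tf_raw > 0, 1 + log(tf_raw), 0)  # "lognorm"
tf_binary  <- (tf_raw > 0) * 1                        # "binary"; "raw" is tf_raw itself

# idf_weight options; n = number of documents, df = document frequency
n  <- nrow(tf_raw)
df <- colSums(tf_raw > 0)
idf_plain  <- log(n / df)         # "idf": log of the simple inverse frequency
idf_smooth <- log(1 + n / df)     # "smooth": log of one plus the inverse frequency
idf_prob   <- log((n - df) / df)  # "prob": log odds of occurring in a random document

# The tf-idf matrix scales each column of the tf matrix by its idf weight
tfidf <- sweep(tf_lognorm, 2, idf_plain, `*`)
```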
# NOT RUN {
require(dplyr)
data(obama)
# Top words in the first Obama S.O.T.U., using all tokens
tfidf <- cnlp_utils_tfidf(obama)
vids <- order(tfidf[1,], decreasing = TRUE)[1:10]
colnames(tfidf)[vids]
# Top words, only using non-proper nouns
tfidf <- cnlp_get_token(obama) %>%
filter(pos %in% c("NN", "NNS")) %>%
cnlp_utils_tfidf()
vids <- order(tfidf[1,], decreasing = TRUE)[1:10]
colnames(tfidf)[vids]
# }