Given annotations, this function returns the term-frequency inverse document frequency (tf-idf) matrix from the extracted lemmas.
cnlp_utils_tfidf(
object,
tf_weight = c("lognorm", "binary", "raw", "dnorm"),
idf_weight = c("idf", "smooth", "prob", "uniform"),
min_df = 0.1,
max_df = 0.9,
max_features = 10000,
doc_var = "doc_id",
token_var = "lemma",
vocabulary = NULL,
doc_set = NULL
)
cnlp_utils_tf(
object,
tf_weight = "raw",
idf_weight = "uniform",
min_df = 0,
max_df = 1,
max_features = 10000,
doc_var = "doc_id",
token_var = "lemma",
vocabulary = NULL,
doc_set = NULL
)
Returns a sparse matrix with dimnames giving the documents and vocabulary.
object: a data frame containing an identifier for the document (set with doc_var) and token (set with token_var).
tf_weight: the weighting scheme for the term frequency matrix. The selection "lognorm" takes one plus the log of the raw frequency (or zero if the raw frequency is zero), "binary" encodes a zero-one matrix indicating simply whether the token exists at all in the document, "raw" returns raw counts, and "dnorm" uses double normalization.
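The four term-frequency weightings can be sketched in base R as follows. This is an illustration only, not the package's code: tf_weight_sketch is a hypothetical helper, it works on a dense count matrix rather than a sparse one, the natural log is an assumption (the text above does not give the base), and the "dnorm" arm assumes the common form 0.5 + 0.5 * count / (largest count in the document).

```r
# Hypothetical helper (not part of the package): apply one term-frequency
# weighting to a dense documents-by-terms matrix of raw counts.
tf_weight_sketch <- function(counts,
                             tf_weight = c("lognorm", "binary", "raw", "dnorm")) {
  tf_weight <- match.arg(tf_weight)
  switch(tf_weight,
    # one plus the log of the count, or zero where the count is zero
    lognorm = ifelse(counts > 0, 1 + log(counts), 0),
    # 1 if the token occurs in the document at all, else 0
    binary = (counts > 0) * 1,
    raw = counts,
    # assumed double normalization: divide each row by its largest count
    dnorm = 0.5 + 0.5 * counts / apply(counts, 1, max)
  )
}

counts <- matrix(c(2, 0, 1,
                   0, 3, 1), nrow = 2, byrow = TRUE,
  dimnames = list(c("doc1", "doc2"), c("cat", "dog", "fish")))

tf_lognorm <- tf_weight_sketch(counts, "lognorm")
tf_dnorm <- tf_weight_sketch(counts, "dnorm")
```

In this sketch a document with no tokens would produce NaN under "dnorm", because its largest count is zero; how the package treats that case is not stated here.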
idf_weight: the weighting scheme for the inverse document frequency matrix. The selection "idf" gives the logarithm of the simple inverse frequency, "smooth" gives the logarithm of one plus the simple inverse frequency, and "prob" gives the log odds of the token occurring in a randomly selected document. Set to "uniform" to return just the term frequencies.
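A base R sketch of the inverse document frequency weightings follows. It is illustrative only: idf_weight_sketch is a hypothetical helper, the natural log is an assumption, and for "prob" the sketch uses the usual probabilistic form log((N - df) / df), where N is the number of documents and df the number of documents containing the token. Whether the package uses exactly this orientation is an assumption.

```r
# Hypothetical helper (not part of the package): one idf weight per term,
# computed from a dense documents-by-terms matrix of raw counts.
idf_weight_sketch <- function(counts,
                              idf_weight = c("idf", "smooth", "prob", "uniform")) {
  idf_weight <- match.arg(idf_weight)
  n_docs <- nrow(counts)
  # number of documents in which each term occurs at least once
  doc_freq <- colSums(counts > 0)
  switch(idf_weight,
    idf = log(n_docs / doc_freq),
    smooth = log(1 + n_docs / doc_freq),
    # assumed orientation: common tokens get low (or negative) weight
    prob = log((n_docs - doc_freq) / doc_freq),
    # a weight of one leaves the term frequencies unchanged
    uniform = stats::setNames(rep(1, ncol(counts)), colnames(counts))
  )
}

counts <- matrix(c(2, 0, 1,
                   0, 3, 1,
                   0, 0, 4), nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("cat", "dog", "fish")))

idf <- idf_weight_sketch(counts, "idf")
# scale each column (term) of the term-frequency matrix by its idf weight
tfidf <- sweep(counts, 2, idf, `*`)
```

Here "fish" occurs in every document, so its "idf" weight is zero and its tf-idf column is all zeros. Under "prob" the same token evaluates to log(0), which R returns as -Inf.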
min_df: the minimum proportion of documents a token should be in to be included in the vocabulary.
max_df: the maximum proportion of documents a token should be in to be included in the vocabulary.
max_features: the maximum number of tokens in the vocabulary.
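How the three vocabulary limits interact can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: select_vocab_sketch is a hypothetical helper, both bounds are treated as inclusive, and max_features is assumed to keep the tokens found in the most documents.

```r
# Hypothetical helper (not part of the package): choose a vocabulary from
# parallel vectors of document ids and tokens.
select_vocab_sketch <- function(doc_id, token, min_df = 0.1, max_df = 0.9,
                                max_features = 10000) {
  # count each (document, token) pair once, however often the token repeats
  pairs <- unique(data.frame(doc_id = doc_id, token = token,
                             stringsAsFactors = FALSE))
  n_docs <- length(unique(doc_id))
  # proportion of documents that contain each token
  doc_prop <- vapply(split(pairs$doc_id, pairs$token), length, integer(1)) / n_docs
  keep <- doc_prop[doc_prop >= min_df & doc_prop <= max_df]
  # assumed tie-break for max_features: most widespread tokens first
  keep <- keep[order(keep, decreasing = TRUE)]
  names(keep)[seq_len(min(max_features, length(keep)))]
}

doc_id <- c("d1", "d1", "d2", "d2", "d3", "d3", "d4", "d4")
token  <- c("the", "cat", "the", "dog", "the", "cat", "the", "bird")

vocab_filtered <- select_vocab_sketch(doc_id, token, min_df = 0.3, max_df = 0.9)
vocab_top2 <- select_vocab_sketch(doc_id, token, min_df = 0, max_df = 1,
                                  max_features = 2)
```

With these inputs "the" appears in every document and is dropped by max_df = 0.9, while "dog" and "bird" appear in a quarter of the documents and are dropped by min_df = 0.3, leaving only "cat" in vocab_filtered.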
doc_var: character vector. The name of the column in object that contains the document ids. Defaults to "doc_id".
token_var: character vector. The name of the column in object that contains the tokens. Defaults to "lemma".
vocabulary: character vector. The vocabulary set to use in constructing the matrices. Will be computed within the function if set to NULL. When supplied, the options min_df, max_df, and max_features are ignored.
doc_set: optional character vector of document ids. Useful to create empty rows in the output matrix for documents without data in the input. Most users will want to keep this equal to NULL, the default, to have the function compute the document set automatically.
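The effect of fixing vocabulary and doc_set can be sketched in base R. This is an illustration only: count_matrix_sketch is a hypothetical helper, and it returns a dense matrix, whereas the functions documented here return a sparse one.

```r
# Hypothetical helper (not part of the package): raw count matrix with rows
# fixed by doc_set and columns fixed by vocabulary.
count_matrix_sketch <- function(doc_id, token, vocabulary, doc_set = NULL) {
  # with no doc_set, fall back to the documents seen in the input
  if (is.null(doc_set)) doc_set <- unique(doc_id)
  keep <- token %in% vocabulary & doc_id %in% doc_set
  # factor levels force a row per document and a column per vocabulary term,
  # including those with no matching data
  tab <- table(factor(doc_id[keep], levels = doc_set),
               factor(token[keep], levels = vocabulary))
  matrix(as.vector(tab), nrow = length(doc_set), ncol = length(vocabulary),
         dimnames = list(doc_set, vocabulary))
}

doc_id <- c("d1", "d1", "d1", "d2")
token  <- c("cat", "cat", "dog", "fish")

m <- count_matrix_sketch(doc_id, token,
                         vocabulary = c("cat", "dog"),
                         doc_set = c("d1", "d2", "d3"))
```

Document "d3" has no rows in the input but still receives an all-zero row in m, and "fish" is dropped because it is outside the supplied vocabulary.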