document_term_frequencies_statistics: Add Term Frequency, Inverse Document Frequency and Okapi BM25 statistics to the output of document_term_frequencies
Description
Term Frequency Inverse Document Frequency (tfidf) is calculated as the product of
Term Frequency (tf): the number of times the term occurs in the document / the number of terms in the document
Inverse Document Frequency (idf): log(number of documents / number of documents in which the term appears)
The Okapi BM25 statistic is calculated as the product of the inverse document frequency
and the weighted term frequency as defined at https://en.wikipedia.org/wiki/Okapi_BM25.
The arguments k and b are the tuning parameters of that weighted term frequency.
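The definitions above can be checked by hand on a toy table. The following is a minimal sketch in base R, not the package's own implementation: the data, the helper variable names and the added column names are made up for illustration, and the weighted term frequency is assumed to follow the form given on the linked Wikipedia page, freq * (k + 1) / (freq + k * (1 - b + b * document length / average document length)).

```r
# Toy counts: one row per (doc_id, term) pair, with the term count in freq.
# Illustrative data only; column names for the results are assumptions.
x <- data.frame(
  doc_id = c("d1", "d1", "d2", "d2", "d3"),
  term   = c("beer", "room", "beer", "metro", "beer"),
  freq   = c(2, 2, 1, 3, 4),
  stringsAsFactors = FALSE
)
k <- 1.2   # same defaults as shown under Usage
b <- 0.75

n_docs         <- length(unique(x$doc_id))
doc_length     <- ave(x$freq, x$doc_id, FUN = sum)     # terms per document, repeated per row
docs_with_term <- ave(x$freq, x$term, FUN = length)    # number of documents containing the term
avg_doc_length <- mean(tapply(x$freq, x$doc_id, sum))  # average document length

# tf, idf and their product, as stated in the Description
x$tf    <- x$freq / doc_length
x$idf   <- log(n_docs / docs_with_term)
x$tfidf <- x$tf * x$idf

# BM25: idf times the weighted term frequency (Wikipedia form, assumed here)
x$tf_bm25 <- (x$freq * (k + 1)) /
  (x$freq + k * (1 - b + b * doc_length / avg_doc_length))
x$bm25 <- x$idf * x$tf_bm25

print(x)
```

In this toy table "beer" occurs in every document, so its idf is log(3 / 3) = 0 and both its tfidf and its bm25 are 0, whatever its frequency. Terms that occur in a single document get idf = log(3).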
Usage
document_term_frequencies_statistics(x, k = 1.2, b = 0.75)
Arguments
x
a data.table as returned by document_term_frequencies containing the columns doc_id, term and freq.
k
parameter k1 of the Okapi BM25 ranking function as defined at https://en.wikipedia.org/wiki/Okapi_BM25. Defaults to 1.2.
b
parameter b of the Okapi BM25 ranking function as defined at https://en.wikipedia.org/wiki/Okapi_BM25. Defaults to 0.75.
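Examples
library(udpipe)
data(brussels_reviews_anno)
# Count how often each token occurs in each document
x <- document_term_frequencies(brussels_reviews_anno[, c("doc_id", "token")])
# Add the term frequency, inverse document frequency and BM25 statistics
x <- document_term_frequencies_statistics(x)
head(x)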