Calculates the Euclidean distance (up to a proportionality constant) between document term distributions and a set of reference distributions.
reference_distribution_distance(category_reference_distribution,
document_term_matrix, inverse_frequency_weighting = TRUE,
large_matrix = FALSE)
category_reference_distribution: A simple_triplet_matrix where each row represents the distribution over terms in a particular category. Rows can be normalized proportions or raw counts.
document_term_matrix: A simple_triplet_matrix where each row represents a document and each column a term in the vocabulary. The columns of both matrices should correspond to the same terms in the same order.
inverse_frequency_weighting: If TRUE, distances are weighted by the inverse of each term's aggregate count in the document term matrix, so differences in frequently occurring terms carry less weight than differences in rarely occurring terms. Defaults to TRUE.
large_matrix: If TRUE, a method that is robust to large matrices is used. Set this if you get an error of the form: "'i, j, nrow, ncol' invalid type". Defaults to FALSE.
A data frame with the distance from each document to each reference distribution. The last column indicates the closest reference distribution for each document.
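
The following is a minimal base-R sketch of the kind of calculation described above. It is not the function's actual implementation. It assumes dense matrices in place of simple_triplet_matrix inputs, assumes rows are rescaled to proportions before comparison, and assumes the distance is the square root of the weighted sum of squared differences. The helper name weighted_reference_distance and the toy data are hypothetical.

```r
# Illustrative sketch only: dense base-R matrices stand in for the
# simple_triplet_matrix inputs, and weighted_reference_distance() is a
# hypothetical helper, not the package's implementation.
weighted_reference_distance <- function(reference, dtm,
                                        inverse_frequency_weighting = TRUE) {
  stopifnot(ncol(reference) == ncol(dtm))  # columns must line up term by term

  # Assumption: rows are rescaled to proportions so that raw counts and
  # already-normalized inputs are treated alike.
  ref_prop <- reference / rowSums(reference)
  doc_prop <- dtm / rowSums(dtm)

  # Weight each term by the inverse of its aggregate count in the DTM.
  # Terms that never occur get weight 0 to avoid dividing by zero.
  term_counts <- colSums(dtm)
  weights <- if (inverse_frequency_weighting) {
    ifelse(term_counts > 0, 1 / term_counts, 0)
  } else {
    rep(1, ncol(dtm))
  }

  # distances[i, k] = sqrt(sum_j w_j * (doc_prop[i, j] - ref_prop[k, j])^2)
  distances <- sapply(seq_len(nrow(ref_prop)), function(k) {
    diffs <- sweep(doc_prop, 2, ref_prop[k, ], "-")
    sqrt(as.vector(diffs^2 %*% weights))
  })
  distances <- matrix(distances, nrow = nrow(dtm),
                      dimnames = list(rownames(dtm), rownames(reference)))

  # One column per reference distribution, plus the closest one last.
  result <- as.data.frame(distances)
  result$closest <- names(result)[apply(distances, 1, which.min)]
  result
}

# Toy data (made up for illustration): two categories, two documents.
terms <- c("goal", "the", "vote")
reference <- matrix(c(8, 1, 1,
                      1, 1, 8),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("sports", "politics"), terms))
dtm <- matrix(c(5, 2, 0,
                0, 3, 6),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("doc1", "doc2"), terms))

result <- weighted_reference_distance(reference, dtm)
print(result)
```

The result mirrors the documented return shape: one distance column per reference distribution, followed by a final column naming the closest one. Setting inverse_frequency_weighting = FALSE in the sketch gives every term equal weight.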