A primarily internal function for calculating FREX words.
We expect most users will use labelTopics instead.
calcfrex(logbeta, w = 0.5, wordcounts = NULL)
logbeta: a K by V matrix containing the log probabilities of seeing word v conditional on topic k
w: a value between 0 and 1 indicating the proportion of the weight assigned to frequency
wordcounts: a vector of word counts. If provided, a James-Stein type shrinkage estimator is applied to stabilize the exclusivity probabilities. This addresses the concern that the rarest words will always be completely exclusive.
FREX attempts to find words which are both frequent in and exclusive to a topic of interest. Balancing these two traits is important: frequent words are often simply functional words necessary to discuss any topic, while completely exclusive words can be so rare as to be uninformative. This accords with a long-running trend in natural language processing, best exemplified by the term frequency-inverse document frequency (tf-idf) metric.
Our notion of FREX comes from a paper by Bischof and Airoldi (2012), which proposed a Hierarchical Poisson Deconvolution model. Their model relies on a known hierarchical structure in the documents and requires a rather complicated estimation scheme. We wanted a metric that would capture their core insight but still be fast to compute.
Bischof and Airoldi consider as a summary for a word's contribution to a topic the harmonic mean of the word's rank in terms of exclusivity and frequency. The harmonic mean is attractive here because it does not allow a high rank along one of the dimensions to compensate for the lower rank in another. Thus words with a high score must be high along both dimensions.
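To illustrate why the harmonic mean is used, the short sketch below compares it with the arithmetic mean for a word that scores very high on one dimension and very low on the other. The scores 0.99 and 0.05 are made-up values chosen only for illustration.

```r
# Illustrative scores only: a word that is very frequent but barely exclusive.
freq_score <- 0.99
excl_score <- 0.05
w <- 0.5  # equal weight on frequency and exclusivity

# The weighted arithmetic mean lets the high score mask the low one.
arith <- w * freq_score + (1 - w) * excl_score

# The weighted harmonic mean is pulled toward the lower score.
harm <- 1 / (w / freq_score + (1 - w) / excl_score)

print(c(arithmetic = arith, harmonic = harm))
```

The arithmetic mean lands above 0.5, while the harmonic mean stays below 0.1, so the word would not be ranked highly.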
The formula is $$FREX = \left(\frac{w}{F} + \frac{1-w}{E}\right)^{-1}$$ where F is the frequency score, given by the empirical CDF of the word in its topic distribution, and E is the exclusivity score. Exclusivity is calculated by column-normalizing the beta matrix (so that each entry represents the conditional probability of seeing the topic given the word) and then computing the empirical CDF of the word within the topic. Thus words with high values are those where most of the mass for that word is assigned to the given topic.
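The computation described above can be sketched in a few lines of base R. This is a minimal illustration and not the package's implementation: the helper names frex_sketch and ecdf_rows are hypothetical, the empirical CDF is approximated as rank divided by V, and the wordcounts shrinkage step is omitted.

```r
# Hypothetical helper for illustration; not the function shipped in the package.
frex_sketch <- function(logbeta, w = 0.5) {
  beta <- exp(logbeta)   # K x V matrix of word probabilities per topic
  V <- ncol(beta)

  # Column-normalize so each column sums to 1 across topics.
  excl <- sweep(beta, 2, colSums(beta), "/")

  # Empirical CDF of each word within its topic (row), taken as rank / V.
  ecdf_rows <- function(m) t(apply(m, 1, rank)) / V
  freq_score <- ecdf_rows(beta)
  excl_score <- ecdf_rows(excl)

  # Weighted harmonic mean of the two scores.
  1 / (w / freq_score + (1 - w) / excl_score)
}

# Toy example with K = 2 topics and V = 4 words (made-up probabilities).
beta <- rbind(c(0.4, 0.3, 0.2, 0.1),
              c(0.4, 0.1, 0.1, 0.4))
frex <- frex_sketch(log(beta), w = 0.5)
print(round(frex, 3))
```

In this toy example word 1 is the most frequent word in topic 1 but is shared equally with topic 2, so word 2, which is both fairly frequent and more exclusive, receives the highest score for topic 1.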
For rare words exclusivity will always be very high because there simply aren't many instances of the word.
If wordcounts are passed, the function will calculate a regularized form of this distribution using a James-Stein type estimator described in js.estimate.
Bischof and Airoldi (2012) "Summarizing topical content with word frequency and exclusivity" In Proceedings of the International Conference on Machine Learning.