ldaPrototype (version 0.3.1)

jsTopics: Pairwise Jensen-Shannon Similarities (Divergences)

Description

Calculates the similarity of all pairwise topic combinations using the Jensen-Shannon Divergence.

Usage

jsTopics(topics, epsilon = 1e-06, progress = TRUE, pm.backend, ncpus)

Arguments

topics

[named matrix] The counts of vocabulary words (rows) in topics (columns).

epsilon

[numeric(1)] Numerical value added to topics to ensure computability. See details. Default is 1e-06.

progress

[logical(1)] Should a nice progress bar be shown? Turning it off could lead to a significantly faster calculation. Default is TRUE. If pm.backend is set, parallelization is done and no progress bar will be shown.

pm.backend

[character(1)] One of "multicore", "socket" or "mpi". If pm.backend is set, parallelStart is called before computation is started and parallelStop is called after.

ncpus

[integer(1)] Number of (physical) CPUs to use. If pm.backend is passed, default is determined by availableCores.
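
For larger topic matrices the computation can be parallelized via pm.backend and ncpus. A minimal sketch, assuming topics is a word-topic count matrix as returned by mergeTopics (see the Examples below):

# parallel computation on a socket cluster with two CPUs
# (assumes `topics` as created via mergeTopics, see Examples)
js_par = jsTopics(topics, pm.backend = "socket", ncpus = 2)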

Value

[named list] with entries

sims

[lower triangular named matrix] with all pairwise similarities of the given topics.

wordslimit

[integer] = vocabulary size. See jaccardTopics for original purpose.

wordsconsidered

[integer] = vocabulary size. See jaccardTopics for original purpose.

param

[named list] with parameter specifications for type [character(1)] = "Jensen-Shannon Divergence" and epsilon [numeric(1)]. See above for explanation.
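
A minimal sketch of accessing these entries, assuming js is the object returned by jsTopics(topics) as in the Examples below and that its entries are accessible by name like an ordinary named list:

js$param            # parameter specification: type and epsilon
getSimilarity(js)   # documented accessor for the lower triangular similarity matrix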

Details

The Jensen-Shannon Similarity for two topics \(\bm z_{i}\) and \(\bm z_{j}\) is calculated by $$JS(\bm z_{i}, \bm z_{j}) = 1 - \left( KLD\left(\bm p_i, \frac{\bm p_i + \bm p_j}{2}\right) + KLD\left(\bm p_j, \frac{\bm p_i + \bm p_j}{2}\right) \right)/2$$ $$= 1 - KLD(\bm p_i, \bm p_i + \bm p_j)/2 - KLD(\bm p_j, \bm p_i + \bm p_j)/2 - \log(2),$$ where \(V\) is the vocabulary size, \(\bm p_k = \left(p_k^{(1)}, ..., p_k^{(V)}\right)\), and \(p_k^{(v)}\) is the proportion of assignments of the \(v\)-th word to the \(k\)-th topic. KLD denotes the Kullback-Leibler Divergence, calculated by $$KLD(\bm p_{k}, \bm p_{\Sigma}) = \sum_{v=1}^{V} p_k^{(v)} \log{\frac{p_k^{(v)}}{p_{\Sigma}^{(v)}}}.$$

An epsilon is added to every \(n_k^{(v)}\), i.e. to the counts (not the proportions) of assignments, to ensure computability in the presence of zeros.
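
The following is a minimal sketch of this computation in plain R, assuming topics is a vocabulary-by-topic count matrix as described above; it only illustrates the formula and is not the package's internal implementation:

jsSketch = function(topics, epsilon = 1e-06) {
  counts = topics + epsilon                       # add epsilon to every count n_k^(v)
  p = sweep(counts, 2, colSums(counts), "/")      # column-wise proportions p_k
  kld = function(a, b) sum(a * log(a / b))        # Kullback-Leibler Divergence
  K = ncol(p)
  sims = matrix(NA_real_, K, K, dimnames = list(colnames(p), colnames(p)))
  for (i in seq_len(K - 1)) {
    for (j in seq(i + 1, K)) {
      m = (p[, i] + p[, j]) / 2                   # mixture of the two topic distributions
      sims[j, i] = 1 - (kld(p[, i], m) + kld(p[, j], m)) / 2
    }
  }
  sims                                            # lower triangular similarity matrix
}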

See Also

Other TopicSimilarity functions: cosineTopics(), dendTopics(), getSimilarity(), jaccardTopics(), rboTopics()

Examples

# NOT RUN {
# four LDA runs with 10 topics each on the Reuters example data
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
# word-topic count matrix over the topics of all runs
topics = mergeTopics(res, vocab = reuters_vocab)
# pairwise Jensen-Shannon similarities
js = jsTopics(topics)
js

# extract the lower triangular similarity matrix
sim = getSimilarity(js)
dim(sim)

# effect of a larger epsilon on the similarities
js1 = jsTopics(topics, epsilon = 1)
sim1 = getSimilarity(js1)
summary((sim1 - sim)[lower.tri(sim)])
plot(sim, sim1, xlab = "epsilon = 1e-6", ylab = "epsilon = 1")

# }
