
lda (version 1.1)

top.topic.words: Get the Top Words and Documents in Each Topic

Description

This function takes a model fitted using lda.collapsed.gibbs.sampler and returns a matrix of the top words in each topic. Similarly, top.topic.documents returns a matrix of the top documents in each topic.

Usage

top.topic.words(topics, num.words = 20, by.score = FALSE)
top.topic.documents(document_sums, num.documents = 20, alpha = 0.1)

Arguments

topics
For top.topic.words, a $K \times V$ matrix where each entry is a numeric proportional to the probability of seeing the word (column) conditioned on the topic (row) (this entry is sometimes denoted $\beta_{w,k}$ in the literature; see lda.collapsed.gibbs.sampler for the format).
num.words
For top.topic.words, the number of top words to return for each topic.
document_sums
For top.topic.documents, a $K \times D$ matrix where each entry is a numeric proportional to the probability of seeing a topic (row) conditioned on the document (column) (this entry is sometimes denoted $\theta_{d,k}$ in the literature).
num.documents
For top.topic.documents, the number of top documents to return for each topic.
by.score
If by.score is set to FALSE (default), then words in each topic will be ranked according to probability mass for each word $\beta_{w,k}$. If by.score is TRUE, then words will be ranked according to a score which down-weights words that have high probability in every topic, $\beta_{w,k} \left(\log \beta_{w,k} - \frac{1}{K} \sum_{k'} \log \beta_{w,k'}\right)$.
alpha
For top.topic.documents, the scalar Dirichlet hyperparameter used to smooth the topic proportions in document_sums before ranking.
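To illustrate the two rankings, here is a minimal base-R sketch on a toy $2 \times 4$ matrix of word probabilities (toy values, not from any fitted model). The score used for by.score = TRUE follows the formula above; the internals of top.topic.words itself may differ in detail.

```r
## Toy beta: 2 topics (rows) x 4 words (columns), rows sum to 1.
beta <- matrix(c(0.5, 0.3, 0.1, 0.1,
                 0.4, 0.1, 0.4, 0.1),
               nrow = 2, byrow = TRUE,
               dimnames = list(NULL, c("model", "data", "gene", "cell")))

## by.score = FALSE: rank words by probability mass within each topic.
rank_by_mass <- apply(beta, 1,
                      function(p) colnames(beta)[order(p, decreasing = TRUE)])

## by.score = TRUE: down-weight words with high probability in every topic.
score <- beta * sweep(log(beta), 2, colMeans(log(beta)))
rank_by_score <- apply(score, 1,
                       function(s) colnames(beta)[order(s, decreasing = TRUE)])
```

Here "model" has the largest mass in topic 1, but because it is probable in both topics its score drops below that of "data", so the by.score ranking promotes the more topic-specific word.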
Value

  • For top.topic.words, a $num.words \times K$ character matrix where each column contains the top words for that topic.
  • For top.topic.documents, a $num.documents \times K$ integer matrix where each column contains the top documents for that topic. The entries in the matrix are column-indexed references into document_sums.

References

Blei, David M., Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.

See Also

lda.collapsed.gibbs.sampler for the format of topics.

predictive.distribution demonstrates another use for a fitted topic matrix.

Examples

## From demo(lda).
library(lda)

data(cora.documents)
data(cora.vocab)

K <- 10 ## Num clusters
result <- lda.collapsed.gibbs.sampler(cora.documents,
                                      K,   ## Num clusters
                                      cora.vocab,
                                      25,  ## Num iterations
                                      0.1, ## alpha
                                      0.1) ## eta

## Get the top words in each cluster
top.words <- top.topic.words(result$topics, 5, by.score=TRUE)

## top.words:
##      [,1]             [,2]        [,3]       [,4]            [,5]      
## [1,] "decision"       "network"   "planning" "learning"      "design"  
## [2,] "learning"       "time"      "visual"   "networks"      "logic"   
## [3,] "tree"           "networks"  "model"    "neural"        "search"  
## [4,] "trees"          "algorithm" "memory"   "system"        "learning"
## [5,] "classification" "data"      "system"   "reinforcement" "systems" 
##      [,6]         [,7]       [,8]           [,9]           [,10]      
## [1,] "learning"   "models"   "belief"       "genetic"      "research" 
## [2,] "search"     "networks" "model"        "search"       "reasoning"
## [3,] "crossover"  "bayesian" "theory"       "optimization" "grant"    
## [4,] "algorithm"  "data"     "distribution" "evolutionary" "science"  
## [5,] "complexity" "hidden"   "markov"       "function"     "supported"
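The companion function can be applied in the same way: top.topic.documents(result$document_sums, 5) returns column indices into document_sums. Because the sampler's output is random, here is instead a self-contained base-R sketch on a toy document_sums matrix (hypothetical values) showing how such indices relate back to the matrix; the smoothing-and-ranking below is analogous to, though not necessarily identical to, what top.topic.documents computes.

```r
## Toy document_sums: 2 topics (rows) x 4 documents (columns) of topic counts.
document_sums <- matrix(c(8, 1, 5, 0,
                          2, 9, 5, 1),
                        nrow = 2, byrow = TRUE)

## Smooth each document's counts with alpha, then normalize per document
## (column) to get topic proportions.
alpha <- 0.1
props <- t(t(document_sums + alpha) / colSums(document_sums + alpha))

## For each topic (row), order documents by their proportion of that topic.
top.docs <- apply(props, 1, function(p) order(p, decreasing = TRUE))

## top.docs[1, k] is the column index in document_sums of the document
## most strongly associated with topic k.
```

Note that ranking uses per-document proportions rather than raw counts, so a short document dominated by one topic can outrank a long document that merely mentions it often.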
