
lda (version 1.1)

top.topic.words: Get the Top Words and Documents in Each Topic

Description

This function takes a model fitted using lda.collapsed.gibbs.sampler and returns a matrix of the top words in each topic. Similarly, top.topic.documents returns a matrix of the top documents in each topic.

Usage

top.topic.words(topics, num.words = 20, by.score = FALSE)
top.topic.documents(document_sums, num.documents = 20, alpha = 0.1)

Arguments

topics
For top.topic.words, a $K \times V$ matrix where each entry is a numeric proportional to the probability of seeing the word (column) conditioned on the topic (row) (this entry is sometimes denoted $\beta_{w,k}$ in the literature; see lda.collapsed.gibbs.sampler for the format).
num.words
For top.topic.words, the number of top words to return for each topic.
document_sums
For top.topic.documents, a $K \times D$ matrix where each entry is a numeric proportional to the probability of seeing a topic (row) conditioned on the document (column) (this entry is sometimes denoted $\theta_{d,k}$ in the literature).
num.documents
For top.topic.documents, the number of top documents to return for each topic.
by.score
If by.score is set to FALSE (default), then words in each topic will be ranked according to probability mass for each word $\beta_{w,k}$. If by.score is TRUE, then words will be ranked according to a score which down-weights words that have high probability in every topic, $\beta_{w,k} \left(\log \beta_{w,k} - \frac{1}{K} \sum_{k'} \log \beta_{w,k'}\right)$.
alpha
For top.topic.documents, the scalar Dirichlet hyperparameter used to smooth the topic proportions in document_sums before ranking.
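To illustrate the two rankings, here is a minimal base-R sketch on a toy $2 \times 4$ matrix of word probabilities (toy values, not from any fitted model). The score used for by.score = TRUE follows the formula above; the internals of top.topic.words itself may differ in detail.

```r
## Toy beta: 2 topics (rows) x 4 words (columns), rows sum to 1.
beta <- matrix(c(0.5, 0.3, 0.1, 0.1,
                 0.4, 0.1, 0.4, 0.1),
               nrow = 2, byrow = TRUE,
               dimnames = list(NULL, c("model", "data", "gene", "cell")))

## by.score = FALSE: rank words by probability mass within each topic.
rank_by_mass <- apply(beta, 1,
                      function(p) colnames(beta)[order(p, decreasing = TRUE)])

## by.score = TRUE: down-weight words with high probability in every topic.
score <- beta * sweep(log(beta), 2, colMeans(log(beta)))
rank_by_score <- apply(score, 1,
                       function(s) colnames(beta)[order(s, decreasing = TRUE)])
```

Here "model" has the largest mass in topic 1, but because it is probable in both topics its score drops below that of "data", so the by.score ranking promotes the more topic-specific word.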
Value

  • For top.topic.words, a $num.words \times K$ character matrix where each column contains the top words for that topic.
  • For top.topic.documents, a $num.documents \times K$ integer matrix where each column contains the top documents for that topic. The entries in the matrix are column-indexed references into document_sums.

References

Blei, David M., Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.

See Also

lda.collapsed.gibbs.sampler for the format of topics.

predictive.distribution demonstrates another use for a fitted topic matrix.

Examples

## From demo(lda).
library(lda)

data(cora.documents)
data(cora.vocab)

K <- 10 ## Num clusters
result <- lda.collapsed.gibbs.sampler(cora.documents,
                                      K,   ## Num clusters
                                      cora.vocab,
                                      25,  ## Num iterations
                                      0.1, ## alpha
                                      0.1) ## eta

## Get the top words in each cluster
top.words <- top.topic.words(result$topics, 5, by.score=TRUE)

## top.words:
##      [,1]             [,2]        [,3]       [,4]            [,5]      
## [1,] "decision"       "network"   "planning" "learning"      "design"  
## [2,] "learning"       "time"      "visual"   "networks"      "logic"   
## [3,] "tree"           "networks"  "model"    "neural"        "search"  
## [4,] "trees"          "algorithm" "memory"   "system"        "learning"
## [5,] "classification" "data"      "system"   "reinforcement" "systems" 
##      [,6]         [,7]       [,8]           [,9]           [,10]      
## [1,] "learning"   "models"   "belief"       "genetic"      "research" 
## [2,] "search"     "networks" "model"        "search"       "reasoning"
## [3,] "crossover"  "bayesian" "theory"       "optimization" "grant"    
## [4,] "algorithm"  "data"     "distribution" "evolutionary" "science"  
## [5,] "complexity" "hidden"   "markov"       "function"     "supported"
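The companion function can be applied in the same way: top.topic.documents(result$document_sums, 5) returns column indices into document_sums. Because the sampler's output is random, here is instead a self-contained base-R sketch on a toy document_sums matrix (hypothetical values) showing how such indices relate back to the matrix; the smoothing-and-ranking below is analogous to, though not necessarily identical to, what top.topic.documents computes.

```r
## Toy document_sums: 2 topics (rows) x 4 documents (columns) of topic counts.
document_sums <- matrix(c(8, 1, 5, 0,
                          2, 9, 5, 1),
                        nrow = 2, byrow = TRUE)

## Smooth each document's counts with alpha, then normalize per document
## (column) to get topic proportions.
alpha <- 0.1
props <- t(t(document_sums + alpha) / colSums(document_sums + alpha))

## For each topic (row), order documents by their proportion of that topic.
top.docs <- apply(props, 1, function(p) order(p, decreasing = TRUE))

## top.docs[1, k] is the column index in document_sums of the document
## most strongly associated with topic k.
```

Note that ranking uses per-document proportions rather than raw counts, so a short document dominated by one topic can outrank a long document that merely mentions it often.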
