
lda (version 1.1)

lda.collapsed.gibbs.sampler: Functions to Fit LDA-type models

Description

These functions use a collapsed Gibbs sampler to fit three different models: latent Dirichlet allocation (LDA), the mixed-membership stochastic blockmodel (MMSB), and supervised LDA (sLDA). These functions take sparsely represented input documents, perform inference, and return point estimates of the latent parameters using the state at the last iteration of Gibbs sampling.

Usage

lda.collapsed.gibbs.sampler(documents, K, vocab, num.iterations, alpha,
eta, initial = NULL, burnin = NULL, compute.log.likelihood = FALSE)

slda.em(documents, K, vocab, num.e.iterations, num.m.iterations, alpha,
eta, annotations, params, variance, logistic = FALSE, lambda = 10,
method = "sLDA")

mmsb.collapsed.gibbs.sampler(network, K, num.iterations, alpha,
beta.prior, initial = NULL, burnin = NULL)

Arguments

documents
A list whose length is equal to the number of documents, D. Each element of documents is an integer matrix with two rows. Each column of documents[[i]] (i.e., document $i$) represents a word occurring in the document: documents[[i]][1, j] is a 0-indexed word identifier for the jth word in document $i$ (that is, an index minus one into vocab), and documents[[i]][2, j] is an integer specifying the number of times that word appears in the document. A sketch of this format appears after this list.

network
For mmsb.collapsed.gibbs.sampler, a $D \times D$ matrix (coercible to logical) representing the adjacency matrix of the network. Note that elements on the diagonal are ignored.
K
An integer representing the number of topics in the model.
vocab
A character vector specifying the vocabulary words associated with the word indices used in documents.
num.iterations
The number of sweeps of Gibbs sampling over the entire corpus to make.
num.e.iterations
For slda.em, the number of Gibbs sampling sweeps to make over the entire corpus for each iteration of EM.
num.m.iterations
For slda.em, the number of EM iterations to make.
alpha
The scalar value of the Dirichlet hyperparameter for topic proportions.
beta.prior
For mmsb.collapsed.gibbs.sampler, the Beta hyperparameter for each entry of the block relations matrix. This parameter should be a length-2 list whose entries are $K \times K$ matrices. The elements of the two matrices comprise the two parameters of the Beta prior for each entry of the block relations matrix.
eta
The scalar value of the Dirichlet hyperparameter for topic multinomials.
initial
A list of initial topic assignments for words. It should be in the same format as the assignments field of the return value. If this field is NULL, then the sampler will be initialized with random assignments.
burnin
A scalar integer indicating the number of Gibbs sweeps to consider as burn-in (i.e., throw away) for lda.collapsed.gibbs.sampler and mmsb.collapsed.gibbs.sampler. If this parameter is non-NULL, it will also have the side effect of enabling the document_expects field of the return value (see below). Note that burn-in sweeps do not count towards num.iterations.
compute.log.likelihood
A scalar logical which, when TRUE, will cause the sampler to compute the log likelihood of the words (to within a constant factor) after each sweep over the variables. The log likelihood for each iteration is stored in the log.likelihoods field of the result.
annotations
A length-D numeric vector of covariates, one per document. Only used by slda.em, which models each document jointly with its numeric annotation.
params
For slda.em, a length K numeric vector of regression coefficients at which the EM algorithm should be initialized.
variance
For slda.em, the variance associated with the Gaussian response modeling the annotations in annotations.
logistic
For slda.em, a scalar logical which, when TRUE, causes the annotations to be modeled using a logistic response instead of a Gaussian (the covariates will be coerced to logicals).
lambda
Currently unused.
method
For slda.em, a character indicating how to model the annotations. Only "sLDA", the stock model given in the references, is officially supported at the moment.
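
For concreteness, here is a minimal sketch of the documents format described above (the vocabulary and counts are invented for illustration):

library(lda)

## A two-document toy corpus in the sparse two-row format.
## Column j of documents[[i]] is c(word index into vocab minus 1, count).
vocab <- c("topic", "model", "gibbs")
documents <- list(
  matrix(as.integer(c(0, 2,    # "topic" occurs twice in document 1
                      1, 1)),  # "model" occurs once in document 1
         nrow = 2),
  matrix(as.integer(c(2, 3)), nrow = 2)  # "gibbs" occurs three times in document 2
)
result <- lda.collapsed.gibbs.sampler(documents, K = 2, vocab = vocab,
                                      num.iterations = 25, alpha = 0.1,
                                      eta = 0.1)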

Value

A fitted model as a list with the following components:

  • assignments: A list of length D. Each element of the list, say assignments[[i]], is an integer vector of the same length as the number of columns in documents[[i]] indicating the topic assignment for each word.
  • topics: A $K \times V$ matrix where each entry indicates the number of times a word (column) was assigned to a topic (row). The column names should correspond to the vocabulary words given in vocab.
  • topic_sums: A length-K vector where each entry indicates the total number of times words were assigned to each topic.
  • document_sums: A $K \times D$ matrix where each entry is an integer indicating the number of times words in each document (column) were assigned to each topic (row).
  • log.likelihoods: Only for lda.collapsed.gibbs.sampler. A length num.iterations vector of log likelihoods when the flag compute.log.likelihood is set to TRUE.
  • document_expects: This field only exists if burnin is non-NULL. It is like document_sums, but instead of aggregating counts for only the last iteration, it aggregates counts over all iterations after burn-in.
  • net.assignments.left: Only for mmsb.collapsed.gibbs.sampler. A $D \times D$ integer matrix of topic assignments for the source document corresponding to the link between one document (row) and another (column).
  • net.assignments.right: Only for mmsb.collapsed.gibbs.sampler. A $D \times D$ integer matrix of topic assignments for the destination document corresponding to the link between one document (row) and another (column).
  • blocks.neg: Only for mmsb.collapsed.gibbs.sampler. A $K \times K$ integer matrix indicating the number of times the source of a non-link was assigned to one topic (row) and the destination was assigned to another (column).
  • blocks.pos: Only for mmsb.collapsed.gibbs.sampler. A $K \times K$ integer matrix indicating the number of times the source of a link was assigned to one topic (row) and the destination was assigned to another (column).
  • model: For slda.em, a model of type lm: the regression model fitted to the annotations.
  • coefs: For slda.em, a length-K numeric vector of coefficients for the regression model.
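
As an illustration of how these counts are typically post-processed (a sketch, not something the sampler does for you), per-document topic proportions can be obtained by smoothing document_sums with alpha and normalizing, which is a conventional point estimate:

## 'result' is a fit from lda.collapsed.gibbs.sampler() (e.g., the sketch
## after the Arguments section); alpha is the value used in that fit.
alpha <- 0.1
topic.proportions <- t(result$document_sums) + alpha                  # D x K counts plus prior
topic.proportions <- topic.proportions / rowSums(topic.proportions)  # rows sum to 1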

References

Blei, David M. and Ng, Andrew and Jordan, Michael. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003.

Airoldi, Edoardo M. and Blei, David M. and Fienberg, Stephen E. and Xing, Eric P. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 2008.

Blei, David M. and McAuliffe, John. Supervised topic models. Advances in Neural Information Processing Systems, 2008.

Griffiths, Thomas L. and Steyvers, Mark. Finding scientific topics. Proceedings of the National Academy of Sciences, 2004.

See Also

read.documents and lexicalize can be used to generate the input data to these models.

top.topic.words and predictive.distribution for operations on the fitted models.

Examples

## See demos for the three functions:

demo(lda)

demo(slda)

demo(mmsb)
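
## A self-contained toy run (a sketch: the corpus and hyperparameter
## values below are illustrative, not taken from the demos):

library(lda)
corpus <- c("the cat sat on the mat", "the dog chased the cat")
lex <- lexicalize(corpus, lower = TRUE)
fit <- lda.collapsed.gibbs.sampler(lex$documents, K = 2, vocab = lex$vocab,
                                   num.iterations = 50, alpha = 0.1, eta = 0.1)
top.topic.words(fit$topics, num.words = 3, by.score = TRUE)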
