Estimation of the Structural Topic Model using semi-collapsed variational
EM. The function takes sparse representation of a document-term matrix, an integer
number of topics, and covariates and returns fitted model parameters.
Covariates can be used in the prior for topic prevalence
, in the
prior for topical content
or both. See an overview of functions in
the package here: stm-package
stm(documents, vocab, K, prevalence = NULL, content = NULL, data = NULL,
init.type = c("Spectral", "LDA", "Random", "Custom"), seed = NULL,
max.em.its = 500, emtol = 1e-05, verbose = TRUE, reportevery = 5,
LDAbeta = TRUE, interactions = TRUE, ngroups = 1, model = NULL,
gamma.prior = c("Pooled", "L1"), sigma.prior = 0, kappa.prior = c("L1",
"Jeffreys"), control = list())
The document term matrix to be modeled. These can be supplied
in the native stm format, a sparse term count matrix with one row
per document and one column per term, or a
quanteda dfm (document-feature matrix) object.
When using the sparse matrix or quanteda format this will include the
vocabulary and, for quanteda, optionally the metadata. If using the native list format,
the object must be a list of with each element corresponding to a document. Each document is represented
as an integer matrix with two rows, and columns equal to the number of unique
vocabulary words in the document. The first row contains the 1-indexed
vocabulary entry and the second row contains the number of times that term
appears. This is similar to the format in the lda
package
except that (following R convention) the vocabulary is indexed from one. Corpora
can be imported using the reader function and manipulated using the
prepDocuments
. Raw texts can be ingested using
textProcessor
. Note that when using quanteda dfm
directly there may be higher memory use (because the texts and metadata are stored
twice). You can convert from quanteda's format directly to our native format
using the quanteda function convert.
Character vector specifying the words in the corpus in the
order of the vocab indices in documents. Each term in the vocabulary index
must appear at least once in the documents. See prepDocuments
for dropping unused items in the vocabulary. If documents
is a
sparse matrix or quanteda dfm object, then vocab
should not
(and must not) be supplied. It is contained already inside the column
names of the matrix.
Typically a positive integer (of size 2 or greater) representing
the desired number of topics. If init.type="Spectral"
you can also
set K=0
to use the algorithm of Lee and Mimno (2014) to set the
number of topics (although unlike the standard spectral initialization this
is not deterministic). Additional detail on choosing the number of topics
below.
A formula containing a single variable, a factor variable or something which can be coerced to a factor indicating the category of the content variable for each document.
an optional data frame containing the prevalence and/or content covariates. If unspecified the variables are taken from the active environment.
The method of initialization, by default the spectral initialization.
Must be either Latent
Dirichlet Allocation ("LDA"), "Random", "Spectral" or "Custom". See details for more
info. If you want to replicate a previous result, see the argument
seed
. For "Custom" see the format described below under the custom.beta
option of the control
parameters.
Seed for the random number generator. stm
saves the seed
it uses on every run so that any result can be exactly reproduced. When
attempting to reproduce a result with that seed, it should be specified
here.
The maximum number of EM iterations. If convergence has not been met at this point, a message will be printed. If you set this to 0 it will return the initialization.
Convergence tolerance. EM stops when the relative change in
the approximate bound drops below this level. Defaults to .00001. You
can set it to 0 to have the algorithm run max.em.its
number of steps.
See advanced options under control
for more options.
A logical flag indicating whether information should be printed to the screen. During the E-step (iteration over documents) a dot will print each time 1% of the documents are completed. At the end of each iteration the approximate bound will also be printed.
An integer determining the intervals at which labels are printed to the screen during fitting. Defaults to every 5 iterations.
a logical that defaults to TRUE
when there are no
content covariates. When set to FALSE
the model performs SAGE style
topic updates (sparse deviations from a baseline).
a logical that defaults to TRUE
. This
automatically includes interactions between content covariates and the
latent topics. Setting it to FALSE
reduces to a model with no
interactive effects.
Number of groups for memoized inference. See details below.
A prefit model object. By passing an stm
object to this
argument you can restart an existing model. See details for more info.
sets the prior estimation method for the prevalence
covariate model. The default Pooled
options uses Normal prior
distributions with a topic-level pooled variance which is given a moderately
regularizing half-cauchy(1,1) prior. The alternative L1
uses
glmnet
to estimate a grouped penalty between L1-L2. If your code is running
slowly immediately after "Completed E-Step" appears, you may want to switch to the
L1
option. See details below.
a scalar between 0 and 1 which defaults to 0. This sets the strength of regularization towards a diagonalized covariance matrix. Setting the value above 0 can be useful if topics are becoming too highly correlated.
sets the prior estimation for the content covariate
coefficients. The default option is the L1
prior. The second option
is Jeffreys
which is markedly less computationally efficient but is
included for backwards compatability. See details for more information on
computation.
a list of additional advanced parameters. See details.
An object of class STM
The corpus mean of topic prevalence and coefficients
Covariance matrix
List containing the log of the word probabilities for each topic.
The settings file. The Seed object will always contain the seed which can be fed as an argument to recover the model.
The vocabulary vector used.
list of convergence elements including the value of the approximate bound on the marginal likelihood at each step.
Number of Documents by Number of Topics matrix of topic proportions.
Matrix of means for the variational distribution of the multivariate normal latent variables used to calculate theta.
The inverse of the sigma matrix.
The time elapsed in seconds
The version number of the package with which the model was estimated.
This is the main function for estimating a Structural Topic Model (STM). STM is an admixture with covariates in both mixture components. Users provide a corpus of documents and a number of topics. Each word in a document comes from exactly one topic and each document is represented by the proportion of its words that come from each of the K topics. These proportions are found in the N (number of documents) by K (user specified number of topics) theta matrix. Each of the K topics are represented as distributions over words. The K-by-V (number of words in the vocabulary) matrix logbeta contains the natural log of the probability of seeing each word conditional on the topic.
The most important user input in parametric topic models is the number of topics. There is no right answer to the appropriate number of topics. More topics will give more fine-grained representations of the data at the potential cost of being less precisely estimated. The number must be at least 2 which is equivalent to a unidimensional scaling model. For short corpora focused on very specific subject matter (such as survey experiments) 3-10 topics is a useful starting range. For small corpora (a few hundred to a few thousand) 5-50 topics is a good place to start. Beyond these rough guidelines it is application specific. Previous applications in political science with medium sized corpora (10k to 100k documents) have found 60-100 topics to work well. For larger corpora 100 topics is a useful default size. Of course, your mileage may vary.
When init.type="Spectral"
and K=0
the number of topics is set
using the algorithm in Lee and Mimno (2014). See vignette for details. We
emphasize here as we do there that this does not estimate the "true" number
of topics and does not necessarily have any particular statistical
properties for consistently estimating the number of topics. It can however
provide a useful starting point.
The model for topical prevalence includes covariates which the analyst
believes may influence the frequency with which a topic is discussed. This
is specified as a formula which can contain smooth terms using splines or by
using the function s
. The response portion of the formula
should be left blank. See the examples. These variables can include
numeric and factor variables. While including variables of class
Dates
or other non-numeric, non-factor types will work in stm
it may not always work for downstream functions such as
estimateEffect
.
The topical convent covariates are those which affect the way in which a topic is discussed. As currently implemented this must be a single variable which defines a discrete partition of the dataset (each document is in one and only one group). We may relax this in the future. While including more covariates in topical prevalence will rarely affect the speed of the model, including additional levels of the content covariates can make the model much slower to converge. This is due to the model operating in the much higher dimensional space of words in dictionary (which tend to be in the thousands) as opposed to topics.
In addition to the default priors for prevalence, we also make use of the
glmnet
package to allow for penalties between the L1 and L2 norm. In
these settings we estimate a regularization path and then select the optimal
shrinkage parameter using a user-tuneable information criterion. By default
selecting the L1
option will apply the L1 penalty selecting the
optimal shrinkage parameter using AIC. The defaults have been specifically
tuned for the STM but almost all the relevant arguments can be changed
through the control structure below. Changing the gamma.enet
parameters allow the user to choose a mix between the L1 and L2 norms. When
set to 1 (as by default) this is the lasso penalty, when set to 0 its the
ridge penalty. Any value in between is a mixture called the elastic net.
The default prior choice for content covariates is now the L1
option.
This uses an approximation framework developed in Taddy (2013) called
Distributed Multinomial Regression which utilizes a factorized poisson
approximation to the multinomial. See Roberts, Stewart and Airoldi (2014)
for details on the implementation here. This is dramatically faster than
previous versions. The old default setting which uses a Jeffreys prior is
also available.
The argument init.type
allows the user to specify an intialization
method. The default
choice, "Spectral"
, provides a deterministic inialization using the
spectral algorithm given in Arora et al 2014. See Roberts, Stewart and
Tingley (2016) for details and a comparison of different approaches.
Particularly when the number of documents is relatively large we highly
recommend the Spectral algorithm which often performs extremely well. Note
that the random seed plays no role in the spectral initialization as it is
completely deterministic (unless using the K=0
or random projection
settings). When the vocab is larger than 10000 terms we use only the most
frequent 10000 terms in creating the initialization. This may case the
first step of the algorithm to have a very bad value of the objective function
but it should quickly stabilize into a good place. You can tweak the exact
number where this kicks in with the maxV
argument inside control. There
appear to be some cases where numerical instability in the Spectral algorithm
can cause differences across machines (particularly Windows machines for some reason).
It should always give exactly the same answer for a given machine but if you are
seeing different answers on different machines, see https://github.com/bstewart/stm/issues/133
for a longer explanation. The other option "LDA"
which uses a few passes
of a Gibbs sampler is perfectly reproducible across machines as long as the seed is set.
Specifying an integer greater than 1 for the argument ngroups
causes
the corpus to be broken into the specified number of groups. Global updates
are then computed after each group in turn. This approach, called memoized
variational inference in Hughes and Sudderth (2013), can lead to more rapid
convergence when the number of documents is large. Note that the memory
requirements scale linearly with the number of groups so this provides a
tradeoff between memory efficiency and speed. The claim of speed here
is based on the idea that increasing the number of global updates should
help the model find a solution in fewer passes through the document set.
However, itt is worth noting that for any particular case the model need
not converge faster and definitely won't converge to the same location.
This functionality should be considered somewhat experimental and we encourage
users to let us know what their experiences are like here in practice.
Models can now be restarted by passing an STM
object to the argument
model
. This is particularly useful if you run a model to the maximum
iterations and it terminates without converging. Note that all the standard
arguments still need to be passed to the object (including any formulas, the
number of topics, etc.). Be sure to change the max.em.its
argument
or it will simply complete one additional iteration and stop.
You can pass a custom initialization of the beta model parameters to stm
.
The control
argument is a list with named components which can be
used to specify numerous additional computational details. Valid components
include:
tau.maxit
Controls the maximum number of
iterations when estimating the prior for content covariates. When the mode
is Jeffreys
, estimation proceeds by iterating between the kappa
vector corresponding to a particular topic and the associated variance tau
before moving on to the next parameter vector. this controls the maximum
number of iterations. It defaults to NULL
effectively enforcing
convergence. When the mode is L1
this sets the maximum number of
passes in the coordinate descent algorithm and defaults to 1e8.
tau.tol
Sets the convergence tolerance in the optimization
for content covariates. When the mode is Jeffreys
this sets the
convergence tolerance in the iteration between the kappa vector and
variances tau and defaults to 1e-5. With L1
it defaults to 1e-6.
kappa.mstepmaxit
When the mode for content covariate
estimation is Jeffreys
this controls the maximum number of passes
through the sequence of kappa vectors. It defaults to 3. It has no role
under L1
- see tau.maxit
option instead.
kappa.msteptol
When the mode for content covariate estimation
is Jeffreys
this controls the tolerance for convergence (measured by
the L1 norm) for the entire M-step. It is set to .01 by default. This has
no role under mode L1
- see tau.tol
option instead.
fixedintercept
a logical indicating whether in content covariate models the intercept should be fixed to the background distribution. TRUE by default. This only applies when kappa.prior is set to L1. If FALSE the intercept is estimated from the data without penalty. In practice estimated intercepts often push term probabilities to zero, resulting in topics that look more like those in a Dirichlet model- that is, most terms have approximately zero probability with some terms with high probability.
kappa.enet
When using the L1 mode for content
covariates this controls the elastic net mixing parameter. See the argument
alpha
in glmnet
. Value must be between 1 and 0 where 1 is the
lasso penalty (the default) and 0 is the ridge penalty. The closer the
parameter is to zero the less sparse the solution will tend to be.
gamma.enet
Controls the elastic net mixing parameter for the prevalence covariates. See above for a description.
gamma.ic.k
For L1 mode prevalence covariates this controls the
selection of the regularization parameter. We use a generic information criterion
which penalizes complexity by the parameter ic.k
.
When set to 2 (as by default) this results in AIC. When set to log(n)
(where n is the total number of documents in the corpus) this is equivalent to BIC.
Larger numbers will express a preference for sparser (simpler) models.
gamma.maxits
An integer indicating the maximum number of iterations that the prevalence regression variational algorithm can run before erroring out. Defaults to 1000.
nlambda
Controls the length of the regularization path when using L1 mode for content covariates. Defaults to 500. Note that glmnet relies heavily on warm starts and so a high number will often (counter-intuitively) be less costly than a low number. We have chosen a higher default here than the default in the glmnet package and we don't recommend changing it.
lambda.min.ratio
For L1 mode content covariates this controls the explored path of regularization values. This defaults to .0001. Setting higher numbers will result in more sparse solutions. This is here primarily for dealing with convergence issues, if you want to favor selection of sparser solutions see the next argument.
ic.k
For L1 mode content covariates this controls the
selection of the regularization parameter. We use a generic information
criterion which penalizes complexity by the parameter ic.k
. When set
to 2 (as by default) this results in AIC. When set to log(n) (where n is
the total number of words in the corpus) this is equivalent to BIC. Larger
numbers will express a preference for sparser (simpler) models.
nits
Sets the number of iterations for collapsed gibbs sampling in LDA initializations. Defaults to 50
burnin
Sets the burnin for collapsed gibbs sampling in LDA intializations. Defaults to 25
alpha
Sets the prevalence hyperparameter in collapsed gibbs sampling in LDA initializations. Defaults to 50/K
eta
Sets the topic-word hyperparameter in collapsed gibbs sampling in LDa intializations. Defaults to .01
contrast
A logical indicating whether a standard contrast coding should be used for content covariates. Typically this should remain at the default of FALSE.
rp.s
Parameter between 0 and 1 controlling the sparsity of random projections for the spectral initailization. Defaults to .05
rp.p
Dimensionality of the random projections for the spectral initialization. Defaults to 3000.
rp.d.group.size
Controls the size of blocks considered at a time when computing the random projections for the spectral initialization. Defaults to 2000.
SpectralRP
A logical which when
TRUE
turns on the experimental random projections spectral
initialization.
maxV
For spectral initializations this will set the maximum number of words to be used in the initialization. It uses the most frequent words first and then they are reintroduced following initialization. This allows spectral to be used with a large V.
recoverEG
Set to codeTRUE by default. If set to FALSE
will solve the recovery problem in the Spectral algorithm using a downhill simplex
method. See https://github.com/bstewart/stm/issues/133 for more discussion.
allow.neg.change
A logical indicating whether the algorithm is allowed
to declare convergence when the change in the bound has become negative.
Defaults to TRUE
. Set to FALSE
to keep the algorithm from converging
when the bound change is negative. NB: because this is
only an approximation to the lower-bound the change can be negative at times. Right
now this triggers convergence but the final approximate bound might go higher if you
are willing to wait it out. The logic of the default setting is that a negative change
in the bound usually means it is barely moving at all.
custom.beta
If init.type="Custom"
you can pass your own initialization
of the topic-word distributions beta to use as an initialization. Please note that this takes
some care to be sure that it is provided in exactly the right format. The number of topics and
vocab must match exactly. The vocab must be in the same order. The values must not be pathological
(for instance setting the probability of a single word to be 0 under all topics). The beta should be
formatted in the same way as the piece of a returned stm model object stmobj$beta$logbeta
.
It should be a list of length the number of levels of the content covariate. Each element of the list
is a K by V matrix containing the logged word probability conditional on the topic. If you use this
option we recommend that you use max.em.its=0
with the model initialization set to random, inspect
the returned form of stmobj$beta$logbeta
and ensure that it matches your format.
tSNE_init.dims
The K=0 spectral setting uses tSNE to create a low-dimensional projection of the vocab co-occurence matrix. tSNE starts with a PCA projection as an initialization. We actually do the projection outside the tSNE code so we can use a randomized projection approach. We use the 50 dimensional default of the rtsne package. That can be changed here.
tSNE_perplexity
The Rtsne
function in the rtsne package uses a perplexity
parameter. This defaults to 30 and can throw an error when too high. stm
will automatically lower
the parameter for you until it works, but it can also be directly set here.
Roberts, M., Stewart, B., Tingley, D., and Airoldi, E. (2013) "The structural topic model and applied social science." In Advances in Neural Information Processing Systems Workshop on Topic Models: Computation, Application, and Evaluation. http://goo.gl/uHkXAQ
Roberts M., Stewart, B. and Airoldi, E. (2016) "A model of text for experimentation in the social sciences" Journal of the American Statistical Association.
Roberts, M., Stewart, B., Tingley, D., Lucas, C., Leder-Luis, J., Gadarian, S., Albertson, B., et al. (2014). Structural topic models for open ended survey responses. American Journal of Political Science, 58(4), 1064-1082. http://goo.gl/0x0tHJ
Roberts, M., Stewart, B., & Tingley, D. (2016). "Navigating the Local Modes of Big Data: The Case of Topic Models. In Data Analytics in Social Science, Government, and Industry." New York: Cambridge University Press.
# NOT RUN {
# }
# NOT RUN {
#An example using the Gadarian data. From Raw text to fitted model using
#textProcessor() which leverages the tm Package
temp<-textProcessor(documents=gadarian$open.ended.response,metadata=gadarian)
out <- prepDocuments(temp$documents, temp$vocab, temp$meta)
set.seed(02138)
mod.out <- stm(out$documents, out$vocab, 3,
prevalence=~treatment + s(pid_rep), data=out$meta)
#The same example using quanteda instead of tm via textProcessor()
#Note this example works with quanteda version 0.9.9-31 and later
require(quanteda)
gadarian_corpus <- corpus(gadarian, text_field = "open.ended.response")
gadarian_dfm <- dfm(gadarian_corpus,
remove = stopwords("english"),
stem = TRUE)
stm_from_dfm <- stm(gadarian_dfm, K = 3, prevalence = ~treatment + s(pid_rep),
data = docvars(gadarian_corpus))
#An example of restarting a model
mod.out <- stm(out$documents, out$vocab, 3, prevalence=~treatment + s(pid_rep),
data=out$meta, max.em.its=5)
mod.out2 <- stm(out$documents, out$vocab, 3, prevalence=~treatment + s(pid_rep),
data=out$meta, model=mod.out, max.em.its=10)
# }
Run the code above in your browser using DataLab