Discards models with low likelihood values based on a small number of EM iterations (the cast net stage), then calculates semantic coherence, exclusivity, and sparsity (based on a default STM run using the selected convergence criteria) so that the user can choose among the models with high likelihood values.
selectModel(documents, vocab, K, prevalence = NULL, content = NULL,
data = NULL, max.em.its = 100, verbose = TRUE, init.type = "LDA",
emtol = 1e-05, seed = NULL, runs = 50, frexw = 0.7,
net.max.em.its = 2, netverbose = FALSE, M = 10, N = NULL,
to.disk = F, ...)
The documents to be modeled. The object must be a list with each element corresponding to a document. Each document is represented as an integer matrix with two rows and as many columns as there are unique vocabulary words in the document. The first row contains the 1-indexed vocabulary entry and the second row contains the number of times that term appears. This is similar to the format in the lda package except that (following R convention) the vocabulary is indexed from one. Corpora can be imported using the reader function and manipulated using prepDocuments.
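To make the document format concrete, the following is a minimal base R sketch that builds two documents by hand. The toy tokens and the helper `to_stm_doc` are hypothetical and not part of stm; in practice the package's own import functions would produce this structure.

```r
# Hypothetical toy corpus: two tokenized documents (made-up data).
tokens <- list(
  c("tax", "cut", "tax", "vote"),
  c("vote", "border", "vote")
)

# Vocabulary as a character vector; positions are the 1-indexed term ids.
vocab <- sort(unique(unlist(tokens)))

# Hypothetical helper: convert one token vector into a 2-row integer matrix.
# Row 1 holds the vocabulary index, row 2 the count of that term.
to_stm_doc <- function(doc_tokens, vocab) {
  counts <- as.integer(table(factor(doc_tokens, levels = vocab)))
  idx <- which(counts > 0L)  # only terms that occur in this document
  matrix(c(idx, counts[idx]), nrow = 2, byrow = TRUE)
}

documents <- lapply(tokens, to_stm_doc, vocab = vocab)
print(documents[[1]])
```

Each matrix has one column per unique term in that document, so documents of different lengths yield matrices of different widths.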
Character vector specifying the words in the corpus in the order of the vocab indices in documents. Each term in the vocabulary index must appear at least once in the documents. See prepDocuments for dropping unused items in the vocabulary.
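The requirement that every vocabulary term appears at least once can be checked by hand. The sketch below uses made-up inputs and only base R; it is an illustration of the constraint, not how prepDocuments is implemented.

```r
# Hypothetical inputs in the two-row matrix format (toy values).
vocab <- c("border", "cut", "tax", "vote", "unused")
documents <- list(
  matrix(c(2L, 3L, 4L, 1L, 2L, 1L), nrow = 2, byrow = TRUE),
  matrix(c(1L, 4L, 1L, 2L), nrow = 2, byrow = TRUE)
)

# Collect every vocabulary index that occurs in some document (row 1).
used <- sort(unique(unlist(lapply(documents, function(d) d[1, ]))))

# Terms whose index never occurs violate the requirement.
unused_terms <- vocab[setdiff(seq_along(vocab), used)]
print(unused_terms)
```

Here the fifth term never occurs, so it would need to be dropped and the indices kept consistent before modeling.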
A positive integer (2 or greater) representing the desired number of topics. Additional detail on choosing the number of topics is given in details.
A formula object with no response variable or a matrix containing topic prevalence covariates. Use s(), ns() or bs() to specify smooth terms. See details for more information.
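As a small illustration of a prevalence formula with no response variable, the base R sketch below builds one and checks that its variables exist in a metadata frame. The data frame `meta` and its values are made up; the covariate names are borrowed from the example at the end of this page.

```r
# One-sided formula: no response on the left-hand side.
# s() is only recorded here, not evaluated, so this runs without stm loaded.
prev <- ~ treatment + s(pid_rep)

# A one-sided formula has length 2 (the `~` and the right-hand side).
is_one_sided <- length(prev) == 2

# Hypothetical metadata with one row per document (made-up values).
meta <- data.frame(treatment = c(0, 1, 1, 0), pid_rep = c(1, 3, 5, 7))

# Variables referenced by the formula that are absent from the metadata.
missing_vars <- setdiff(all.vars(prev), names(meta))
print(is_one_sided)
print(missing_vars)
```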
A formula containing a single variable, a factor variable, or something that can be coerced to a factor, indicating the category of the content variable for each document.
Dataset which contains prevalence and content covariates.
The maximum number of EM iterations. If convergence has not been met at this point, a message will be printed.
A logical flag indicating whether information should be printed to the screen.
The method of initialization. Must be either Latent Dirichlet Allocation (LDA), Dirichlet Multinomial Regression Topic Model (DMR), a random initialization or a previous STM object.
Convergence tolerance. EM stops when the relative change in the approximate bound drops below this level. Defaults to .001%.
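The stopping rule can be sketched as follows. The helper `has_converged` and the bound values are hypothetical, and the exact form of the relative change used internally is an assumption; only the default tolerance of 1e-05 comes from the usage above.

```r
# Hypothetical helper: relative change in the bound between two EM iterations.
# Assumes the change is measured relative to the absolute value of the old bound.
has_converged <- function(old_bound, new_bound, emtol = 1e-05) {
  rel_change <- (new_bound - old_bound) / abs(old_bound)
  rel_change < emtol
}

# Made-up bound values for illustration.
print(has_converged(-1000, -990))      # relative change of 1e-02
print(has_converged(-1000, -999.999))  # relative change of about 1e-06

# The default tolerance expressed as a percentage.
print(1e-05 * 100)
```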
Seed for the random number generator. stm saves the seed it uses on every run so that any result can be exactly reproduced. Setting the seed here simply ensures that the sequence of models will be exactly the same when respecified. Individual seeds can be retrieved from the component model objects.
Total number of STM runs used in the cast net stage. Only the highest-likelihood runs (N of them, by default 20 percent of runs) are then run until convergence.
Weight used to calculate exclusivity.
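One common way such a weight enters is through a FREX-style score, a weighted harmonic mean of a word's exclusivity rank and frequency rank within a topic. The sketch below is written under that assumption and is not taken from the package source; the function `frex_scores` and the matrix `beta` are hypothetical.

```r
# Hypothetical topic-word probabilities: rows = topics, columns = words (made up).
beta <- rbind(
  c(0.50, 0.30, 0.15, 0.05),
  c(0.05, 0.30, 0.15, 0.50)
)

# Hypothetical helper: FREX-style scores with weight w on exclusivity.
frex_scores <- function(beta, w = 0.7) {
  # Exclusivity: share of each word's total probability held by each topic.
  excl <- sweep(beta, 2, colSums(beta), "/")
  # Rank-based empirical CDF within each topic, with values in (0, 1].
  ecdf_rank <- function(x) rank(x) / length(x)
  ex <- t(apply(excl, 1, ecdf_rank))
  fr <- t(apply(beta, 1, ecdf_rank))
  # Weighted harmonic mean of the two ranks.
  1 / (w / ex + (1 - w) / fr)
}

frex <- frex_scores(beta, w = 0.7)
print(round(frex, 3))
```

A weight closer to 1 favors words that are exclusive to a topic, while a weight closer to 0 favors words that are simply frequent in it.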
Maximum number of EM iterations used when casting the net.
A logical flag indicating whether information should be printed to the screen when calculating the net models.
Number of words used to calculate semantic coherence and exclusivity. Defaults to 10.
Total number of models to retain in the end. Defaults to 20% of runs.
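The cast net stage can be pictured as a simple filter on likelihood. The sketch below uses simulated likelihood values and assumes the default is computed as `round(0.2 * runs)`; the variable names are hypothetical and this is not the package's internal code.

```r
runs <- 50
N <- round(0.2 * runs)  # assumed form of the documented default (20% of runs)

# Made-up likelihood bounds standing in for the short cast net runs.
set.seed(1)
net_bounds <- rnorm(runs, mean = -5000, sd = 50)

# Keep the N runs with the highest likelihood; the rest are discarded.
keep <- order(net_bounds, decreasing = TRUE)[seq_len(N)]
print(N)
print(keep)
```

Only the retained runs would then be fitted to convergence and scored on semantic coherence and exclusivity.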
Boolean. If TRUE, each model is saved to disk in the current directory in a separate RData file. This is most useful if one needs to run multiSTM() on a large number of output models.
Additional options described in details of stm.
List of model outputs the user has to choose from. Each takes the same form as the output from an stm model.
Semantic coherence values for each topic within each model in runout.
Exclusivity values for each topic within each model in runout. Only calculated for models without a content covariate.
Percent sparsity for the covariate and interaction kappas for models with a content covariate.
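One way to compare the retained models is to average the per-topic values within each model. The sketch below works on a mock object with made-up numbers; the component names `semcoh` and `exclusivity`, and the assumption that each is a list with one numeric vector per model, should be checked against the actual return value.

```r
# Mock object mimicking the assumed return structure (values are made up).
mod.out <- list(
  semcoh = list(c(-60, -75, -58), c(-55, -70, -62)),
  exclusivity = list(c(8.9, 9.1, 9.3), c(9.0, 8.7, 9.2))
)

# One row per retained model, averaging over its topics.
summary_tab <- data.frame(
  model = seq_along(mod.out$semcoh),
  mean_semcoh = sapply(mod.out$semcoh, mean),
  mean_exclusivity = sapply(mod.out$exclusivity, mean)
)
print(summary_tab)
```

In this toy case the two criteria disagree, which is the kind of trade-off the user is expected to judge when picking a model from runout.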
# NOT RUN {
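library(stm)  # provides gadarian, textProcessor, prepDocuments, selectModel, plotModels
# Process the open-ended responses and keep the metadata aligned with them.
temp <- textProcessor(documents = gadarian$open.ended.response, metadata = gadarian)
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents
# Prepare the corpus for modeling and re-index the vocabulary.
out <- prepDocuments(docs, vocab, meta)
docs <- out$documents
vocab <- out$vocab
meta <- out$meta
set.seed(02138)
# Cast the net over 5 short runs, then fit and score the retained models.
mod.out <- selectModel(docs, vocab, K = 3, prevalence = ~treatment + s(pid_rep),
                       data = meta, runs = 5)
plotModels(mod.out)
# Pick the first retained model for further analysis.
selected <- mod.out$runout[[1]]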
# }