manyTopics: Performs model selection across separate STMs that each assume a different number of topics.

Description

Works like selectModel, except that the user specifies a range of topic numbers for which models should be fitted, for example 5, 10, and 15 topics. For each number of topics, selectModel is run multiple times. The output is then passed through a function that picks a Pareto-dominant run of the model in terms of exclusivity and semantic coherence. If multiple runs are candidates (i.e., none weakly dominates the others), a single run is chosen at random from the set of undominated runs.
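The selection step described above can be sketched in base R. This is an illustrative sketch only, not the package's internal code: the `runs` data frame, its values, and the helper `is_dominated` are hypothetical, and it assumes each run is summarised by its mean semantic coherence and mean exclusivity, with higher values being better on both.

```r
# Toy per-run summaries (made-up values for illustration only):
# mean semantic coherence and mean exclusivity for five hypothetical runs.
runs <- data.frame(
  semcoh      = c(-80, -75, -90, -75, -70),
  exclusivity = c(9.2, 9.0, 9.5, 8.8, 8.5)
)

# TRUE if run i is dominated: some other run is at least as good on both
# criteria and strictly better on at least one.
is_dominated <- function(i, d) {
  any(
    d$semcoh >= d$semcoh[i] & d$exclusivity >= d$exclusivity[i] &
      (d$semcoh > d$semcoh[i] | d$exclusivity > d$exclusivity[i])
  )
}

dominated   <- vapply(seq_len(nrow(runs)), is_dominated, logical(1), d = runs)
undominated <- which(!dominated)

# Several runs remain undominated, so pick one of them at random.
set.seed(1)
chosen <- undominated[sample.int(length(undominated), 1)]
```

In this toy data only run 4 is dominated (run 2 matches its coherence and has higher exclusivity), so the random draw is made among the remaining four runs.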

Usage

manyTopics(documents, vocab, K, prevalence = NULL, content = NULL,
  data = NULL, max.em.its = 100, verbose = TRUE, init.type = "LDA",
  emtol = 1e-05, seed = NULL, runs = 50, frexw = 0.7,
  net.max.em.its = 2, netverbose = FALSE, M = 10, ...)

Arguments

documents

The documents to be modeled. Object must be a list with each element corresponding to a document. Each document is represented as an integer matrix with two rows and one column per unique vocabulary word in the document. The first row contains the 1-indexed vocabulary entry and the second row contains the number of times that term appears.

This is similar to the format in the lda package except that (following R convention) the vocabulary is indexed from one. Corpora can be imported using the reader function and manipulated using prepDocuments.
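As a minimal sketch of this format, the following builds a two-document corpus by hand in base R. The vocabulary and counts are invented for illustration; in practice the list would normally come from textProcessor and prepDocuments.

```r
# Hypothetical four-word vocabulary (illustration only).
vocab <- c("economy", "immigration", "tax", "vote")

# Each document: row 1 = 1-indexed vocabulary entries, row 2 = term counts.
documents <- list(
  rbind(c(1L, 3L), c(2L, 1L)),         # "economy" x2, "tax" x1
  rbind(c(2L, 3L, 4L), c(1L, 1L, 3L))  # "immigration" x1, "tax" x1, "vote" x3
)

# Basic shape check: every element is a two-row integer matrix.
ok <- vapply(
  documents,
  function(d) is.matrix(d) && is.integer(d) && nrow(d) == 2L,
  logical(1)
)

# Recover the words that occur in the first document.
words_doc1 <- vocab[documents[[1]][1, ]]
```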

vocab

Character vector specifying the words in the corpus in the order of the vocab indices in documents. Each term in the vocabulary index must appear at least once in the documents. See prepDocuments for dropping unused items in the vocabulary.
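The requirement that every vocabulary entry appears in at least one document can be checked with a few lines of base R. This is only a sketch on invented toy data; prepDocuments remains the documented way to drop unused terms.

```r
# Toy inputs (illustration only): two of the four terms are never used.
vocab <- c("economy", "immigration", "tax", "vote")
documents <- list(
  rbind(c(1L, 3L), c(2L, 1L)),
  rbind(c(3L), c(4L))
)

# Collect every vocabulary index that occurs in row 1 of any document.
used <- sort(unique(unlist(lapply(documents, function(d) d[1, ]))))

# Indices in the vocabulary that no document refers to.
unused <- setdiff(seq_along(vocab), used)
unused_terms <- vocab[unused]
```

A non-empty `unused_terms` indicates that the vocabulary should be cleaned before fitting.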

K

A vector of positive integers representing the desired number of topics for separate runs of selectModel.

prevalence

A formula object with no response variable or a matrix containing topic prevalence covariates. Use s(), ns() or bs() to specify smooth terms. See details for more information.

content

A formula containing a single variable, a factor variable or something which can be coerced to a factor indicating the category of the content variable for each document.

data

Dataset which contains prevalence and content covariates.

max.em.its

The maximum number of EM iterations. If convergence has not been met at this point, a message will be printed.

verbose

A logical flag indicating whether information should be printed to the screen.

init.type

The method of initialization. See stm.

emtol

Convergence tolerance.

seed

Seed for the random number generator. stm saves the seed it uses on every run so that any result can be exactly reproduced. When attempting to reproduce a result with that seed, it should be specified here.

runs

Total number of STM runs used in the cast net stage. Approximately 15 percent of these runs will be used for running an STM until convergence.

frexw

Weight used to calculate exclusivity.

net.max.em.its

Maximum EM iterations used when casting the net.

netverbose

Whether verbose should be used when calculating net models.

M

Number of words used to calculate semantic coherence and exclusivity. Defaults to 10.

…

Additional options described in details of stm.

Value

out

List of model outputs the user has to choose from. Each takes the same form as the output from an stm model.

semcoh

Semantic coherence values for each topic within each model selected for each number of topics.

exclusivity

Exclusivity values for each topic within each model selected. Only calculated for models without a content covariate.
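One way to compare the selected models across topic numbers is to average the per-topic values. The sketch below uses a stand-in `storage` list with made-up numbers rather than a fitted result, and it assumes that `semcoh` and `exclusivity` are lists holding one numeric vector per value of K, in the same order as K.

```r
# Stand-in for a manyTopics() result with K = 3:4 (values are made up).
storage <- list(
  semcoh      = list(c(-85, -92, -78), c(-88, -95, -80, -101)),
  exclusivity = list(c(9.1, 8.7, 9.4), c(9.3, 8.9, 9.5, 9.0))
)
K <- 3:4

# One row per number of topics, with the per-topic values averaged.
summary_by_K <- data.frame(
  K                = K,
  mean_semcoh      = vapply(storage$semcoh, mean, numeric(1)),
  mean_exclusivity = vapply(storage$exclusivity, mean, numeric(1))
)
```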

Details

Does not work with models that have a content variable (at this point).

Examples

# NOT RUN {
temp<-textProcessor(documents=gadarian$open.ended.response,metadata=gadarian)
meta<-temp$meta
vocab<-temp$vocab
docs<-temp$documents
out <- prepDocuments(docs, vocab, meta)
docs<-out$documents
vocab<-out$vocab
meta <-out$meta

set.seed(02138)
storage <- manyTopics(docs, vocab, K = 3:4, prevalence = ~treatment + s(pid_rep),
                      data = meta, runs = 10)
# The single STM run that was selected from the runs of the 3-topic model
model_k3 <- storage$out[[1]]
# The single STM run that was selected from the runs of the 4-topic model
model_k4 <- storage$out[[2]]
# Please note that the way to extract a result from manyTopics is different from selectModel.
# }