multiSTM: Analyze Stability of Local STM Mode

Description

This function performs a suite of tests aimed at assessing the global behavior of an STM model, which may have multiple modes. The function takes in a collection of differently initialized STM fitted objects and selects a reference model against which all others are benchmarked for stability. The function returns an output of S3 class 'MultimodDiagnostic', with associated plotting methods for quick inspection of the test results.

Usage

multiSTM(
  mod.out = NULL,
  ref.model = NULL,
  align.global = FALSE,
  mass.threshold = 1,
  reg.formula = NULL,
  metadata = NULL,
  reg.nsims = 100,
  reg.parameter.index = 2,
  verbose = TRUE,
  from.disk = FALSE
)

Value

An object of 'MultimodDiagnostic' S3 class, consisting of a list with the following components:

N: The number of fitted models in the list of model outputs that was supplied to the function for the purpose of stability analysis.
K: The number of topics in the models.
glob.max: The index of the reference model in the list of model outputs (mod.out) that was supplied to the function. The reference model is selected as the one with the maximum bound value at convergence.
lb: A list of the maximum bound value at convergence for each of the fitted models in the list of model outputs. The list has length N.
lmat: A K-by-N matrix reporting the L1-distance of each topic from the corresponding one in the reference model. This is defined as: $$L_{1}=\sum_{v}|\beta_{k,v}^{ref}-\beta_{k,v}^{cand}|$$ Where the beta matrices are the topic-word matrices for the reference and the candidate model.
tmat: A K-by-N matrix reporting the number of "top documents" shared by the reference model and the candidate model. The "top documents" for a given topic are defined as the 10 documents in the reference corpus with highest topical frequency.
wmat: A K-by-N matrix reporting the number of "top words" shared by the reference model and the candidate model. The "top words" for a given topic are defined as the 10 highest-frequency words.
lmod: A vector of length N consisting of the row sums of the lmat matrix.
tmod: A vector of length N consisting of the row sums of the tmat matrix.
wmod: A vector of length N consisting of the row sums of the wmat matrix.
semcoh: Semantic coherence values for each topic within each model in the list of model outputs.
L1mat: A K-by-N matrix reporting the limited-mass L1-distance of each topic from the corresponding one in the reference model. Similar to lmat, but computed using only the top portion of the probability mass for each topic, as specified by the mass.threshol parameter. NULL if mass.treshold==1.
L1mod: A vector of length N consisting of the row means of the L1mat matrix.
mass.threshold: The mass threshold argument that was supplied to the function.
cov.effects: A list of length N containing the output of the run of estimateEffect() on each candidate model with the given regression formula. NULL if no regression formula is given.
var.matrix: A K-by-N matrix containing the estimated variance for each of the fitted regression parameters. NULL if no regression formula is given.
confidence.ratings: A vector of length N, where each entry specifies the proportion of regression coefficient estimates in a candidate model that fall within the .95 confidence interval for the corresponding estimate in the reference model.
align.global: The alignment control argument that was supplied to the function.
reg.formula: The regression formula that was supplied to the function.
reg.nsims: The reg.nsims argument that was supplied to the function.
reg.parameter.index: The reg.parameter.index argument that was supplied to the function.

Arguments

mod.out: The output of a selectModel() run. This is a list of model outputs the user has to choose from, which all take the same form as the output from a STM model. Currently only works with models without content covariates.
ref.model: An integer referencing the element of the list in mod.out which contains the desired reference model. When set to the default value of NULL this chooses the model with the largest value of the approximate variational bound.
align.global: A boolean parameter specifying how to align the topics of two different STM fitted models. The alignment is performed by solving the linear sum assignment problem using the Hungarian algorithm. If align.global is set to TRUE, the Hungarian algorithm is run globally on the topic-word matrices of the two models that are being compared. The rows of the matrices are aligned such as to minimize the sum of their inner products. This results in each topic in the current runout being matched to a unique topic in the reference model. If align.global is, conversely, set to FALSE, the alignment problem is solved locally. Each topic in the current runout is matched to the one topic in the reference models that yields minimum inner product. This means that multiple topics in the current runout can be matched to a single topic in the reference model, and does not guarantee that all the topics in the reference model will be matched.
mass.threshold: A parameter specifying the portion of the probability mass of topics to be used for model analysis. The tail of the probability mass is disregarded accordingly. If mass.threshold is different from 1, both the full-mass and partial-mass analyses are carried out.
reg.formula: A formula for estimating a regression for each model in the ensemble, where the documents are the units, the outcome is the proportion of each document about a topic in an STM model, and the covariates are the document-level metadata. The formula should have an integer or a vector of numbers on the left-hand side, and an equation with covariates on the right-hand side. If the left-hand side is left blank, the regression is performed on all topics in the model. The formula is exclusively used for building calls to estimateEffect(), so see the documentation for estimateEffect() for greater detail about the regression procedure. If reg.formula is null, the covariate effect stability analysis routines are not performed. The regressions incorporate uncertainty by using an approximation to the average covariance matrix formed using the global parameters.
metadata: A dataframe where the predictor variables in reg.formula can be found. It is necessary to include this argument if reg.formula is specified.
reg.nsims: The number of simulated draws from the variational posterior for each call of estimateEffect(). Defaults to 100.
reg.parameter.index: If reg.formula is specified, the function analyzes the stability across runs of the regression coefficient for one particular predictor variable. This argument specifies which predictor variable is to be analyzed. A value of 1 corresponds to the intercept, a value of 2 correspond to the first predictor variable in reg.formula, and so on. Support for multiple concurrent covariate effect stability analyses is forthcoming.
verbose: If set to TRUE, the function will report progress.
from.disk: If set to TRUE, multiSTM() will load the input models from disk rather than from RAM. This option is particularly useful for dealing with large numbers of models, and is intended to be used in conjunction with the to.disk option of selectModel(). multiSTM() inspects the current directory for RData files.

Author

Antonio Coppola (Harvard University), Brandon Stewart (Princeton University), Dustin Tingley (Harvard University)

Details

The purpose of this function is to automate and generalize the stability analysis routines for topic models that are introduced in Roberts, Margaret E., Brandon M. Stewart, and Dustin Tingley: "Navigating the Local Modes of Big Data: The Case of Topic Models" (2014). For more detailed discussion regarding the background and motivation for multimodality analysis, please refer to the original article. See also the documentation for plot.MultimodDiagnostic for help with the plotting methods associated with this function.

References

Roberts, M., Stewart, B., & Tingley, D. (2016). "Navigating the Local Modes of Big Data: The Case of Topic Models. In Data Analytics in Social Science, Government, and Industry." New York: Cambridge University Press.

Examples

Run this code


if (FALSE) {

# Example using Gadarian data
temp<-textProcessor(documents=gadarian$open.ended.response, 
                    metadata=gadarian)
meta<-temp$meta
vocab<-temp$vocab
docs<-temp$documents
out <- prepDocuments(docs, vocab, meta)
docs<-out$documents
vocab<-out$vocab
meta <-out$meta
set.seed(02138)
mod.out <- selectModel(docs, vocab, K=3, 
                       prevalence=~treatment + s(pid_rep), 
                       data=meta, runs=20)

out <- multiSTM(mod.out, mass.threshold = .75, 
                reg.formula = ~ treatment,
                metadata = gadarian)
plot(out)

# Same example as above, but loading from disk
mod.out <- selectModel(docs, vocab, K=3, 
                       prevalence=~treatment + s(pid_rep), 
                       data=meta, runs=20, to.disk=T)

out <- multiSTM(from.disk=T, mass.threshold = .75, 
                reg.formula = ~ treatment,
                metadata = gadarian)
}

Run the code above in your browser using DataLab