stm_tidiers: Tidiers for Structural Topic Models from the stm package

Description

Tidy topic models fit by the stm package. The arguments and return values are similar to lda_tidiers.

Usage

# S3 method for STM
tidy(x, matrix = c("beta", "gamma", "theta"),
  log = FALSE, document_names = NULL, ...)
# S3 method for estimateEffect
tidy(x, ...)
# S3 method for STM
augment(x, data, ...)
# S3 method for STM
glance(x, ...)

Arguments

x

A fitted model object from the stm package, returned by either stm or estimateEffect.

matrix

Whether to tidy the beta (per-term-per-topic, default) or gamma/theta (per-document-per-topic) matrix. The stm package calls this the theta matrix, but other topic modeling packages call this gamma.

log

Whether beta/gamma/theta should be returned on a log scale. Defaults to FALSE.

document_names

Optional vector of document names for use with per-document-per-topic tidying

...

Extra arguments, not used

data

For augment, the data given to the stm function, either as a dfm from quanteda or as a tidied table with "document" and "term" columns

Value

tidy returns a tidied version of either the beta or gamma matrix if called on an object from stm or a tidied version of the estimated regressions if called on an object from estimateEffect.

augment must be provided a data argument, either a dfm from quanteda or a table containing one row per original document-term pair, such as is returned by tdm_tidiers, containing columns document and term. It returns that same data as a table with an additional column .topic with the topic assignment for that document-term combination.

glance always returns a one-row table, with columns

k: Number of topics in the model
docs: Number of documents in the model
terms: Number of terms in the model
iter: Number of iterations used
alpha: If an LDA model, the parameter of the Dirichlet distribution for topics over documents
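
The examples for this topic cover tidy only, so the following is a minimal sketch of augment and glance on an STM model. It assumes the dplyr, tidytext, stm, and janeaustenr packages are installed. The object names (austen_counts, assignments, model_summary) are illustrative and are not part of the package.

```r
library(dplyr)
library(tidytext)
library(stm)
library(janeaustenr)

# Build a tidied table with one row per document-term pair, using the
# "document" and "term" column names that augment() expects
austen_counts <- austen_books() %>%
  unnest_tokens(term, text) %>%
  anti_join(stop_words, by = c("term" = "word")) %>%
  mutate(document = as.character(book)) %>%
  count(document, term)

# Fit the model on a sparse document-term matrix cast from the same table
austen_sparse <- cast_sparse(austen_counts, document, term, n)
topic_model <- stm(austen_sparse, K = 12, verbose = FALSE,
                   init.type = "Spectral")

# augment() returns the input rows with an added .topic column
assignments <- augment(topic_model, austen_counts)

# glance() returns a one-row summary of the fitted model
model_summary <- glance(topic_model)
```

Fitting the model on a matrix cast from the same table that is later passed to augment keeps the document and term labels consistent between the two calls.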

Examples

# NOT RUN {
if (requireNamespace("stm", quietly = TRUE)) {
  library(dplyr)
  library(tidytext)
  library(ggplot2)
  library(stm)
  library(janeaustenr)

  austen_sparse <- austen_books() %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words, by = "word") %>%
    count(book, word) %>%
    cast_sparse(book, word, n)
  topic_model <- stm(austen_sparse, K = 12, verbose = FALSE, init.type = "Spectral")

  # tidy the word-topic combinations
  td_beta <- tidy(topic_model)
  td_beta

  # Examine the topics
  td_beta %>%
    group_by(topic) %>%
    top_n(10, beta) %>%
    ungroup() %>%
    ggplot(aes(term, beta)) +
    geom_col() +
    facet_wrap(~ topic, scales = "free") +
    coord_flip()

  # tidy the document-topic combinations, with optional document names
  td_gamma <- tidy(topic_model, matrix = "gamma",
                   document_names = rownames(austen_sparse))
  td_gamma

  # using stm's gadarianFit, we can tidy the result of a model
  # estimated with covariates
  effects <- estimateEffect(1:3 ~ treatment, gadarianFit, gadarian)
  td_estimate <- tidy(effects)
  td_estimate

}
# }