stm_tidiers: Tidiers for Structural Topic Models from the stm package

Description

Tidy topic models fit by the stm package. The arguments and return values are similar to lda_tidiers().

Usage

# S3 method for STM
tidy(
  x,
  matrix = c("beta", "gamma", "theta", "frex", "lift"),
  log = FALSE,
  document_names = NULL,
  ...
)
# S3 method for estimateEffect
tidy(x, ...)
# S3 method for estimateEffect
glance(x, ...)
# S3 method for STM
augment(x, data, ...)
# S3 method for STM
glance(x, ...)

Value

tidy returns a tidied version of either the beta, gamma, FREX, or lift matrix if called on an object from stm::stm(), or a tidied version of the estimated regressions if called on an object from stm::estimateEffect().

glance returns a tibble with exactly one row of model summaries.

augment requires a data argument: either a dfm from quanteda or a table with one row per original document-term pair and columns document and term, such as that returned by tdm_tidiers. It returns the same data with an additional column, .topic, giving the topic assignment for each document-term combination.

Arguments

x

An STM fitted model object from either stm::stm() or stm::estimateEffect()

matrix

Which matrix to tidy:

"beta": the beta matrix (per-term-per-topic, default)
"gamma" or "theta": the gamma/theta matrix (per-document-per-topic); the stm package calls this the theta matrix, but other topic modeling packages call it gamma
"frex": the FREX matrix, for words with high frequency and exclusivity
"lift": the lift matrix, for words with high lift

log

Whether beta/gamma/theta should be on a log scale, default FALSE

document_names

Optional vector of document names for use with per-document-per-topic tidying

...

Extra arguments for tidying, such as w as used in stm::calcfrex()

data

For augment, the data given to the stm function, either as a dfm from quanteda or as a tidied table with "document" and "term" columns

Examples

if (FALSE) { # interactive() || identical(Sys.getenv("IN_PKGDOWN"), "true")
library(dplyr)
library(ggplot2)
library(stm)
library(janeaustenr)
library(tidytext)  # provides unnest_tokens(), stop_words, cast_sparse(), and the tidiers

austen_sparse <- austen_books() %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words) %>%
    count(book, word) %>%
    cast_sparse(book, word, n)
topic_model <- stm(austen_sparse, K = 12, verbose = FALSE)

# tidy the word-topic combinations
td_beta <- tidy(topic_model)
td_beta

# Examine the topics
td_beta %>%
    group_by(topic) %>%
    slice_max(beta, n = 10) %>%
    ungroup() %>%
    ggplot(aes(beta, term)) +
    geom_col() +
    facet_wrap(~ topic, scales = "free")

# high FREX words per topic
tidy(topic_model, matrix = "frex")

# high lift words per topic
tidy(topic_model, matrix = "lift")

# tidy the document-topic combinations, with optional document names
td_gamma <- tidy(topic_model, matrix = "gamma",
                 document_names = rownames(austen_sparse))
td_gamma

# using stm's gadarianFit, we can tidy the result of a model
# estimated with covariates
effects <- estimateEffect(1:3 ~ treatment, gadarianFit, gadarian)
glance(effects)
td_estimate <- tidy(effects)
td_estimate
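
# A sketch, not part of the original example: augment() assigns a topic to
# each document-term pair. It assumes that renaming the count columns to
# "document" and "term" matches the table format described under Value, and
# it may require the quanteda package to be installed. austen_counts is an
# illustrative name, rebuilt here from the same steps used for austen_sparse.
austen_counts <- austen_books() %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words, by = "word") %>%
    count(book, word) %>%
    rename(document = book, term = word)
augment(topic_model, austen_counts)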
}
