tidytext (version 0.4.2)

stm_tidiers: Tidiers for Structural Topic Models from the stm package

Description

Tidy topic models fit by the stm package. The arguments and return values are similar to lda_tidiers().

Usage

# S3 method for STM
tidy(
  x,
  matrix = c("beta", "gamma", "theta", "frex", "lift"),
  log = FALSE,
  document_names = NULL,
  ...
)

# S3 method for estimateEffect
tidy(x, ...)

# S3 method for estimateEffect
glance(x, ...)

# S3 method for STM
augment(x, data, ...)

# S3 method for STM
glance(x, ...)

Value

tidy returns a tidied version of either the beta, gamma, FREX, or lift matrix if called on an object from stm::stm(), or a tidied version of the estimated regressions if called on an object from stm::estimateEffect().

glance returns a tibble with exactly one row of model summaries.

augment must be provided a data argument, either a dfm from quanteda or a table containing one row per original document-term pair, such as is returned by tdm_tidiers, containing columns document and term. It returns that same data with an additional column .topic with the topic assignment for that document-term combination.

Arguments

x

An STM fitted model object from either stm::stm() or stm::estimateEffect()

matrix

Which matrix to tidy:

  • the beta matrix (per-term-per-topic, default)

  • the gamma/theta matrix (per-document-per-topic); the stm package calls this the theta matrix, but other topic modeling packages call this gamma

  • the FREX matrix, for words with high frequency and exclusivity

  • the lift matrix, for words with high lift
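As a rough illustration of the difference between the beta and gamma matrices, the sketch below tidies both from `gadarianFit`, the example model that the Examples section also takes from stm. It assumes that stm makes `gadarianFit` available once the package is attached. The names `beta_totals` and `gamma_totals` are introduced here for illustration only.

```r
library(stm)
library(tidytext)
library(dplyr)

# gadarianFit is assumed to be available after library(stm)
td_beta  <- tidy(gadarianFit, matrix = "beta")   # one row per topic-term pair
td_gamma <- tidy(gadarianFit, matrix = "gamma")  # one row per document-topic pair

# beta is a distribution over terms within each topic
beta_totals <- td_beta %>%
    group_by(topic) %>%
    summarise(total = sum(beta))

# gamma is a distribution over topics within each document
gamma_totals <- td_gamma %>%
    group_by(document) %>%
    summarise(total = sum(gamma))

beta_totals
gamma_totals
```

If the model behaves as expected, every `total` should be close to 1, which is a quick sanity check that the right matrix was tidied.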

log

Whether beta/gamma/theta should be on a log scale, default FALSE
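A minimal sketch of the log argument, again assuming the `gadarianFit` example model is available once stm is attached. It compares the two scales rather than relying on any particular values; `td_beta` and `td_log_beta` are illustrative names.

```r
library(stm)
library(tidytext)

# Same matrix on the probability scale and on the log scale
td_beta     <- tidy(gadarianFit, matrix = "beta")
td_log_beta <- tidy(gadarianFit, matrix = "beta", log = TRUE)

# Exponentiating the log-scale column should recover the default column
all.equal(exp(td_log_beta$beta), td_beta$beta)
```

The log scale can be handy when term probabilities are very small and differences between them are easier to read as log values.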

document_names

Optional vector of document names for use with per-document-per-topic tidying

...

Extra arguments for tidying, such as w as used in stm::calcfrex()
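The following sketch shows one way extra arguments might be passed through, assuming that `w` is forwarded to stm::calcfrex() as described above and that `gadarianFit` is available once stm is attached. The value 0.9 is an arbitrary choice for illustration, not a recommendation.

```r
library(stm)
library(tidytext)

# FREX with the default weighting
frex_default <- tidy(gadarianFit, matrix = "frex")

# FREX with more weight on exclusivity; w is assumed to be
# forwarded to stm::calcfrex()
frex_exclusive <- tidy(gadarianFit, matrix = "frex", w = 0.9)

head(frex_default)
head(frex_exclusive)
```

Comparing the first few rows of each result gives a feel for how the weighting changes which terms rank highly within a topic.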

data

For augment, the data given to the stm function, either as a dfm from quanteda or as a tidied table with "document" and "term" columns

See Also

lda_tidiers(), stm::calcfrex(), stm::calclift()

Examples

if (FALSE) { # interactive() || identical(Sys.getenv("IN_PKGDOWN"), "true")
library(dplyr)
library(ggplot2)
library(stm)
library(janeaustenr)
library(tidytext)  # provides unnest_tokens(), stop_words, cast_sparse() and the tidiers

austen_sparse <- austen_books() %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words) %>%
    count(book, word) %>%
    cast_sparse(book, word, n)
topic_model <- stm(austen_sparse, K = 12, verbose = FALSE)

# tidy the word-topic combinations
td_beta <- tidy(topic_model)
td_beta

# Examine the topics
td_beta %>%
    group_by(topic) %>%
    slice_max(beta, n = 10) %>%
    ungroup() %>%
    ggplot(aes(beta, term)) +
    geom_col() +
    facet_wrap(~ topic, scales = "free")

# high FREX words per topic
tidy(topic_model, matrix = "frex")

# high lift words per topic
tidy(topic_model, matrix = "lift")

# tidy the document-topic combinations, with optional document names
td_gamma <- tidy(topic_model, matrix = "gamma",
                 document_names = rownames(austen_sparse))
td_gamma

# using stm's gadarianFit, we can tidy the result of a model
# estimated with covariates
effects <- estimateEffect(1:3 ~ treatment, gadarianFit, gadarian)
glance(effects)
td_estimate <- tidy(effects)
td_estimate
}
