tidytext (version 0.3.3)

stm_tidiers: Tidiers for Structural Topic Models from the stm package

Description

Tidy topic models fit by the stm package. The arguments and return values are similar to lda_tidiers.

Usage

# S3 method for STM
tidy(
  x,
  matrix = c("beta", "gamma", "theta"),
  log = FALSE,
  document_names = NULL,
  ...
)

# S3 method for estimateEffect
tidy(x, ...)

# S3 method for estimateEffect
glance(x, ...)

# S3 method for STM
augment(x, data, ...)

# S3 method for STM
glance(x, ...)

Value

tidy returns a tidied version of either the beta or gamma matrix if called on an object from stm or a tidied version of the estimated regressions if called on an object from estimateEffect.

When called on an object from estimateEffect, glance always returns a one-row table, with columns

k

Number of topics in the model

docs

Number of documents in the model

uncertainty

Uncertainty measure

augment must be provided a data argument, either a dfm from quanteda or a table containing one row per original document-term pair, such as is returned by tdm_tidiers, containing columns document and term. It returns that same data as a table with an additional column .topic with the topic assignment for that document-term combination.

When called on an object from stm, glance always returns a one-row table, with columns

k

Number of topics in the model

docs

Number of documents in the model

terms

Number of terms in the model

iter

Number of iterations used

alpha

If an LDA model, the parameter of the Dirichlet distribution for topics over documents

Arguments

x

A fitted model object returned by either stm or estimateEffect from the stm package.

matrix

Whether to tidy the beta (per-term-per-topic, default) or gamma/theta (per-document-per-topic) matrix. The stm package calls this the theta matrix, but other topic modeling packages call this gamma.

log

Whether beta/gamma/theta should be on a log scale, default FALSE

document_names

Optional vector of document names for use with per-document-per-topic tidying

...

Extra arguments, not used

data

For augment, the data given to the stm function, either as a dfm from quanteda or as a tidied table with "document" and "term" columns
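As a rough sketch of how the matrix and log arguments combine, the block below tidies gadarianFit, the prefit model that ships with stm and is also used in the Examples section. It assumes both stm and tidytext are installed and does nothing otherwise. The object names td_beta_log and td_theta are illustrative only.

```r
if (requireNamespace("stm", quietly = TRUE) &&
    requireNamespace("tidytext", quietly = TRUE)) {
  library(stm)
  library(tidytext)

  # Per-term-per-topic values, requested on the log scale
  td_beta_log <- tidy(gadarianFit, matrix = "beta", log = TRUE)

  # Per-document-per-topic values, requested under stm's own name for the matrix
  td_theta <- tidy(gadarianFit, matrix = "theta")

  # One-row summary of the fitted model
  print(glance(gadarianFit))
}
```

Passing matrix = "gamma" instead of "theta" requests the same per-document-per-topic matrix, as described under the matrix argument.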

See Also

lda_tidiers

If matrix == "beta" (default), returns a table with one row per topic and term, with columns

topic

Topic, as an integer

term

Term

beta

Probability of a term generated from a topic according to the structural topic model
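To make the shape of this table concrete without fitting a model, the following base R sketch reshapes a small topic-by-term probability matrix into one row per topic and term. The matrix values, the three terms, and the name td_beta_sketch are invented for illustration; this is not how tidytext builds the table internally.

```r
# Hypothetical topic-by-term probabilities (2 topics, 3 terms); values are invented
beta_mat <- matrix(
  c(0.5, 0.3, 0.2,
    0.1, 0.2, 0.7),
  nrow = 2, byrow = TRUE,
  dimnames = list(NULL, c("love", "letter", "sister"))
)

# One row per topic-term pair, mirroring the columns topic, term, beta.
# as.vector() walks the matrix column by column, so topic cycles fastest.
td_beta_sketch <- data.frame(
  topic = rep(seq_len(nrow(beta_mat)), times = ncol(beta_mat)),
  term  = rep(colnames(beta_mat), each = nrow(beta_mat)),
  beta  = as.vector(beta_mat),
  stringsAsFactors = FALSE
)

# Within each topic the term probabilities sum to one
tapply(td_beta_sketch$beta, td_beta_sketch$topic, sum)
```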

If matrix == "gamma", returns a table with one row per topic and document, with columns

topic

Topic, as an integer

document

Document name (if given in vector of document_names) or ID as an integer

gamma

Probability of topic given document
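A common next step with a table of this shape is to pick the most probable topic for each document. The base R sketch below does that on a hand-written table; the document names, gamma values, and the names td_gamma_sketch and top_topic are hypothetical, chosen only to mirror the columns listed above.

```r
# Hypothetical tidied output: 2 documents x 3 topics; values are invented
td_gamma_sketch <- data.frame(
  document = rep(c("Emma", "Persuasion"), each = 3),
  topic    = rep(1:3, times = 2),
  gamma    = c(0.70, 0.20, 0.10,
               0.05, 0.15, 0.80),
  stringsAsFactors = FALSE
)

# For each document, keep the row with the largest gamma
top_topic <- do.call(
  rbind,
  lapply(
    split(td_gamma_sketch, td_gamma_sketch$document),
    function(d) d[which.max(d$gamma), ]
  )
)
top_topic
```

With a real model, the same split-and-select step could be applied to the result of tidy(x, matrix = "gamma").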

If called on an object from estimateEffect, returns a table with columns

topic

Topic, as an integer

term

The term in the model being estimated and tested

estimate

The estimated coefficient

std.error

The standard error from the linear model

statistic

t-statistic

p.value

two-sided p-value

Examples

if (FALSE) {
if (requireNamespace("stm", quietly = TRUE)) {
  library(dplyr)
  library(tidytext)  # provides unnest_tokens(), stop_words, cast_sparse(), and the tidiers
  library(ggplot2)
  library(stm)
  library(janeaustenr)

  austen_sparse <- austen_books() %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words, by = "word") %>%
    count(book, word) %>%
    cast_sparse(book, word, n)
  topic_model <- stm(austen_sparse, K = 12, verbose = FALSE, init.type = "Spectral")

  # tidy the word-topic combinations
  td_beta <- tidy(topic_model)
  td_beta

  # Examine the topics
  td_beta %>%
    group_by(topic) %>%
    top_n(10, beta) %>%
    ungroup() %>%
    ggplot(aes(term, beta)) +
    geom_col() +
    facet_wrap(~ topic, scales = "free") +
    coord_flip()

  # tidy the document-topic combinations, with optional document names
  td_gamma <- tidy(topic_model, matrix = "gamma",
                   document_names = rownames(austen_sparse))
  td_gamma

  # using stm's gadarianFit, we can tidy the result of a model
  # estimated with covariates
  effects <- estimateEffect(1:3 ~ treatment, gadarianFit, gadarian)
  glance(effects)
  td_estimate <- tidy(effects)
  td_estimate

}
}