predict.LDA_VEM: Predict method for an object of class LDA_VEM or class LDA_Gibbs

Description

Gives either the predictions to which topic a document belongs or the term posteriors by topic indicating which terms are emitted by each topic.
If you provide in newdata a document term matrix for which a document does not contain any text and hence does not have any terms with nonzero entries, the prediction will give as topic prediction NA values (see the examples).

Usage

# S3 method for LDA_VEM
predict(
  object,
  newdata,
  type = c("topics", "terms"),
  min_posterior = -1,
  min_terms = 0,
  labels,
  ...
)
# S3 method for LDA_Gibbs
predict(
  object,
  newdata,
  type = c("topics", "terms"),
  min_posterior = -1,
  min_terms = 0,
  labels,
  ...
)

Value

in case of type = 'topic': a data.table with columns doc_id, topic (the topic number to which the document is assigned to), topic_label (the topic label) topic_prob (the posterior probability score for that topic), topic_probdiff_2nd (the probability score for that topic - the probability score for the 2nd highest topic) and the probability scores for each topic as indicated by topic_labelyourownlabel
n case of type = 'terms': a list of data.frames with columns term and prob, giving the posterior probability that each term is emitted by the topic

Arguments

object: an object of class LDA_VEM or LDA_Gibbs as returned by LDA from the topicmodels package
newdata: a document/term matrix containing data for which to make a prediction
type: either 'topic' or 'terms' for the topic predictions or the term posteriors
min_posterior: numeric in 0-1 range to output only terms emitted by each topic which have a posterior probability equal or higher than min_posterior. Only used if type is 'terms'. Provide -1 if you want to keep all values.
min_terms: integer indicating the minimum number of terms to keep in the output when type is 'terms'. Defaults to 0.
labels: a character vector of the same length as the number of topics in the topic model. Indicating how to label the topics. Only valid for type = 'topic'. Defaults to topic_prob_001 up to topic_prob_999.
...: further arguments passed on to topicmodels::posterior

Examples

Run this code

## Build document/term matrix on dutch nouns
data(brussels_reviews_anno)
data(brussels_reviews)
x <- subset(brussels_reviews_anno, language == "nl")
x <- subset(x, xpos %in% c("JJ"))
x <- x[, c("doc_id", "lemma")]
x <- document_term_frequencies(x)
dtm <- document_term_matrix(x)
dtm <- dtm_remove_lowfreq(dtm, minfreq = 10)
dtm <- dtm_remove_tfidf(dtm, top = 100)

if(require(topicmodels)){
## Fit a topicmodel using VEM
library(topicmodels)
mymodel <- LDA(x = dtm, k = 4, method = "VEM")

## Get topic terminology
terminology <- predict(mymodel, type = "terms", min_posterior = 0.05, min_terms = 3)
terminology

## Get scores alongside the topic model
dtm <- document_term_matrix(x, vocabulary = mymodel@terms)
scores <- predict(mymodel, newdata = dtm, type = "topics")
scores <- predict(mymodel, newdata = dtm, type = "topics", 
                  labels = c("mylabel1", "xyz", "app-location", "newlabel"))
head(scores)
table(scores$topic)
table(scores$topic_label)
table(scores$topic, exclude = c())
table(scores$topic_label, exclude = c())

## Fit a topicmodel using Gibbs
library(topicmodels)
mymodel <- LDA(x = dtm, k = 4, method = "Gibbs")
terminology <- predict(mymodel, type = "terms", min_posterior = 0.05, min_terms = 3)
scores <- predict(mymodel, type = "topics", newdata = dtm)
} # End of main if statement running only if the required packages are installed

Run the code above in your browser using DataLab

Description

Usage

Value

Arguments

See Also

Examples