text2vec (version 0.4.0)

LatentDirichletAllocation: Creates a Latent Dirichlet Allocation model.

Description

Creates a Latent Dirichlet Allocation model.

Usage

LatentDirichletAllocation

LDA

Format

R6Class object.

Fields

verbose

logical = TRUE. Whether to display training information.

Usage

For usage details see the Methods, Arguments and Examples sections.

lda = LatentDirichletAllocation$new(n_topics, vocabulary,
              doc_topic_prior = 1 / n_topics, topic_word_prior = 1 / n_topics)
lda$fit(x, n_iter, convergence_tol = -1, check_convergence_every_n = 0)
lda$fit_transform(x, n_iter, convergence_tol = -1, check_convergence_every_n = 0)
lda$get_word_vectors()

Methods

$new(n_topics, vocabulary, doc_topic_prior = 1 / n_topics, topic_word_prior = 1 / n_topics)

Constructor for the LDA model. For a description of the arguments see the Arguments section.

$fit(x, n_iter, convergence_tol = -1, check_convergence_every_n = 0)

fits the LDA model to the input document-term matrix x

$fit_transform(x, n_iter, convergence_tol = -1, check_convergence_every_n = 0)

fits the LDA model to the input document-term matrix x and transforms the input documents to the topic space

$transform(x, n_iter = 100, convergence_tol = 0.005, check_convergence_every_n = 1)

transforms new documents to the topic space of the fitted model
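
A minimal sketch of transforming unseen documents with an already fitted model. It assumes new_dtm is a document-term matrix built with the same vocabulary and the same 'lda_c' format as the training data (see the Examples section for how such a matrix is created):

new_doc_topic_distr = lda$transform(new_dtm, n_iter = 100,
                                    convergence_tol = 0.005,
                                    check_convergence_every_n = 1)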

$get_word_vectors()

returns the word-topic distribution
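
As an illustration, the word-topic matrix can be used to look up the terms most associated with a given topic. This sketch assumes the returned matrix has one row per vocabulary term (with row names set to the terms) and one column per topic:

# sketch, assuming rows = vocabulary terms, columns = topics
word_topic = lda$get_word_vectors()
# indices of the ten terms with the highest weight in topic 1
top = head(order(word_topic[, 1], decreasing = TRUE), 10)
rownames(word_topic)[top]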

$plot(...)

plots the LDA model using the LDAvis package (https://cran.r-project.org/package=LDAvis). Arguments in ... will be passed to the LDAvis::createJSON and LDAvis::serVis functions

Arguments

lda

An LDA object

x

An input document-term matrix.

n_topics

integer desired number of latent topics. Also known as K

vocabulary

vocabulary, in the form of a character vector or a text2vec_vocab object

doc_topic_prior

numeric prior for the document-topic multinomial distribution. Also known as alpha

topic_word_prior

numeric prior for the topic-word multinomial distribution. Also known as eta

n_iter

integer number of Gibbs iterations

convergence_tol

numeric = -1. Defines the early stopping strategy. Fitting stops when one of the two following conditions is satisfied: (a) all iterations have been used, or (b) perplexity_previous_iter / perplexity_current_iter - 1 < convergence_tol. By default all iterations are performed.
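
To make the stopping rule concrete, here is a small numeric illustration (the perplexity values are made up):

perplexity_previous_iter = 1005
perplexity_current_iter  = 1000
convergence_tol          = 0.01
perplexity_previous_iter / perplexity_current_iter - 1  # 0.005
# 0.005 < 0.01, so fitting would stop early at this convergence check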

check_convergence_every_n

integer. Defines how often perplexity is calculated. In some cases the perplexity calculation during LDA fitting can take a noticeable amount of time, so it can make sense not to calculate it at every iteration.
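
For example, the following call (a sketch reusing the dtm object from the Examples section) evaluates perplexity only every 10 iterations and stops once the relative improvement falls below 0.5%:

lda$fit(dtm, n_iter = 100, convergence_tol = 0.005,
        check_convergence_every_n = 10)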

Examples

# NOT RUN {
library(text2vec)
data("movie_review")
N = 500
# tokenize the first N reviews
tokens = movie_review$review[1:N] %>% tolower %>% word_tokenizer
it = itoken(tokens, ids = movie_review$id[1:N])
# build the vocabulary and prune rare and overly common terms
v = create_vocabulary(it) %>%
  prune_vocabulary(term_count_min = 5, doc_proportion_max = 0.2)
# create a document-term matrix in the 'lda_c' format expected by the model
dtm = create_dtm(it, vocab_vectorizer(v), 'lda_c')
lda_model = LatentDirichletAllocation$new(n_topics = 10, vocabulary = v,
                                          doc_topic_prior = 0.1,
                                          topic_word_prior = 0.1)
# fit the model and obtain the document-topic distribution in one step
doc_topic_distr = lda_model$fit_transform(dtm, n_iter = 20,
                                          check_convergence_every_n = 5)
# run the LDAvis visualisation if needed (make sure the LDAvis package is installed)
# lda_model$plot()
# }
