mallet_lda: A wrapper function for LDA using the MALLET machine learning toolkit -- an incredibly efficient, fast and well tested implementation of LDA. See http://mallet.cs.umass.edu/ and https://github.com/mimno/Mallet for much more information on this amazing set of libraries.

Description

A wrapper function for LDA using the MALLET machine learning toolkit -- an efficient, fast, and well-tested implementation of LDA. See http://mallet.cs.umass.edu/ and https://github.com/mimno/Mallet for more information on this set of libraries.

Usage

mallet_lda(documents = NULL, document_directory = NULL,
  document_csv = NULL, vocabulary = NULL, topics = 10,
  iterations = 1000, burnin = 100, alpha = 1, beta = 0.01,
  hyperparameter_optimization_interval = 0, num_top_words = 20,
  optional_arguments = "",
  tokenization_regex = "[\\p{L}\\p{N}\\p{P}]+", stopword_list = NULL,
  cores = 1, delete_intermediate_files = TRUE, memory = "-Xmx10g",
  only_read_in = FALSE, unzip_command = "gunzip -k",
  return_predictive_distribution = TRUE, use_phrases = TRUE)

Arguments

documents

Optional argument for providing the documents we wish to run LDA on. Can be either a character vector with one string per document, a list object where each entry is an (ordered) document-term vector with one list entry per document, a dense document-term matrix where each row represents a document, each column represents a term in the vocabulary, and entries are document-term counts, or a sparse document term matrix (simple triplet matrix from the slam library) -- preferably generated by quanteda::dfm and then converted using convert_quanteda_to_slam().
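As an illustration of two of the accepted input forms, the sketch below builds a character vector of toy documents and a matching dense document-term matrix using base R only. The toy corpus and the names docs, doc_tokens, vocab, and dtm are assumptions made for this example; they are not part of the package.

```r
# Three toy documents as a character vector (one string per document).
docs <- c("the cat sat on the mat",
          "the dog sat on the log",
          "cats and dogs")

# Split on whitespace to get one ordered token vector per document.
doc_tokens <- strsplit(docs, "\\s+")

# Build the vocabulary, then count each vocabulary term in each document.
vocab <- sort(unique(unlist(doc_tokens)))
dtm <- t(vapply(doc_tokens,
                function(tokens) as.integer(table(factor(tokens, levels = vocab))),
                integer(length(vocab))))
colnames(dtm) <- vocab

# Rows are documents, columns are vocabulary terms, entries are counts.
dim(dtm)
```

Either docs or dtm could then be supplied as the documents argument; when a matrix is used, vocab would be the natural candidate for the vocabulary argument, since it lines up with the matrix columns.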

document_directory

Optional argument specifying a directory containing .txt files (one per document) to be used for LDA. May only be used if documents is NULL.

document_csv

Optional argument specifying the path to a csv file containing one document per line. May only be used if documents and document_directory are NULL.

vocabulary

An optional character vector specifying the vocabulary (required if the user does not wish to use hyperparameter optimization). If a (sparse) document-term matrix is provided, then this must be the same length as the number of columns in the matrix, and its entries should correspond to those columns.

topics

The number of topics to use for LDA. Defaults to 10.

iterations

The number of collapsed Gibbs sampling iterations to run. Defaults to 1000.

burnin

The number of iterations to be discarded before assessing topic model convergence via a Geweke test. Must be less than iterations. This is not a parameter passed to MALLET; it is only used for post-hoc convergence checking. Defaults to 100.
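To make the role of burnin concrete, here is a minimal sketch of a Geweke-style check in base R. It is a simplification, not the package's own code: it compares the mean of an early window with the mean of a late window of the post-burn-in trace using plain sample variances, whereas the full test corrects for autocorrelation (the coda package's geweke.diag() does this). The function name geweke_z and the simulated trace are hypothetical, and the trace is indexed by iteration for simplicity.

```r
# Simplified Geweke-style z-score on a trace after discarding burn-in.
# Assumption: plain sample variances are used, ignoring autocorrelation.
geweke_z <- function(trace, burnin, first = 0.1, last = 0.5) {
  stopifnot(burnin < length(trace))
  kept <- trace[(burnin + 1):length(trace)]
  n <- length(kept)
  early <- kept[seq_len(floor(first * n))]       # first 10% of kept draws
  late <- kept[(n - floor(last * n) + 1):n]      # last 50% of kept draws
  (mean(early) - mean(late)) /
    sqrt(var(early) / length(early) + var(late) / length(late))
}

set.seed(1)
# Hypothetical log-likelihood trace: an upward drift for 100 iterations,
# followed by stationary noise.
trace <- c(seq(-12, -9, length.out = 100), rnorm(900, mean = -9, sd = 0.05))

z_with_burnin <- geweke_z(trace, burnin = 100)  # drift discarded
z_no_burnin <- geweke_z(trace, burnin = 0)      # drift included in early window
```

A large absolute z-score (commonly compared against 1.96) suggests the early and late parts of the chain differ, which is what happens here when the drifting segment is not discarded.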

alpha

The alpha LDA hyperparameter. Defaults to 1.

beta

The beta LDA hyperparameter. This value is multiplied by the size of the vocabulary. Defaults to 0.01, which has worked well for the author in the past.
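Reading the description literally, the per-term beta is scaled by the vocabulary size to give a total. The arithmetic can be sketched as follows; the vocabulary size of 5000 and the name beta_sum are assumptions chosen for illustration only.

```r
beta <- 0.01             # per-term value, as passed to mallet_lda()
vocabulary_size <- 5000  # hypothetical vocabulary size
beta_sum <- beta * vocabulary_size
beta_sum
```

With these numbers the scaled value is 50, so the same beta implies a larger total as the vocabulary grows.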

hyperparameter_optimization_interval

The interval (number of iterations) at which LDA hyperparameters should be optimized. Defaults to 0, meaning no hyperparameter optimization will be performed. If greater than zero, the beta term need not be specified, as it will be optimized regardless. Generally a value of 5-10 works well, and hyperparameter optimization will often provide much better quality topics.

num_top_words

The number of topic top-words returned in the model output. Defaults to 20.

optional_arguments

Allows the user to specify a string with additional arguments for MALLET.

tokenization_regex

Regular expression used for tokenization by MALLET. Defaults to "[\\p{L}\\p{N}\\p{P}]+" (as written in R), meaning that any run of letters, numbers, and punctuation is counted as a token. The user may adapt it, but backslashes must be doubled (\\) because R removes one level of escaping when the command is passed to the console. Another perfectly reasonable choice is "[\\p{L}]+", which only counts letters in tokens.
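The effect of the two patterns, and of the doubled backslashes, can be previewed in R itself. This is only an approximation: it uses R's PCRE engine (perl = TRUE) rather than the Java regex engine MALLET uses, and the helper tokenize and the sample text are hypothetical.

```r
text <- "Hello, world! 42"

# Backslashes are doubled in R source; the string itself holds single ones.
default_regex <- "[\\p{L}\\p{N}\\p{P}]+"
letters_regex <- "[\\p{L}]+"
cat(default_regex, "\n")  # shows the pattern with single backslashes

# Hypothetical helper: return all matches of `regex` in a single string.
tokenize <- function(x, regex) {
  regmatches(x, gregexpr(regex, x, perl = TRUE))[[1]]
}

tokens_default <- tokenize(text, default_regex)  # "Hello," "world!" "42"
tokens_letters <- tokenize(text, letters_regex)  # "Hello" "world"
```

With the default pattern, punctuation stays attached to neighbouring characters and the number is kept; with the letters-only pattern, both punctuation and digits are dropped.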

stopword_list

Defaults to NULL. If not NULL, then a vector of terms to be removed from the input text should be provided. Only implemented when supplying the documents argument.

cores

Number of cores to be used to train the topic model. Defaults to 1.

delete_intermediate_files

Defaults to TRUE. If FALSE, then all raw output from MALLET will be left in a "./mallet_intermediate_files" subdirectory of the current working directory.

memory

The amount of Java heap space to be allocated to MALLET. Defaults to '-Xmx10g', indicating 10GB of RAM will be allocated (at maximum). Users may increase this limit if they are working with an exceptionally large corpus.

only_read_in

Defaults to FALSE. If TRUE, then the function only attempts to read back in files from the completed MALLET run. This can be useful if there was an error reading back in the topic reports (usually caused by unusual symbols in the output).

unzip_command

Defaults to "gunzip -k", which should work on a Mac. This command should be able to unzip a .txt.gz file and keep the original input as a backup, which is what the "-k" option does here.

return_predictive_distribution

Defaults to TRUE, but can be set to FALSE if using a large corpus on a computer with limited RAM.

use_phrases

Defaults to TRUE. When TRUE, the topic phrase reports are returned. If FALSE, they are excluded.

Value

Returns a list object with the following fields:

lda_trace_stats

A data frame reporting the beta hyperparameter value and model log likelihood per token every ten iterations. Can be useful for assessing convergence.

document_topic_proportions

The document topic proportions for all topics.

topic_metadata

The alpha x basemeasure values for all topics, along with the total number of tokens assigned to each topic.

topic_top_words

The 'num_top_words' top words for each topic (in descending order).

topic_top_word_counts

The count of each top word in its respective topic.

topic_top_phrases

The top phrases (as found post-hoc by MALLET) associated with each topic.

topic_top_phrase_counts

The counts of these phrases in each topic.

Examples

# NOT RUN {
files <- get_file_paths(source = "test sparse doc-term")

sdtm <- generate_sparse_large_document_term_matrix(
   file_list = files,
   maximum_vocabulary_size = -1,
   using_document_term_counts = TRUE)

test <- mallet_lda(documents = sdtm,
                  topics = 10,
                  iterations = 1000,
                  burnin = 100,
                  alpha = 1,
                  beta = 0.01,
                  hyperparameter_optimization_interval = 5,
                  cores = 1)
# }