A wrapper function for LDA using the MALLET machine learning toolkit -- an incredibly efficient, fast and well tested implementation of LDA. See http://mallet.cs.umass.edu/ and https://github.com/mimno/Mallet for much more information on this amazing set of libraries.
mallet_lda(documents = NULL, document_directory = NULL,
document_csv = NULL, vocabulary = NULL, topics = 10,
iterations = 1000, burnin = 100, alpha = 1, beta = 0.01,
hyperparameter_optimization_interval = 0, num_top_words = 20,
optional_arguments = "",
tokenization_regex = "[\\p{L}\\p{N}\\p{P}]+", stopword_list = NULL,
cores = 1, delete_intermediate_files = TRUE, memory = "-Xmx10g",
only_read_in = FALSE, unzip_command = "gunzip -k",
return_predictive_distribution = TRUE, use_phrases = TRUE)
Optional argument for providing the documents we wish to run LDA on. Can be either a character vector with one string per document, a list object where each entry is an (ordered) document-term vector with one list entry per document, a dense document-term matrix where each row represents a document, each column represents a term in the vocabulary, and entries are document-term counts, or a sparse document term matrix (simple triplet matrix from the slam library) -- preferably generated by quanteda::dfm and then converted using convert_quanteda_to_slam().
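As a minimal sketch of two of the accepted input shapes, the base R snippet below builds a character vector with one string per document and a dense document-term matrix from it. The toy documents and the whitespace tokenization are assumptions made purely for illustration, and the final call is left commented out because it needs the package and a working MALLET/Java setup.

```r
# Toy corpus (illustrative only): one string per document.
docs <- c("the cat sat on the mat",
          "the dog chased the cat")

# Naive whitespace tokenization, an assumption made for this sketch.
tokens <- strsplit(docs, " ", fixed = TRUE)

# The vocabulary defines the columns of the document-term matrix.
vocabulary <- sort(unique(unlist(tokens)))

# Dense document-term matrix: one row per document, one column per term,
# entries are term counts.
dtm <- t(vapply(
    tokens,
    function(tok) as.integer(table(factor(tok, levels = vocabulary))),
    integer(length(vocabulary))))
colnames(dtm) <- vocabulary

print(dtm)

# Either object could then be supplied as `documents`, for example:
# result <- mallet_lda(documents = dtm, vocabulary = vocabulary, topics = 2)
```

Keeping `vocabulary` in the same order as the matrix columns matters, since the terms are matched to columns by position.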
Optional argument specifying a directory containing .txt files (one per document) to be used for LDA. May only be used if documents is NULL.
Optional argument specifying the path to a csv file containing one document per line. May only be used if documents and document_directory are NULL.
An optional character vector (required if the user wishes to not use hyper parameter optimization) specifying the vocabulary. If a (sparse) document term matrix is provided, then this must be the same length as the number of columns in the matrix, and should correspond to those columns.
The number of topics the user wishes to specify for LDA. Defaults to 10.
The number of collapsed Gibbs sampling iterations the user wishes to specify. Defaults to 1000.
The number of iterations to be discarded before assessing topic model convergence via a Geweke test. Must be less than iterations. Not a parameter passed to MALLET, only used for post-hoc convergence checking. Defaults to 100.
The alpha LDA hyperparameter. Defaults to 1.
The beta LDA hyperparameter. This value is multiplied by the size of the vocabulary. Defaults to 0.01 which has worked well for the author in the past.
The interval (number of iterations) at which LDA hyper-parameters should be optimized. Defaults to 0 -- meaning no hyper parameter optimization will be performed. If greater than zero, the beta term need not be specified as it will be optimized regardless. Generally a value of 5-10 works well and hyper parameter optimization will often provide much better quality topics.
The number of topic top-words returned in the model output. Defaults to 20.
Allows the user to specify a string with additional arguments for MALLET.
Regular expression used for tokenization by MALLET. Defaults to '[\\p{L}\\p{N}\\p{P}]+', meaning that all runs of letters, numbers and punctuation will be counted as tokens. May be adapted by the user, but backslashes must be double escaped (\\) because R removes one level of escaping when the command is passed to the console. Another perfectly reasonable choice is '[\\p{L}]+', which only counts letters in tokens.
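The effect of the escaping, and the difference between the two patterns, can be previewed in plain R. This is only a rough sketch: it uses R's own PCRE engine (`perl = TRUE`) as a stand-in, whereas MALLET applies the pattern with Java's regex engine, so edge cases may differ. The sample sentence is made up for illustration.

```r
# As typed in R source, each backslash is doubled.
default_regex <- "[\\p{L}\\p{N}\\p{P}]+"
letters_only  <- "[\\p{L}]+"

# cat() prints the string after R has removed one level of escaping,
# i.e. the pattern with single backslashes.
cat(default_regex, "\n")

# Illustrative text only.
text <- "Hello, world 42!"

# Extract every match of each pattern from the sample text.
default_tokens <- regmatches(text, gregexpr(default_regex, text, perl = TRUE))[[1]]
letter_tokens  <- regmatches(text, gregexpr(letters_only, text, perl = TRUE))[[1]]

print(default_tokens)  # punctuation and digits stay attached to tokens
print(letter_tokens)   # only runs of letters survive
```

Previewing a pattern this way is a cheap check before committing to a long MALLET run.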
Defaults to NULL. If not NULL, then a vector of terms to be removed from the input text should be provided. Only implemented when supplying the documents argument.
Number of cores to be used to train the topic model. Defaults to 1.
Defaults to TRUE. If FALSE, then all raw output from MALLET will be left in a "./mallet_intermediate_files" subdirectory of the current working directory.
The amount of Java heap space to be allocated to MALLET. Defaults to '-Xmx10g', indicating 10GB of RAM will be allocated (at maximum). Users may increase this limit if they are working with an exceptionally large corpus.
Defaults to FALSE. If TRUE, then the function only attempts to read back in files from the completed MALLET run. This can be useful if there was an error reading back in the topic reports (usually caused by unusual symbols in the MALLET output).
Defaults to "gunzip -k", which should work on a Mac. This command should be able to unzip a .txt.gz file and keep the original input as a backup, which is what the "-k" option does here.
Defaults to TRUE, but can be set to FALSE if using a large corpus on a computer with relatively little RAM.
Defaults to TRUE. When TRUE, the topic phrase reports are returned. If FALSE, they are excluded.
Returns a list object with the following fields: lda_trace_stats is a data frame reporting the beta hyperparameter value and model log likelihood per token every ten iterations, which can be useful for assessing convergence; document_topic_proportions reports the document topic proportions for all topics; topic_metadata reports the alpha x basemeasure values for all topics, along with the total number of tokens assigned to each topic; topic_top_words reports the 'num_top_words' top words for each topic (in descending order); topic_top_word_counts reports the count of each top word in its respective topic; topic_top_phrases reports top phrases (as found post-hoc by MALLET) associated with each topic; topic_top_phrase_counts reports the counts of these phrases in each topic.
# NOT RUN {
files <- get_file_paths(source = "test sparse doc-term")
sdtm <- generate_sparse_large_document_term_matrix(
    file_list = files,
    maximum_vocabulary_size = -1,
    using_document_term_counts = TRUE)
test <- mallet_lda(documents = sdtm,
                   topics = 10,
                   iterations = 1000,
                   burnin = 100,
                   alpha = 1,
                   beta = 0.01,
                   hyperparameter_optimization_interval = 5,
                   cores = 1)
# }