A function to generate a sparse large document term matrix in blocks from document term vector lists stored as .Rdata objects on disk. This function is designed to work on very large corpora (up to tens of billions of words) for which generating a document term matrix using standard methods would be computationally intractable. However, this function, and R itself, is limited to a vocabulary size of roughly 2.1 billion unique words.
generate_sparse_large_document_term_matrix(file_list, file_directory = NULL,
vocabulary = NULL, maximum_vocabulary_size = -1,
using_document_term_counts = FALSE, generate_sparse_term_matrix = TRUE,
parallel = FALSE, cores = 1, large_vocabulary = FALSE,
term_frequency_threshold = 0, save_vocabulary_to_file = FALSE)
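A minimal sketch of a typical call (the file names and directory below are hypothetical, and assume the intermediate files were produced beforehand by generate_document_term_vector_list()):

# Hypothetical intermediate chunk files produced in a previous step.
files <- paste0("document_term_vector_list_", 1:10, ".Rdata")

dtm <- generate_sparse_large_document_term_matrix(
    file_list = files,
    file_directory = "~/corpus_chunks",
    term_frequency_threshold = 5)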
A character vector of paths to intermediate files, preferably generated by the generate_document_term_vector_list() function, that either reside in the file_directory or have their full paths specified.
The directory where you have stored a series of intermediate .Rdata files, each of which contains an R list object named "document_term_vector_list" holding a list of document term vectors. These files can most easily be generated by the generate_document_term_vector_list() function. Defaults to NULL, in which case the current working directory will be used. This argument can also be left as NULL if the full paths to the intermediate files are provided.
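One convenient way to build the file list is to scan the directory itself, as in this sketch (the directory name is hypothetical):

# List only the .Rdata chunk files; file_directory supplies the path,
# so bare file names are sufficient here.
files <- list.files("~/corpus_chunks", pattern = "\\.Rdata$")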
If we already know the aggregate vocabulary, then it can be provided as a character vector. It is much more computationally efficient to provide this vector ordered from the most frequently appearing words to the least frequently appearing ones. Defaults to NULL, in which case the vocabulary will be determined inside the function. The list object saved automatically in the Vocabulary.Rdata file in the file_directory may also be provided (after first loading it into memory). This is the memory-optimized object saved automatically if generate_sparse_term_matrix == FALSE.
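For illustration, if term counts are already available, the vocabulary can be ordered by frequency before being supplied (word_counts here is a hypothetical named numeric vector of corpus-wide term counts):

# Order terms from most to least frequent, then pass them in.
vocab <- names(sort(word_counts, decreasing = TRUE))
dtm <- generate_sparse_large_document_term_matrix(
    file_list = files,
    vocabulary = vocab)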
An integer specifying the maximum number of unique word types you expect to encounter. Defaults to -1, in which case the maximum vocabulary size used for pre-allocation when finding the common vocabulary across all documents will be set to approximately the total number of words in all documents. If you believe this number to be over 2 billion, or are memory limited on your computer, it is recommended to set this to some lower number. For normal English words, a value of 10 million should be sufficient. If you are dealing with n-grams, then somewhere in the neighborhood of 100 million to 1 billion is often appropriate. If you have reason to believe that your final vocabulary size will exceed ~2,147,000,000, then you should consider working in C++ or rolling your own functions, and congratulations, you have really large text data.
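As a sketch, pre-allocating for an n-gram vocabulary per the guidance above might look like:

# Cap pre-allocation at 100 million unique n-grams to limit memory use.
dtm <- generate_sparse_large_document_term_matrix(
    file_list = files,
    maximum_vocabulary_size = 100000000)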
Defaults to FALSE. If TRUE, then a document_term_count_list is expected for each chunk. See generate_document_term_matrix() for more information.
Defaults to TRUE. If FALSE, then the function only generates and saves the aggregate vocabulary (and counts) as a list object in Aggregate_Vocabular_and_Counts.Rdata, in file_directory or in the current working directory if file_directory = NULL. This option is useful if we have an extremely large corpus and may want to trim the vocabulary first before providing an aggregate vocabulary.
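A sketch of the two-pass workflow this option enables; the object name inside the saved file is not assumed here, so it is recovered via load():

# Pass 1: compute and save only the aggregate vocabulary and counts.
generate_sparse_large_document_term_matrix(
    file_list = files,
    generate_sparse_term_matrix = FALSE)

# Trim the vocabulary as desired, then pass 2: reload the memory-optimized
# vocabulary object (see the vocabulary argument above) and build the matrix.
obj_name <- load("Vocabulary.Rdata")  # load() returns the loaded object name(s)
vocab <- get(obj_name[1])
dtm <- generate_sparse_large_document_term_matrix(
    file_list = files,
    vocabulary = vocab)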
Defaults to FALSE, but can be set to TRUE to speed up processing, provided the user's machine has enough RAM. Parallelization is currently implemented using forking in the parallel package (mclapply), so it will only work on UNIX-based platforms.
Defaults to 1. Can be set to the number of cores on your computer.
Defaults to FALSE. If the user believes their vocabulary contains more than ~500,000 unique terms, setting this to TRUE may result in a substantial reduction in compute time. If TRUE, then the program implements a stemming lookup table to efficiently index terms in the vocabulary. This option only works with parallel = TRUE and is meant to accommodate vocabulary sizes of up to several hundred million unique terms.
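As a sketch, a large-vocabulary run on a UNIX-based machine with plenty of RAM might look like:

# Forked parallelism (UNIX only) plus the large-vocabulary lookup table,
# which requires parallel = TRUE.
dtm <- generate_sparse_large_document_term_matrix(
    file_list = files,
    parallel = TRUE,
    cores = 4,
    large_vocabulary = TRUE)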
The minimum number of times a term must appear in the corpus to avoid being removed. Defaults to 0. A value of 5 is a reasonable choice, and higher values will speed computation by reducing the vocabulary size.
Defaults to FALSE. If TRUE, then the vocabulary file you generate will be saved to disk so that the process can be restarted later.
A sparse document term matrix object. This will likely still be a large object.