preText (version 0.6.2)

optimal_k_comparison: Optimal Topic Model k Comparison

Description

For each dfm, calculate the optimal number of topics (k) for an LDA topic model, using perplexity as the selection criterion.

Usage

optimal_k_comparison(cross_validation_train_document_indicies,
  cross_validation_test_document_indicies, dfm_object_list = NULL,
  topics = c(2, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100), names = NULL,
  parallel = FALSE, cores = 1, intermediate_file_directory = NULL,
  intermediate_file_names = NULL)

Arguments

cross_validation_train_document_indicies

A list of numeric vectors where the length of the list is equal to the number of splits to be used for cross validation, and each vector contains the numeric indices of documents to be used for training.

cross_validation_test_document_indicies

A list of numeric vectors where the length of the list is equal to the number of splits to be used for cross validation, and each vector contains the numeric indices of documents to be used for testing.

dfm_object_list

An optional list of quanteda dfm() objects. If none are provided, then intermediate files will be used.

topics

A numeric vector containing the numbers of topics to search over. Defaults to `c(2,5,10,20,30,40,50,60,70,80,90,100)`.

names

Optional names for each dfm, to make downstream interpretation easier. Defaults to NULL.

parallel

Logical indicating whether model fitting should be performed in parallel. Defaults to FALSE.

cores

The number of cores to use. Defaults to 1, and can be set to any number less than or equal to the number of cores on one's computer.

intermediate_file_directory

Optional directory containing Rdata files for each of the factorial preprocessing combinations.

intermediate_file_names

Optional vector of file names for intermediate Rdata files -- one per combination.

Value

A vector containing the optimal k for each dfm.
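
This page does not spell out how the optimal k is chosen from the perplexity values. One common rule, assumed here and not confirmed for preText, is to average held-out perplexity over the cross-validation splits and take the candidate k with the lowest mean. A minimal sketch of that rule in base R, using made-up perplexity values purely for illustration:

```r
# Candidate numbers of topics (a short, illustrative grid).
topics <- c(2, 5, 10, 20)

# Toy perplexity values, made up for illustration only:
# one row per cross-validation split, one column per candidate k.
perplexity <- rbind(
    c(950, 820, 790, 840),
    c(970, 810, 800, 835),
    c(940, 830, 780, 850))
colnames(perplexity) <- topics

# Average over splits, then pick the k with the lowest mean perplexity
# (assumed selection rule; lower perplexity means better held-out fit).
mean_perplexity <- colMeans(perplexity)
optimal_k <- topics[which.min(mean_perplexity)]
print(optimal_k)
```

Under this assumed rule, the returned vector would hold one such value per dfm, so its length would match the number of dfm objects supplied.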

Examples

# NOT RUN {
set.seed(12345)
# load the package
library(preText)
# load in the data
data("UK_Manifestos")
# preprocess data
preprocessed_documents <- factorial_preprocessing(
    UK_Manifestos,
    use_ngrams = TRUE,
    infrequent_term_threshold = 0.02,
    verbose = TRUE)
cross_validation_splits <- 10
n_docs <- length(UK_Manifestos)
# create 10 test/train splits
train_inds <- vector(mode = "list", length = cross_validation_splits)
test_inds <- vector(mode = "list", length = cross_validation_splits)
# sample CV indices: hold out roughly 20% of documents in each split
for (i in seq_len(cross_validation_splits)) {
    test <- sample(seq_len(n_docs),
                   size = round(n_docs / 5),
                   replace = FALSE)
    # train on every document that is not in the test set
    train_inds[[i]] <- setdiff(seq_len(n_docs), test)
    test_inds[[i]] <- test
}
# get the optimal number of topics (this will take a very long time):
optimal_k <- optimal_k_comparison(
     train_inds,
     test_inds,
     preprocessed_documents$dfm_list,
     topics = c(25,50,75,100,125,150,175,200),
     names  = preprocessed_documents$labels)
# }
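
The example above draws each test set independently, so a document can appear in several test sets and some documents may never be held out. If non-overlapping folds are preferred, the two index lists can be built as below. This is a sketch only: `make_cv_folds` is a hypothetical helper that is not part of preText, and `n_docs = 50` is a stand-in for `length(UK_Manifestos)`.

```r
# Hypothetical helper (not part of preText): build k non-overlapping folds.
make_cv_folds <- function(n_docs, k) {
    # randomly assign every document to one of k folds of near-equal size
    fold_id <- sample(rep(seq_len(k), length.out = n_docs))
    # each fold serves once as the test set; all other documents train
    test_inds <- lapply(seq_len(k), function(f) which(fold_id == f))
    train_inds <- lapply(seq_len(k), function(f) which(fold_id != f))
    list(train = train_inds, test = test_inds)
}

set.seed(12345)
# n_docs = 50 is a placeholder for length(UK_Manifestos)
folds <- make_cv_folds(n_docs = 50, k = 10)
```

The resulting `folds$train` and `folds$test` are lists of index vectors, one per split, which is the shape described for `cross_validation_train_document_indicies` and `cross_validation_test_document_indicies`.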