factorial_preprocessing: A function to perform factorial preprocessing of a corpus of texts into quanteda document-feature matrices.

Description

Preprocesses a corpus of texts into a quanteda document-feature matrix (dfm) in 128 different ways, one for each combination of binary preprocessing choices.
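The figure of 128 is what seven independent on/off choices produce (2^7). The sketch below only illustrates that count in base R; the step labels are hypothetical placeholders chosen for illustration, and the code is not the package's own implementation.

```r
# Hypothetical labels for seven binary preprocessing steps
# (illustrative only; not read from the package).
steps <- c("remove_punctuation", "remove_numbers", "lowercase",
           "stem", "remove_stopwords", "use_ngrams",
           "remove_infrequent_terms")

# Every TRUE/FALSE combination of the seven steps: one row per specification.
choices <- expand.grid(rep(list(c(TRUE, FALSE)), length(steps)))
names(choices) <- steps

n_combinations <- nrow(choices)  # 2^7 rows
head(choices, 3)
```

Each row of such a table can be read as one preprocessing specification, and the row number as the index of that combination.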

Usage

factorial_preprocessing(text, use_ngrams = TRUE,
  infrequent_term_threshold = 0.01, parallel = FALSE, cores = 1,
  intermediate_directory = NULL, parameterization_range = NULL,
  return_results = TRUE, verbose = TRUE)

Arguments

text

A vector of strings (one per document) or quanteda corpus object from which we wish to form a document-term matrix.

use_ngrams

Option to extract 1-, 2-, and 3-grams from the text as an additional preprocessing step. Defaults to TRUE.

infrequent_term_threshold

A proportion threshold at which infrequent terms are to be filtered. Defaults to 0.01 (terms that appear in less than 1 percent of documents).
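To make the threshold concrete, here is a minimal base R sketch of filtering terms by the proportion of documents they appear in. The toy counts, the variable names, and the use of `>=` at the boundary are assumptions for illustration; the package may handle the boundary case differently.

```r
# Toy document-term counts: 4 documents (rows) by 3 terms (columns).
counts <- matrix(c(2, 0, 1,
                   1, 0, 0,
                   3, 1, 1,
                   1, 0, 0),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:4),
                                 c("tax", "zebra", "health")))

# Much larger than the 0.01 default so the toy example filters something.
threshold <- 0.5

# Proportion of documents in which each term appears at least once.
doc_proportion <- colSums(counts > 0) / nrow(counts)

# Keep terms whose document proportion reaches the threshold.
kept <- counts[, doc_proportion >= threshold, drop = FALSE]
colnames(kept)
```

With the default of 0.01 and a corpus of 1,000 documents, the same rule would drop terms found in fewer than 10 documents.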

parallel

Logical indicating whether factorial preprocessing should be performed in parallel. Defaults to FALSE.

cores

Number of cores to use when parallel = TRUE. Defaults to 1; can be set to any number less than or equal to the number of cores on one's computer.
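One way to pick a value for cores is to ask the base parallel package how many cores the machine reports. This is a sketch of a common convention (leaving one core free), not a recommendation from the package authors; `cores_to_use` is a name introduced here.

```r
library(parallel)

available <- detectCores()               # may be NA on some platforms
if (is.na(available)) available <- 1L

# Leave one core free for the rest of the system, but never go below 1.
cores_to_use <- max(1L, available - 1L)

# Not run: pass the value to the function.
# factorial_preprocessing(text, parallel = TRUE, cores = cores_to_use)
```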

intermediate_directory

Optional path to a directory where each dfm will be saved as an intermediate step. The file names follow the convention intermediate_dfm_i.Rdata, where i is the index of the combination of preprocessing choices. The function will then attempt to read all of the dfms back into a list if return_results = TRUE (the default), or simply end the function call if return_results = FALSE. This can be a useful option if the user is preprocessing a corpus whose dfm list would be impractical to work with due to its size.
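If return_results = FALSE was used, the saved files can be read back one at a time later. The sketch below writes three stand-in files to a temporary directory and loads them again in base R. The name of the object stored inside each real file is not documented here, so the helper `read_one` (hypothetical) simply returns the first object it finds.

```r
# Stand-in for an intermediate directory: three toy "dfms" saved with the
# documented file-name convention.
out_dir <- file.path(tempdir(), "pretext_intermediate")
dir.create(out_dir, showWarnings = FALSE)
for (i in 1:3) {
  toy_dfm <- matrix(i, nrow = 2, ncol = 2)
  save(toy_dfm, file = file.path(out_dir, paste0("intermediate_dfm_", i, ".Rdata")))
}

# Load one file into a private environment and return the first object in it,
# so the stored object's name does not need to be known in advance.
read_one <- function(path) {
  env <- new.env()
  object_names <- load(path, envir = env)
  get(object_names[1], envir = env)
}

files <- file.path(out_dir, paste0("intermediate_dfm_", 1:3, ".Rdata"))
dfm_list <- lapply(files, read_one)
```

Building the paths from the index, rather than from an alphabetical directory listing, keeps the list in numeric order (a plain sort would place 10 before 2).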

parameterization_range

Defaults to NULL, but can be set to a numeric vector of indices identifying which combinations of preprocessing decisions to run. This can be used to restart a large analysis after an interruption such as a power failure.
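A restart might work out which indices still need to be run by comparing the files already present in the intermediate directory against the full range. The following base R sketch assumes 128 combinations and the intermediate_dfm_i.Rdata naming convention; the directory and its placeholder files are created here purely for illustration.

```r
# Stand-in for a directory left behind by an interrupted run:
# combinations 1, 2 and 5 finished before the interruption.
restart_dir <- file.path(tempdir(), "pretext_restart")
dir.create(restart_dir, showWarnings = FALSE)
invisible(file.create(
  file.path(restart_dir, paste0("intermediate_dfm_", c(1, 2, 5), ".Rdata"))
))

total <- 128  # number of combinations, assumed from the Description

# Extract the index i from each intermediate_dfm_i.Rdata file name.
existing <- list.files(restart_dir, pattern = "^intermediate_dfm_[0-9]+\\.Rdata$")
done <- as.integer(sub("^intermediate_dfm_([0-9]+)\\.Rdata$", "\\1", existing))

# Indices that have no saved file yet.
remaining <- setdiff(seq_len(total), done)

# Not run: pass the unfinished indices back to the function.
# factorial_preprocessing(text, intermediate_directory = restart_dir,
#                         parameterization_range = remaining,
#                         return_results = FALSE)
```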

return_results

Logical indicating whether the list of dfms should be returned. Defaults to TRUE; can be set to FALSE to prevent an overly large dfm list from being created.

verbose

Logical indicating whether more information should be printed to the screen to let the user know about progress in preprocessing. Defaults to TRUE.

Value

A list object containing one document-term matrix for each combination of preprocessing choices.

Examples

# NOT RUN {
# load the package
library(preText)
# load in the data
data("UK_Manifestos")
# preprocess data
preprocessed_documents <- factorial_preprocessing(
    UK_Manifestos,
    use_ngrams = TRUE,
    infrequent_term_threshold = 0.02,
    verbose = TRUE)
# }
