preText

An R package to assess the consequences of text preprocessing decisions.

For an overview of the package, see the [getting started with preText vignette].

The paper detailing the procedure can be found at the link below:

  • Matthew J. Denny and Arthur Spirling (2017). "Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It". [ssrn.com/abstract=2849145]

Installation

The easiest way to install preText is from CRAN, via the standard install.packages command:

install.packages("preText")

To get the latest version from GitHub instead, first read the "Requirements for using C++ code with R" section of the tutorial "Using C++ and R code Together with Rcpp". Because preText makes use of C++ code, you will likely need to install Xcode (on a Mac) or Rtools (on Windows) before you can install the package from GitHub. Then install the devtools package:

install.packages("devtools")

Now we can install from GitHub using the following line:

devtools::install_github("matthewjdenny/preText")

Once the preText package is installed, you may access its functionality as you would any other package by calling:

library(preText)

If all went well, you should be able to replicate the steps in the vignette("getting_started").
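
One quick way to confirm the installation is sketched below. It uses only base R's utils functions and assumes nothing about preText beyond the package name: it reports the installed version and lists the vignettes that ship with the package.

```r
# Check whether preText is installed before trying to use it.
if (requireNamespace("preText", quietly = TRUE)) {
  # Report the installed version.
  print(utils::packageVersion("preText"))
  # List the vignettes bundled with the installed package.
  print(utils::vignette(package = "preText"))
} else {
  message("preText is not installed; run install.packages(\"preText\") first.")
}
```

The vignette listing shows the exact vignette names available in your installed version, which is useful if a vignette call by name does not resolve.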

Basic Usage

The basic functionality of this package is detailed in a vignette, which is [available here]. Beyond this basic functionality, the package includes a number of additional utility and analysis functions for exploring and comparing multiple document-term matrices.
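
As a rough illustration of the basic workflow, the sketch below follows the pattern of the getting-started vignette but runs on the UK_Manifestos data bundled with the package. Several things here are assumptions rather than guarantees: that UK_Manifestos loads as a character vector with one manifesto per element, and that the argument names match those in your installed version. Check ?factorial_preprocessing and ?preText before relying on it.

```r
library(preText)

# Assumption: UK_Manifestos is a character vector, one manifesto per element.
data("UK_Manifestos", package = "preText")

# Use a small subset so the example runs quickly.
documents <- UK_Manifestos[1:10]

# Build one dfm for each combination of preprocessing decisions.
preprocessed_documents <- factorial_preprocessing(
  documents,
  use_ngrams = FALSE,
  infrequent_term_threshold = 0.2,
  verbose = FALSE
)

# Score each preprocessing specification by how much it changes
# pairwise document distances.
preText_results <- preText(
  preprocessed_documents,
  dataset_name = "UK Manifestos",
  distance_method = "cosine",
  num_comparisons = 20,
  verbose = FALSE
)

# Plot the preText scores and the regression of scores on each decision.
preText_score_plot(preText_results)
regression_coefficient_plot(preText_results, remove_intercept = TRUE)
```

Setting use_ngrams = FALSE keeps the number of preprocessing combinations smaller, which shortens the run time on a first try.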

Bug Reporting

Please report any bugs or errors to mdenny@psu.edu.

Version: 0.6.2
License: GPL-3
Last Published: January 12th, 2018
Monthly Downloads: 58

Functions in preText (0.6.2)

  • factorial_preprocessing: A function to perform factorial preprocessing of a corpus of texts into quanteda document-feature matrices (dfms).
  • mantel_comparison: Ensemble Mantel Tests
  • mantel_comparison_to_base: Ensemble Mantel Tests
  • optimal_k_comparison: Optimal Topic Model k Comparison
  • scaling_comparison: Scaling Comparison
  • topic_key_term_plot: Plot Prevalence of Topic Key Terms
  • UK_Manifestos: Full text of 69 UK party manifestos from 1918-2001.
  • calculate_prediction_errors: Calculate mean prediction error for preprocessing decisions.
  • preText_test: preText Test
  • preprocessing_choice_regression: Preprocessing Choice Regressions
  • dfm_scaling_test: Comparison of dfms using N-dimensional scaling, with a test for difference from the mean dfm scaled position.
  • document_position_plots: Document Position Plots
  • regression_coefficient_plot: Regression Coefficient Plot
  • remove_infrequent_terms: Remove infrequently occurring terms from quanteda dfm.
  • preText: preText: Diagnostics to Assess The Effects of Text Preprocessing Decisions
  • preText_score_plot: preText specification plot
  • wordfish_rank_plot: Plot of Wordfish rankings of documents
  • topic_novelty_score: Topic Top-Terms Novelty Score
  • wordfish_comparison: Wordfish Comparison