SpeedReader

An R package that provides functions to facilitate high performance text analysis in R.

Overview

This package provides:

  • A front end to Stanford's CoreNLP libraries for POS tagging and named entity recognition.
  • Term-category association analyses, including PMI and TF-IDF, with various forms of weighting.
  • A front end for topic modeling using MALLET that also reads the results back into R and presents them in a series of data.frames.
  • A set of methods to compare documents and document versions using sequences of n-grams and ensembles of Dice coefficients.
  • An implementation of the informed Dirichlet model from Monroe et al. (2008), along with publication-quality funnel plots.
  • Functions for forming complex contingency tables.
  • Functions for displaying text in LaTeX tables.
  • Functionality to read in and preprocess text data into a document-term matrix.

The unifying theme of these functions is that they are designed to be easy to use and to operate on up to tens of billions of tokens over hundreds of millions of documents, without requiring a massive map-reduce cluster with terabytes of RAM. I decided to produce an R package because these are functions I use quite frequently and have replicated across several projects. Check out the early version of the package vignette, available here!

Installation

Requirements for using C++ code with R

Note that if you are using a Mac, you will need to start by making sure you have Xcode and the developer tools installed, or you will not be able to compile the C++ code used in the samplers for this package. Go to https://developer.apple.com/xcode/downloads/ and select the link to the additional downloads page, which will prompt you to enter your Apple ID; this will let you download the developer tools. This requirement is not unique to this package, but applies to all packages that use Rcpp.

If you are using a Windows machine, you will need the latest release of R (3.2.0+) and will also need to install Rtools (v33 or higher, available here: http://cran.r-project.org/bin/windows/Rtools/) before you can use any packages with C++ code in them. It is also highly advisable to use RStudio to download and install the package, as it seems to play more nicely with Rcpp under Windows. You may also want to visit this blog post, which has more information on making C++ work with R under Windows.

If you are using a Linux distro, make sure you have a C++ compiler installed, but in general you should not run into as many issues.
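
On any platform, a quick way to check from within R whether a C++ compiler is visible on your PATH (an informal check only; finding one does not guarantee compilation will succeed):

Sys.which("g++")     # returns the path to g++ if one is on your PATH, "" otherwise
Sys.which("clang++") # the compiler typically used on recent versions of macOS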

More generally, I suggest you check out this tutorial on using C++ with R. It goes over some of the code used in this package and also covers a number of potential problems you might run into when trying to compile C++ code on your computer, so it is a good reference.

Installing The Package

To install this package from GitHub, you will need to have Hadley Wickham's devtools package installed:

install.packages("devtools")
library("devtools")

Now we can install from GitHub using the following line:

devtools::install_github("matthewjdenny/SpeedReader")

I have had success installing the package with R 3.2.0+, but please email me if you hit any issues.
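
Once it is installed, you can load the package in the usual way:

library(SpeedReader)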

Functions

The SpeedReader package currently provides the following functions to aid in the preprocessing of large text corpora; a sketch of how several of them fit together follows the list.

  • generate_document_term_vectors() -- A function to ingest raw text data, either as .txt files, as R objects with one string per document, as R objects with a term vector per document, or as csv/tsv files with a column of unique words and (optionally) their counts. If raw text is provided, cleaning and tokenization are currently handled by the included clean_document_text() function, which makes use of regular expressions; cleaning and NER will eventually be provided using Stanford's CoreNLP libraries.
  • generate_blocked_document_term_vectors() -- A function to automate generating and saving to disk blocks of documents for corpora that are too large to fit in memory. Automatically formats data for downstream use in large scale text manipulation functions.
  • count_words() -- A function to count words in a provided document term vector list. Has the option to continue adding to a previously generated vocabulary/count object.
  • generate_document_term_matrix() -- A function to generate a document term matrix from a term-vector list object returned by generate_document_term_vectors(). Provides lots of options and will automatically generate a vocabulary if none is provided. Provides an option to return a sparse document-term matrix.
  • generate_sparse_large_document_term_matrix() -- The main function provided by the package. Will generate very large (sparse) document term matrices from very large vocabularies, in parallel, in a memory efficient manner.
  • sparse_to_dense_matrix() -- A helpful function for converting sparse matrix objects to dense matrix objects. Use with caution on large sparse matrices!
  • tfidf() -- Calculates and displays TF-IDF scores for a given document term matrix.
  • contingency_table() -- Generates a contingency table for a given document term matrix and set of document covariates.
  • pmi() -- Calculates a number of information theoretic quantities on a given contingency table.
  • corenlp() -- A wrapper for Stanford's wonderful CoreNLP libraries. Currently returns one data.frame per document with lots of CoreNLP token metadata, including POS and NER tags. Also optionally wraps syntactic parsing and coreference resolution functionality.
  • mallet_lda() -- A wrapper for the incredibly efficient, robust and well tested implementation of latent Dirichlet allocation included in the MALLET libraries. Reads all output into R for easy reuse in other applications.
  • feature_selection() -- Allows the user to perform feature selection on a contingency table using a number of different formulations of TF-IDF as well as the informed Dirichlet model from the Monroe et al. "Fightin' Words" paper.
  • fightin_words_plot() -- Makes really nice looking funnel plots similar to those in the Monroe et al. "Fightin' Words" paper from the output of the feature_selection() function.
  • calculate_document_pair_distances() -- Calculates cosine distances between pairs of documents for a given document-term matrix.
  • dice_coefficient_line_matching() -- Uses Dice coefficients calculated on token bigrams to determine the number of lines/sentences in document 1 that are also in document 2 (based on some Dice coefficient threshold), and vice versa. A worked example of the underlying Dice calculation appears after the utility function list below.
  • document_similarities() -- Calculates sequence based document similarity metrics (more details forthcoming). The implementation is extremely efficient and parallelizable, and can perform billions of document comparisons per day on a moderately sized HPC allocation (~40 cores).
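
To give a sense of how these pieces fit together, here is a minimal sketch of a small, in-memory pipeline using the functions listed above. The calls pass arguments positionally as a simplifying assumption; the exact argument names may differ, so consult the package documentation before running it.

library(SpeedReader)

# Two toy documents, held in memory as one string per document.
docs <- c("The quick brown fox jumps over the lazy dog.",
          "The quick red fox leaps over a sleeping dog.")

# Ingest the raw text; per the description above, cleaning and
# tokenization are handled internally via clean_document_text().
# Passing the documents positionally is an assumption about the API.
term_vectors <- generate_document_term_vectors(docs)

# Build a document-term matrix from the term vector list, then
# calculate TF-IDF scores for its terms.
dtm <- generate_document_term_matrix(term_vectors)
tfidf_scores <- tfidf(dtm)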

The SpeedReader package also provides the following utility functions:

  • unlist_and_concatenate() -- A function to unlist and concatenate a subset of a matrix/data.frame.
  • order_by_counts() -- A function to generate an ordered word count dataframe from a raw vector of words.
  • multi_plot() -- An implementation of matplot with nice coloring and automatic legend generation.
  • kill_zombies() -- A function which takes no arguments and kills zombie R processes if the user is using a UNIX based machine.
  • estimate_plots() -- A function to generate parameter estimate plots with 95 percent confidence bounds for up to two models we wish to compare.
  • distinct_words() -- A function to find (semi)-distinct words in a list of term vectors.
  • combine_document_term_matrices() -- A function to combine multiple document term matrices into a single aggregate document term matrix.
  • color_words_by_frequency() -- A function to generate LaTeX output from a dataframe containing words and their frequencies. With shading based on word frequency.
  • color_word_table() -- A function to generate LaTeX output from a dataframe containing covariates and top words.
  • clean_document_text() -- A function which cleans the raw text of a document provided either as a single string, a vector of strings, or a column of a data.frame.
  • topic_coherence() -- A function to calculate topic coherence for a given topic using the formulation in "Optimizing Semantic Coherence in Topic Models", available here: <http://dirichlet.net/pdf/mimno11optimizing.pdf>.
  • frequency_threshold() -- Finds combinations of covariate values that occur more than a specified number of times.
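
As promised above, here is a short, self-contained illustration of the bigram Dice coefficient that dice_coefficient_line_matching() builds on. This is not the package's internal implementation, just the standard set-based calculation, where the Dice coefficient of two sets A and B is 2|A ∩ B| / (|A| + |B|):

# A standalone bigram Dice coefficient in base R (for illustration only).
bigram_dice <- function(sentence_1, sentence_2) {
  tokens_1 <- tolower(strsplit(sentence_1, "\\s+")[[1]])
  tokens_2 <- tolower(strsplit(sentence_2, "\\s+")[[1]])
  # Form the set of adjacent-token bigrams for each sentence.
  bigrams_1 <- unique(paste(head(tokens_1, -1), tail(tokens_1, -1)))
  bigrams_2 <- unique(paste(head(tokens_2, -1), tail(tokens_2, -1)))
  # Dice coefficient: 2 * |intersection| / (|A| + |B|).
  2 * length(intersect(bigrams_1, bigrams_2)) /
    (length(bigrams_1) + length(bigrams_2))
}

bigram_dice("the quick brown fox", "the quick red fox")
# The sentences share 1 of 3 bigrams each, so this returns 2 * 1 / 6 = 0.33.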

Version

0.9.1

License

GPL-3

Last Published

May 2nd, 2018

Functions in SpeedReader (0.9.1)

ACMI_contribution

Calculate Average Conditional Mutual Information (ACMI) contribution of each vocabulary term in a sparse DTM.
Processed_Text

Twenty bills tokenized and tagged by CoreNLP
combine_document_term_matrices

A function to combine multiple document term matrices into a single aggregate document term matrix.
compare_tf_idf_scalings

A function that applies a number of different forms of TF-IDF scaling to a document-term matrix.
color_word_table

A function to generate LaTeX output from a dataframe containing covariates and top words.
color_words_by_frequency

A function to generate LaTeX output from a dataframe containing words and their frequencies.
count_words

A function to efficiently form aggregate word counts and a common vocabulary vector from an unordered list of document term vectors.
dice_coefficient_diff_table

Lines In Both Documents via Dice Coefficients
SpeedReader

SpeedReader: functions to facilitate high performance text processing in R.
calculate_document_pair_distances

Document Distances
dice_coefficient_line_matching

Lines In Both Documents via Dice Coefficients
distinct_words

A function to find (semi)-distinct words in a list of term vectors.
congress_bills

All versions of the first 20 bills introduced in the House and Senate in the 103rd Congress.
contingency_table

Generates a contingency table from user-specified document covariates and a document term matrix.
fightin_words_plot

A function that generates plots similar to those in Monroe et al. 'Fightin Words...'.
frequency_threshold

A function to frequency threshold a vector of strings.
get_file_paths

A function that returns the file paths to two example raw datasets for testing.
generate_document_term_vectors

A function to generate document term vectors from a variety of inputs.
generate_sparse_large_document_term_matrix

A function to generate a sparse large document term matrix in blocks from a list of document term vector lists stored as .Rdata objects on disk. This function is designed to work on very large corpora (up to tens of billions of words) for which generating a document term matrix using standard methods would be computationally intractable. However, this function, and R itself, is limited to a vocabulary size of roughly 2.1 billion unique words.
sparse_to_dense_matrix

A function to convert a slam::simple_triplet_matrix sparse matrix object to a dense matrix object.
check_directory_name

A function to ensure that a directory name is in the proper format to be pasted together with a file name. It adds a trailing / if necessary.
clean_document_text

A function which cleans the raw text of a document provided either as a single string, a vector of strings, or a column of a data.frame.
corenlp_blocked

Runs Stanford CoreNLP on a collection of .txt files and processes them in blocks of a specified size, saving intermediate results to disk. Designed to function on very large corpora.
count_ngrams

An experimental function to efficiently generate a vocabulary in parallel from output produced by the ngrams() function. Cores > 1 will only work for users with GNU coreutils > 8.13 as the sort --parallel option is used. If you have an older version use cores = 1.
download_mallet

Checks the Java version on your computer and downloads MALLET jar files for use with this package.
edit_metrics

Calculate Edit Metrics Between Two Document Versions
estimate_plots

A function to generate parameter estimate plots with 95 percent confidence bounds for up to two models we wish to compare.
feature_selection

A function that implements a number of feature selection methods for finding top words which distinguish between two classes.
generate_blocked_document_term_vectors

A function to generate and save blocks of document term vectors to coherently named files from a variety of inputs.
generate_document_term_matrix

A function to generate a document term matrix from a list of document term vectors.
order_by_counts

A function to generate an ordered word count dataframe from a raw vector of words.
pmi

A function to calculate a number of information-theoretic measures on terms in a contingency table, including point-wise mutual information.
convert_quanteda_to_slam

A function to convert a quanteda dfm object to a slam::simple_triplet_matrix.
corenlp

Runs Stanford CoreNLP on a collection of documents
document_term_vector_list

Document Term Vector List: Congressional Bills
download_corenlp

Checks the Java version on your computer and downloads Stanford CoreNLP jar files for use with this package.
document_similarities

Calculate sequence based document similarities
document_term_count_list

Document Term Count List: Congressional Bills
kill_zombies

A function which takes no arguments and kills zombie R processes if the user is using a UNIX based machine
mallet_lda

A wrapper function for LDA using the MALLET machine learning toolkit -- an incredibly efficient, fast and well tested implementation of LDA. See http://mallet.cs.umass.edu/ and https://github.com/mimno/Mallet for much more information on this amazing set of libraries.
reference_distribution_distance

Reference distribution distances
speed_set_vocabulary

A function that reorganizes the vocabulary to speed up document term matrix formation using a string stem dictionary.
multi_dice_coefficient_matching

Multiple N-Gram Length Dice Coefficient Document Matching
multi_plot

An implementation of matplot with nice coloring and automatic legend generation
ngram_sequnce_plot

N-Gram Sequence Matching
ngrams

Extracts N-Grams and phrases from a collection of documents that has been preprocessed by the corenlp() function.
tfidf

A function to calculate TF-IDF and other related statistics on a set of documents.
topic_coherence

A function to calculate topic coherence for a given topic using the formulation in "Optimizing Semantic Coherence in Topic Models" available here: <http://dirichlet.net/pdf/mimno11optimizing.pdf>
get_unique_values_and_counts

Find unique values and the counts of those variables for a set of variables in a data.frame. Useful in PMI analysis and for exploring document metadata.
mutual_information

Mutual Information
ngram_sequence_matching

N-Gram Sequence Matching
unlist_and_concatenate

A function to unlist and concatenate a subset of a matrix/data.frame
sparse_doc_term_parallel

Only to be used internally. A function to generate a sparse large document term matrix in parallel.