read.documents: Read LDA-formatted Document and Vocabulary Files

Description

These functions read the document and vocabulary files associated with a corpus. The files use the same format as LDA-C (see Details). The return values can be passed to the inference procedures defined in the lda package.

Usage

read.documents(filename = "mult.dat")
read.vocab(filename = "vocab.dat")

Arguments

filename

A length-1 character vector specifying the path to the document or vocabulary file. The default is mult.dat for read.documents and vocab.dat for read.vocab.

Value

read.documents returns a list of matrices, one per document, suitable as input for the inference routines in lda. See lda.collapsed.gibbs.sampler for details.
read.vocab returns a character vector of feature labels, one per line of the vocabulary file.

Details

The details of the format are also described in the readme for LDA-C.

The format of the documents file is appropriate for typical text data as it sparsely encodes observed features. A single file encodes a corpus (a collection of documents). Each line of the file encodes a single document (a feature vector).

The line encoding a document begins with an integer followed by a number of feature-count pairs, all separated by spaces. A feature-count pair consists of two integers separated by a colon. The first integer indicates the feature (note that this is zero-indexed!) and the second integer indicates the count (i.e., value) of that feature. The initial integer of a line indicates how many feature-count pairs are to be expected on that line.
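The line layout described above can be sketched from the writing side. The helper below is hypothetical (encode_ldac_line is not part of the lda package) and assumes the features are already zero-indexed integers:

```r
# Hypothetical helper, not part of lda: build one LDA-C line from
# zero-indexed feature ids and their counts.
encode_ldac_line <- function(features, counts) {
  stopifnot(length(features) == length(counts))
  # Each pair is "feature:count".
  pairs <- paste(features, counts, sep = ":")
  # The line starts with the number of pairs, then the pairs, space separated.
  paste(c(length(pairs), pairs), collapse = " ")
}

line <- encode_ldac_line(c(0L, 3L, 5L), c(2L, 1L, 4L))
print(line)  # "3 0:2 3:1 5:4"
```

Writing one such line per document with writeLines would give a file in the layout that read.documents expects.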

Note that we permit a feature to appear more than once on a line, in which case the value for that feature will be the sum of all instances (the behavior for such files is undefined for LDA-C). For example, a line reading 4 7:1 0:2 7:3 1:1 will yield a document with feature 0 occurring twice, feature 1 occurring once, and feature 7 occurring four times, with all other features occurring zero times.
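The summing rule can be illustrated with a small base R sketch. parse_ldac_line is a hypothetical helper written for this illustration; it is not the implementation used by read.documents, and the two-row layout of its result is an assumption chosen for readability:

```r
# Hypothetical helper, not part of lda: parse one LDA-C line and sum the
# counts of any feature that appears more than once.
parse_ldac_line <- function(line) {
  tokens <- strsplit(trimws(line), "[[:space:]]+")[[1]]
  n_pairs <- as.integer(tokens[1])
  pairs <- tokens[-1]
  # The leading integer should equal the number of feature:count pairs.
  stopifnot(length(pairs) == n_pairs)
  parts <- strsplit(pairs, ":", fixed = TRUE)
  features <- as.integer(vapply(parts, `[`, character(1), 1))
  counts <- as.integer(vapply(parts, `[`, character(1), 2))
  # Group by feature id and add up the counts within each group.
  totals <- tapply(counts, features, sum)
  rbind(feature = as.integer(names(totals)), count = as.integer(totals))
}

doc <- parse_ldac_line("4 7:1 0:2 7:3 1:1")
print(doc)  # feature row: 0 1 7; count row: 2 1 4
```

Feature 7 appears twice on the line (7:1 and 7:3), so its count in the result is 4, matching the example above.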

The vocabulary file is a list of newline-separated strings, one per feature, in feature order. That is, the first line of the vocabulary file is the label for feature 0, the second line is the label for feature 1, and so on.
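Because features are zero-indexed while R vectors are one-indexed, looking up a label requires adding one to the feature id. A minimal sketch, using a made-up three-word vocabulary and readLines as a base R stand-in for reading the file:

```r
# Write a small, made-up vocabulary to a temporary file, one label per line.
vocab_file <- tempfile(fileext = ".dat")
writeLines(c("apple", "banana", "cherry"), vocab_file)

# Line k of the file holds the label for feature k - 1.
labels <- readLines(vocab_file)

# Zero-indexed feature ids, as they would appear in a documents file.
feature_ids <- c(0L, 2L)

# Shift by one before subsetting the one-indexed R vector.
feature_labels <- labels[feature_ids + 1L]
print(feature_labels)  # "apple" "cherry"

unlink(vocab_file)
```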

References

Blei, David M. Latent Dirichlet Allocation in C. http://www.cs.princeton.edu/~blei/lda-c/index.html

Examples

## Read files using default values.
setwd("corpus directory")
documents <- read.documents()
vocab <- read.vocab()

## Read files from another location.
documents <- read.documents("corpus directory/features")
vocab <- read.vocab("corpus directory/labels")
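The paths in the examples above are placeholders. The following sketch instead writes a tiny, made-up corpus to a temporary directory and reads it back, so it can be run as is. The file contents are illustrative assumptions, and the read step is skipped when the lda package is not installed:

```r
# Create a temporary corpus directory with made-up contents.
corpus_dir <- tempfile("corpus")
dir.create(corpus_dir)
doc_path <- file.path(corpus_dir, "mult.dat")
vocab_path <- file.path(corpus_dir, "vocab.dat")

# Two documents: the first has features 0 and 2, the second has feature 1.
writeLines(c("2 0:1 2:3", "1 1:2"), doc_path)
# Labels for features 0, 1 and 2, in that order.
writeLines(c("apple", "banana", "cherry"), vocab_path)
files_written <- file.exists(c(doc_path, vocab_path))

# Read the files back only if lda is available.
if (requireNamespace("lda", quietly = TRUE)) {
  documents <- lda::read.documents(doc_path)
  vocab <- lda::read.vocab(vocab_path)
  str(documents)
  print(vocab)
}

# Remove the temporary files.
unlink(corpus_dir, recursive = TRUE)
```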
