read.documents(filename = "mult.dat")
read.vocab(filename = "vocab.dat")
read.documents returns a list of matrices suitable as input for the inference routines; see lda.collapsed.gibbs.sampler for details.
read.vocab returns a character vector of strings corresponding to features.
The format of the documents file is appropriate for typical text data as it sparsely encodes observed features. A single file encodes a corpus (a collection of documents). Each line of the file encodes a single document (a feature vector).
The line encoding a document begins with an integer followed by a number of feature-count pairs, all separated by spaces. A feature-count pair consists of two integers separated by a colon. The first integer indicates the feature (note that this is zero-indexed!) and the second integer indicates the count (i.e., value) of that feature. The initial integer of a line indicates how many feature-count pairs are to be expected on that line.
Note that we permit a feature to appear more than once on a line, in which case the value for that feature will be the sum of all instances (the behavior for such files is undefined for LDA-C). For example, a line reading 4 7:1 0:2 7:3 1:1 will yield a document with feature 0 occurring twice, feature 1 occurring once, and feature 7 occurring four times, with all other features occurring zero times.
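The line format and the summing of repeated features can be illustrated with a minimal base R sketch. The helper `parse_doc_line` is hypothetical and is not part of the package; it only mirrors the rules described above and makes no claim about how read.documents is implemented or what structure it returns.

```r
# Hypothetical helper (not part of the lda package): parse one line of the
# documents file into counts keyed by zero-indexed feature.
parse_doc_line <- function(line) {
  tokens <- strsplit(trimws(line), "[[:space:]]+")[[1]]

  # The leading integer states how many feature:count pairs follow.
  n_pairs <- as.integer(tokens[1])
  pairs <- tokens[-1]
  stopifnot(length(pairs) == n_pairs)

  # Split each "feature:count" pair into its two integers.
  parts <- strsplit(pairs, ":", fixed = TRUE)
  feature <- vapply(parts, function(p) as.integer(p[1]), integer(1))
  count <- vapply(parts, function(p) as.integer(p[2]), integer(1))

  # Sum the counts of any feature that appears more than once on the line.
  tapply(count, feature, sum)
}

counts <- parse_doc_line("4 7:1 0:2 7:3 1:1")
print(counts)
```

For the example line, the result has entries named "0", "1", and "7" with values 2, 1, and 4, matching the description above.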
The vocabulary file is a set of newline-separated strings, one per feature. That is, the first line of the vocabulary file is the label for feature 0, the second line is the label for feature 1, and so on.
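Because features are zero-indexed in the documents file while R vectors are one-indexed, looking up a label requires adding one to the feature index. The sketch below uses a made-up three-word vocabulary and `readLines` as a stand-in for read.vocab, on the assumption that the vocabulary is held as one string per line.

```r
# Write a tiny, made-up vocabulary file to a temporary location.
vocab_path <- tempfile(fileext = ".dat")
writeLines(c("apple", "banana", "cherry"), vocab_path)

# Stand-in for read.vocab (an assumption): one label per line.
vocab <- readLines(vocab_path)

# Feature 0 is the first line of the file, so shift indices by one.
feature_ids <- c(0L, 2L)
labels <- vocab[feature_ids + 1L]
print(labels)
```

Here features 0 and 2 map to the first and third lines of the vocabulary file.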
lda.collapsed.gibbs.sampler for the format of the return value of read.documents.
lexicalize to generate the same output from raw text data.
word.counts to compute statistics associated with a corpus.
merge.documents for operations on a collection of documents.
## Read files using default values.
setwd("corpus directory")
documents <- read.documents()
vocab <- read.vocab()
## Read files from another location.
documents <- read.documents("corpus directory/features")
vocab <- read.vocab("corpus directory/labels")