lda (version 1.1)

lexicalize: Generate LDA Documents from Raw Text

Description

This function reads raw text in doclines format and returns a corpus and vocabulary suitable for the inference procedures defined in the lda package.

Usage

lexicalize(doclines, sep = " ", lower = TRUE, count = 1L, vocab = NULL)

Arguments

doclines
A character vector in which each element is the raw text of one document. See details for a description of how these strings are processed.
sep
Separator string used to tokenize the input strings (default " ").
lower
Logical indicating whether or not to convert all tokens to lowercase (default TRUE).
count
An integer scaling factor to be applied to feature counts. A single observation of a feature will be rendered as count observations in the return value (the default value, 1, is appropriate in most cases). The effect is sketched just after this argument list.
vocab
If left unspecified (or NULL), the vocabulary for the corpus will be automatically inferred from the observed tokens. Otherwise, this parameter should be a character vector specifying acceptable tokens. Tokens not appearing in this vector will be filtered from the documents.
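
To illustrate count, here is a minimal sketch (the output shown assumes the zero-indexed, first-appearance column ordering of the examples below):

lexicalize("a b", count = 3L)$documents
## [[1]]
##      [,1] [,2]
## [1,]    0    1
## [2,]    3    3
## Each single observation of "a" and "b" is rendered as 3 observations.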

Value

If vocab is unspecified or NULL, a list with two components:

documents
A list of document matrices in the format described in lda.collapsed.gibbs.sampler.

vocab
A character vector of unique tokens occurring in the corpus.

If vocab is specified, only the list of document matrices is returned (see the second call in the examples).

Details

This function first tokenizes a character vector by splitting each entry of the vector by sep (note that this is currently a fixed separator, not a regular expression). If lower is TRUE, the tokens are then all converted to lowercase.

At this point, if vocab is NULL, then a vocabulary is constructed from the set of unique tokens appearing across all character vectors. Otherwise, the tokens derived from the character vectors are filtered so that only those appearing in vocab are retained.

Finally, token instances within each document (i.e., original character string) are tabulated in the format described in lda.collapsed.gibbs.sampler.
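
For intuition, the following base-R sketch mirrors these three steps for a single document (illustrative only; it is not the package's actual implementation):

tokens <- strsplit(tolower("I am the very model"), " ", fixed = TRUE)[[1]]
vocab  <- unique(tokens)                     # inferred vocabulary
idx    <- match(tokens, vocab) - 1L          # zero-based word indices
counts <- table(idx)                         # tabulate instances per word
doc    <- rbind(as.integer(names(counts)), as.integer(counts))
doc
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    1    2    3    4
## [2,]    1    1    1    1    1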

See Also

lda.collapsed.gibbs.sampler for the format of the return value.

read.documents to generate the same output from a file encoded in LDA-C format.

word.counts to compute statistics associated with a corpus.

concatenate.documents for operations on a collection of documents.

Examples

## Generate an example.
example <- c("I am the very model of a modern major general",
             "I have a major headache")

corpus <- lexicalize(example, lower=TRUE)

## corpus$documents:
## $documents[[1]]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    0    1    2    3    4    5    6    7    8     9
## [2,]    1    1    1    1    1    1    1    1    1     1
## 
## $documents[[2]]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0   10    6    8   11
## [2,]    1    1    1    1    1

## corpus$vocab:
## $vocab
## [1] "i"        "am"       "the"      "very"     "model"    "of"      
## [7] "a"        "modern"   "major"    "general"  "have"     "headache"

## Only keep words that appear at least twice:
to.keep <- corpus$vocab[word.counts(corpus$documents, corpus$vocab) >= 2]

## Re-lexicalize, using this subsetted vocabulary
documents <- lexicalize(example, lower=TRUE, vocab=to.keep)

## documents:
## [[1]]
##      [,1] [,2] [,3]
## [1,]    0    1    2
## [2,]    1    1    1
## 
## [[2]]
##      [,1] [,2] [,3]
## [1,]    0    1    2
## [2,]    1    1    1
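
## The corpus is now ready for the inference routines. A minimal sketch
## (K, num.iterations, alpha, and eta are arbitrary illustration values):
library(lda)
set.seed(1)
fit <- lda.collapsed.gibbs.sampler(corpus$documents, K = 2, corpus$vocab,
                                   num.iterations = 25, alpha = 0.1, eta = 0.1)
top.topic.words(fit$topics, 5, by.score = TRUE)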
