lda (version 1.1)

lexicalize: Generate LDA Documents from Raw Text

Description

This function reads raw text in doclines format and returns a corpus and vocabulary suitable for the inference procedures defined in the lda package.

Usage

lexicalize(doclines, sep = " ", lower = TRUE, count = 1L, vocab = NULL)

Arguments

doclines
A character vector in which each element is the raw text of one document. See details for a description of how these strings are processed.
sep
Separator string used to tokenize the input strings (default " ").
lower
Logical indicating whether or not to convert all tokens to lowercase (default TRUE).
count
An integer scaling factor to be applied to feature counts. A single observation of a feature will be rendered as count observations in the return value (the default value, 1, is appropriate in most cases). The effect is sketched just after this argument list.
vocab
If left unspecified (or NULL), the vocabulary for the corpus will be automatically inferred from the observed tokens. Otherwise, this parameter should be a character vector specifying acceptable tokens. Tokens not appearing in this vector will be filtered from the documents.
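
To illustrate count, here is a minimal sketch (the output shown assumes the zero-indexed, first-appearance column ordering of the examples below):

lexicalize("a b", count = 3L)$documents
## [[1]]
##      [,1] [,2]
## [1,]    0    1
## [2,]    3    3
## Each single observation of "a" and "b" is rendered as 3 observations.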

Value

If vocab is unspecified or NULL, a list with two components:

documents
A list of document matrices in the format described in lda.collapsed.gibbs.sampler.

vocab
A character vector of unique tokens occurring in the corpus.

If vocab is specified, only the list of document matrices is returned (see the second call in the examples).

Details

This function first tokenizes a character vector by splitting each entry of the vector by sep (note that this is currently a fixed separator, not a regular expression). If lower is TRUE, the tokens are then all converted to lowercase.

At this point, if vocab is NULL, then a vocabulary is constructed from the set of unique tokens appearing across all character vectors. Otherwise, the tokens derived from the character vectors are filtered so that only those appearing in vocab are retained.

Finally, token instances within each document (i.e., original character string) are tabulated in the format described in lda.collapsed.gibbs.sampler.
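
For intuition, the following base-R sketch mirrors these three steps for a single document (illustrative only; it is not the package's actual implementation):

tokens <- strsplit(tolower("I am the very model"), " ", fixed = TRUE)[[1]]
vocab  <- unique(tokens)                     # inferred vocabulary
idx    <- match(tokens, vocab) - 1L          # zero-based word indices
counts <- table(idx)                         # tabulate instances per word
doc    <- rbind(as.integer(names(counts)), as.integer(counts))
doc
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    1    2    3    4
## [2,]    1    1    1    1    1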

See Also

lda.collapsed.gibbs.sampler for the format of the return value.

read.documents to generate the same output from a file encoded in LDA-C format.

word.counts to compute statistics associated with a corpus.

concatenate.documents for operations on a collection of documents.

Examples

## Generate an example.
example <- c("I am the very model of a modern major general",
             "I have a major headache")

corpus <- lexicalize(example, lower=TRUE)

## corpus$documents:
## $documents[[1]]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    0    1    2    3    4    5    6    7    8     9
## [2,]    1    1    1    1    1    1    1    1    1     1
## 
## $documents[[2]]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0   10    6    8   11
## [2,]    1    1    1    1    1

## corpus$vocab:
## $vocab
## [1] "i"        "am"       "the"      "very"     "model"    "of"      
## [7] "a"        "modern"   "major"    "general"  "have"     "headache"

## Only keep words that appear at least twice:
to.keep <- corpus$vocab[word.counts(corpus$documents, corpus$vocab) >= 2]

## Re-lexicalize, using this subsetted vocabulary
documents <- lexicalize(example, lower=TRUE, vocab=to.keep)

## documents:
## [[1]]
##      [,1] [,2] [,3]
## [1,]    0    1    2
## [2,]    1    1    1
## 
## [[2]]
##      [,1] [,2] [,3]
## [1,]    0    1    2
## [2,]    1    1    1
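
## The corpus is now ready for the inference routines. A minimal sketch
## (K, num.iterations, alpha, and eta are arbitrary illustration values):
library(lda)
set.seed(1)
fit <- lda.collapsed.gibbs.sampler(corpus$documents, K = 2, corpus$vocab,
                                   num.iterations = 25, alpha = 0.1, eta = 0.1)
top.topic.words(fit$topics, 5, by.score = TRUE)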
