filter.words: Functions to manipulate text corpora in LDA format.

Description

merge.documents concatenates a set of documents. filter.words removes references to certain words from a collection of documents. shift.word.indices adjusts references to words by a fixed amount.

Usage

merge.documents(...)
filter.words(documents, to.remove)
shift.word.indices(documents, amount)

Arguments

...

For merge.documents, the set of corpora to be merged. All arguments to ... must be corpora of the same length. The documents in the same position in each of the arguments will be concatenated, i.e., the new document 1 wil
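
As a rough illustration of position-wise concatenation, the sketch below mimics the behaviour in base R on two made-up corpora. It assumes each document is a two-row integer matrix with word indices in row 1 and counts in row 2, as in the example output on this page. merge_sketch is a hypothetical stand-in, not the package's implementation.

```r
# Toy corpora: each document is a 2-row integer matrix
# (row 1 = word indices, row 2 = counts). All values are made up.
corpus.a <- list(matrix(c(1L, 4L, 23L, 1L), nrow = 2),
                 matrix(c(34L, 1L), nrow = 2))
corpus.b <- list(matrix(c(7L, 2L), nrow = 2),
                 matrix(c(5L, 3L, 9L, 1L), nrow = 2))

# Hypothetical stand-in for merge.documents: column-bind the documents
# that share a position across the input corpora.
merge_sketch <- function(...) {
  mapply(cbind, ..., SIMPLIFY = FALSE)
}

merged <- merge_sketch(corpus.a, corpus.b)
```

Here merged[[1]] holds the columns of corpus.a[[1]] followed by those of corpus.b[[1]], and likewise for the second document.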

documents

For filter.words and shift.word.indices, the corpus to be operated on.

to.remove

For filter.words, an integer vector of words to filter. The words in each document which also exist in to.remove will be removed.
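
The filtering rule can be sketched in base R on a single made-up document. This assumes the two-row matrix layout (word indices in row 1, counts in row 2) and is only an approximation of what filter.words does per document, not its actual source.

```r
# Toy document: word indices in row 1, counts in row 2 (values made up).
doc <- matrix(c(1L, 4L, 23L, 1L, 34L, 3L), nrow = 2)
to.remove <- c(23L, 99L)

# Keep only the columns whose word index is not in to.remove.
# drop = FALSE keeps the result a matrix even if one column remains.
kept <- doc[, !(doc[1, ] %in% to.remove), drop = FALSE]
```

Entries of to.remove that do not occur in the document (99 here) have no effect.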

amount

For shift.word.indices, an integer scalar by which to shift the vocabulary in the corpus. amount will be added to each entry of the word field in the corpus.
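
A minimal base-R sketch of the shift on one made-up document, assuming the word field is row 1 of the two-row matrix and the counts sit in row 2. The variable names are illustrative only.

```r
# Toy document: word indices in row 1, counts in row 2 (values made up).
doc <- matrix(c(1L, 6L, 94L, 3L), nrow = 2)
amount <- 100L

# Add amount to every word index; the counts in row 2 are left untouched.
shifted.doc <- doc
shifted.doc[1, ] <- shifted.doc[1, ] + amount
```

Shifting by an amount at least as large as the vocabulary size keeps the shifted indices from colliding with the original ones, which matters when the corpora are merged afterwards.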

Value

A corpus with the documents merged, the words filtered, or the word indices shifted, depending on the function called. The format of the input and output corpora is described in lda.collapsed.gibbs.sampler.

Examples

data(cora.documents)

## Just use a small subset for the example.
corpus <- cora.documents[1:6]
## Get the word counts.
wc <- word.counts(corpus)

## Only keep the words which occur more than 4 times.
filtered <- filter.words(corpus,
                         as.numeric(names(wc)[wc <= 4]))
## [[1]]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1   23   34   37   44
## [2,]    4    1    3    4    1
##
## [[2]]
##      [,1] [,2]
## [1,]   34   94
## [2,]    1    1
## ... long output omitted ...

## Shift the word indices in the second half of the corpus by 100.
shifted <- shift.word.indices(filtered[4:6], 100)
## [[1]]
##      [,1] [,2] [,3]
## [1,]  134  281  307
## [2,]    2    5    7
##
## [[2]]
##      [,1] [,2]
## [1,]  101  123
## [2,]    1    4
##
## [[3]]
##      [,1] [,2]
## [1,]  101  194
## [2,]    6    3

## Combine the unshifted documents and the shifted documents.
merge.documents(filtered[1:3], shifted)
## [[1]]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    1   23   34   37   44  134  281  307
## [2,]    4    1    3    4    1    2    5    7
##
## [[2]]
##      [,1] [,2] [,3] [,4]
## [1,]   34   94  101  123
## [2,]    1    1    1    4
##
## [[3]]
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]   34   37   44   94  101  194
## [2,]    4    1    7    1    6    3
