prepDocuments: Prepare documents for analysis with `stm`

Description

Performs several corpus manipulations including removing words and renumbering word indices (to correct for zero-indexing and/or unused words in the vocab vector).

Usage

prepDocuments(documents, vocab, meta = NULL, lower.thresh = 1,
  upper.thresh = Inf, subsample = NULL, verbose = TRUE)

Arguments

documents

List of documents. For more on the format see stm.

vocab

Character vector of words in the vocabulary.

Value

A list containing a new documents and vocab object.

documents

The new documents object for use with stm

vocab

The new vocab object for use with stm

Details

The default setting lower.thresh=1 means that words which appear in only one document will be dropped. This is often advantageous as there is little information about these words but the added cost of including them in the model can be quite large. In many cases it will be helpful to set this threshold considerably higher. If the vocabulary is in excess of 5000 entries inference can slow quite a bit.

If words are removed, the function returns a vector of the original indices for the dropped items. If it removed documents it returns a vector of doc indices removed. Users with accompanying metadata or texts may want to drop those rows from the corresponding objects.

The behavior is such that when prepDocuments drops documents their corresponding rows are deleted and the row names are not renumbered. We however do not recommend using rownames for joins- instead the best practice is to either keep a unique identifier in the meta object for doing joins or use something like quanteda which has a more robust interface for manipulating the corpus itself.

If you have any documents which are of length 0 in your original object the function will throw an error. These should be removed before running the function although please be sure to remove the corresponding rows in the meta data file if you have one. You can quickly identify the documents using the code: which(unlist(lapply(documents, length))==0).

Examples

Run this code

# NOT RUN {
temp<-textProcessor(documents=gadarian$open.ended.response,metadata=gadarian)
meta<-temp$meta
vocab<-temp$vocab
docs<-temp$documents
out <- prepDocuments(docs, vocab, meta)
# }