alignCorpus: Align the vocabulary of a new corpus to an old corpus

Description

Takes a list of documents, a vocabulary, and (optionally) metadata for a corpus of previously unseen documents and aligns them to an old vocabulary. This helps preprocess documents for fitNewDocuments.

Usage

alignCorpus(new, old.vocab, verbose = TRUE)

Arguments

new

a list (such as those produced by textProcessor or prepDocuments) containing a list of documents in stm format, a character vector containing the vocabulary, and optionally a data.frame containing metadata. These should be labeled documents, vocab, and meta respectively. This is the new set of unseen documents, which will be returned with the vocab renumbered and all words not appearing in old.vocab removed.

old.vocab

a character vector containing the vocabulary that you want to align to. In general this will be the vocab used in your original stm model fit; for an stm object called mod, it can be accessed as mod$vocab.

verbose

a logical indicating whether information about the new corpus should be printed to the screen. Defaults to TRUE.
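
As a rough illustration of the shapes these arguments take, the sketch below builds a tiny input by hand. The object names (toy_new, toy_old_vocab), the words, and the counts are all invented for illustration. It assumes the usual stm document format: each document is a two-row integer matrix whose first row holds 1-based indices into vocab and whose second row holds the matching counts.

```r
# Hypothetical toy input; every word and count here is invented.
toy_new <- list(
  documents = list(
    # Filled column by column: (index, count) pairs.
    matrix(c(1L, 2L,    # "border" appears twice
             3L, 1L),   # "jobs" appears once
           nrow = 2),
    matrix(c(2L, 1L,    # "crime" appears once
             3L, 3L),   # "jobs" appears three times
           nrow = 2)
  ),
  vocab = c("border", "crime", "jobs"),
  meta  = data.frame(treatment = c(0, 1))
)

# Hypothetical vocabulary of a previously fitted model (mod$vocab in practice).
toy_old_vocab <- c("crime", "jobs", "tax")

# Words in the new corpus that the old vocabulary does not contain.
setdiff(toy_new$vocab, toy_old_vocab)
```

In real use, new would normally come from textProcessor rather than being written out by hand.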

Value

documents

A list containing the documents in the stm format.

vocab

Character vector of vocabulary.

meta

The metadata for the new documents, returned when new contains a meta element.

Details

When estimating topic proportions for previously unseen documents using fitNewDocuments, the new documents must have the same vocabulary, ordered in the same way, as the original model. This function helps with that process.
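
To make the renumbering idea concrete, here is a minimal base R sketch of what aligning to an old vocabulary involves. It is not the stm implementation: align_sketch is a hypothetical helper, the data are invented, and documents left with no words after alignment are not handled.

```r
# Hypothetical helper, for illustration only (not part of stm).
align_sketch <- function(documents, new_vocab, old_vocab) {
  # Position of each new-vocab word in the old vocab (NA if absent).
  new_to_old <- match(new_vocab, old_vocab)
  lapply(documents, function(doc) {
    idx  <- new_to_old[doc[1, ]]       # translate indices to the old numbering
    keep <- !is.na(idx)                # drop words the old vocab lacks
    out  <- doc[, keep, drop = FALSE]
    out[1, ] <- idx[keep]
    out[, order(out[1, ]), drop = FALSE]  # keep indices in increasing order
  })
}

# Invented toy data: two documents as (index, count) columns.
new_vocab <- c("border", "crime", "jobs")
old_vocab <- c("crime", "jobs", "tax")
docs <- list(
  matrix(c(1L, 2L, 3L, 1L), nrow = 2),  # border x2, jobs x1
  matrix(c(2L, 1L, 3L, 3L), nrow = 2)   # crime x1, jobs x3
)

aligned <- align_sketch(docs, new_vocab, old_vocab)
aligned[[1]]  # "border" is dropped; "jobs" is now index 2
```

After alignment, every index in the first row of each document refers to a position in old_vocab, which is what makes the documents usable with a model fitted on that vocabulary.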

Note: the code is not built for speed or memory efficiency. If you are trying to do this with a very large corpus of new texts, you might consider building the object yourself using quanteda or some other option.

Examples

# NOT RUN {
# we process an original set that is just the first 100 documents
temp <- textProcessor(documents=gadarian$open.ended.response[1:100],
                      metadata=gadarian[1:100,])
out <- prepDocuments(temp$documents, temp$vocab, temp$meta)
set.seed(02138)
# max.em.its is set low to make this run fast; in practice, run models to convergence!
mod.out <- stm(out$documents, out$vocab, 3, prevalence=~treatment + s(pid_rep), 
              data=out$meta, max.em.its=5)
#now we process the remaining documents
temp<-textProcessor(documents=gadarian$open.ended.response[101:nrow(gadarian)],
                    metadata=gadarian[101:nrow(gadarian),])
# note we don't run prepDocuments here because we don't want to drop any words; we want
# every word that showed up in the old documents.
newdocs <- alignCorpus(new=temp, old.vocab=mod.out$vocab)
#we get some helpful feedback on what has been retained and lost in the print out.
#and now we can fit our new held-out documents
fitNewDocuments(model=mod.out, documents=newdocs$documents, newData=newdocs$meta,
                origData=out$meta, prevalence=~treatment + s(pid_rep),
                prevalencePrior="Covariate")
# }