quanteda (version 0.9.9-3)

corpus_reshape: change the document units of a corpus

Description

For a corpus, recast the documents down or up a level of aggregation. "Down" means going from documents to smaller units such as sentences or paragraphs; "up" means going from those units back to documents. This makes it easy to reshape a corpus from a collection of documents into a collection of sentences, for instance. Because the corpus object records its current "units" status, there is no from option, only to.

Usage

corpus_reshape(x, to = c("sentences", "paragraphs", "documents"), ...)

Arguments

x
corpus whose document units will be reshaped
to
new document units into which the corpus will be recast
...
not used

Value

A corpus object whose documents are the new units, with document-level metadata identifying the original documents.

Details

Note: Only recasting down currently works, but upward recasting is planned.
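
To make the idea of recasting down concrete, the following is a minimal base-R sketch. It is not quanteda's implementation: the sentence splitter is a naive regular expression, and the column names original_doc and sentence_no are hypothetical, chosen only for illustration. It shows each sentence becoming its own unit while a record of the source document is kept alongside it.

```r
# Illustrative only: a base-R sketch of "recasting down", not quanteda's code.
docs <- c(textone = "This is a sentence.  Another sentence.  Yet another.",
          textwo  = "Premiere phrase.  Deuxieme phrase.")

# Naive sentence split: break on whitespace that follows a full stop.
pieces <- strsplit(docs, "(?<=\\.)\\s+", perl = TRUE)

# One row per sentence, keeping the name of the document it came from.
# The column names here are hypothetical, not quanteda's metadata fields.
sentences <- data.frame(
  original_doc = rep(names(pieces), lengths(pieces)),
  sentence_no  = sequence(lengths(pieces)),
  text         = unlist(pieces, use.names = FALSE),
  stringsAsFactors = FALSE
)
print(sentences)
```

The two input documents become five rows, one per sentence, and original_doc plays the role that the document-level metadata plays in the reshaped corpus.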

Examples

# simple example
mycorpus <- corpus(c(textone = "This is a sentence.  Another sentence.  Yet another.", 
                     textwo = "Premiere phrase.  Deuxieme phrase."), 
                   docvars = data.frame(country=c("UK", "USA"), year=c(1990, 2000)),
                   metacorpus = list(notes = "Example showing how corpus_reshape() works."))
summary(mycorpus)
summary(corpus_reshape(mycorpus, to = "sentences"), showmeta=TRUE)

# example with inaugural corpus speeches
(mycorpus2 <- corpus_subset(data_corpus_inaugural, Year>2004))
paragCorpus <- corpus_reshape(mycorpus2, to="paragraphs")
paragCorpus
summary(paragCorpus, 100, showmeta=TRUE)
## Note that Bush 2005 is recorded as a single paragraph because that text used a single
## \n to mark the end of a paragraph.
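
The note about Bush 2005 can be illustrated with a small base-R sketch. It assumes, based only on that note, that a paragraph boundary is a run of two or more newlines; the exact delimiter quanteda uses is not stated here, so treat the pattern below as hypothetical.

```r
# Illustrative only: base R, with an assumed delimiter of two or more newlines.
texts <- c(blank_line_breaks  = "First paragraph.\n\nSecond paragraph.",
           single_line_breaks = "First paragraph.\nSecond paragraph.")

# Split on runs of two or more newlines; a lone \n is not treated as a boundary.
paragraphs <- strsplit(texts, "\n{2,}")

# Number of paragraphs found in each text.
n_paragraphs <- lengths(paragraphs)
print(n_paragraphs)
```

Under this assumption the text that separates paragraphs with a single \n comes back as one unit, which matches the behaviour described for Bush 2005.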