corpus-class: Base method extensions for corpus objects

Description

Extensions of base R functions for corpus objects.

Usage

# S3 method for corpus
c1 + c2
# S3 method for corpus
c(..., recursive = FALSE)
# S3 method for corpus
x[i, drop_docid = TRUE]
# S3 method for summary.corpus
print(x, ...)

Value

The + operator and the c() method return a corpus object.

Indexing a corpus works in three ways, as of v2.x.x:

[ returns a subsetted corpus
[[ returns the textual contents of a subsetted corpus (similar to as.character())
$ returns a vector containing the single named docvar
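
As a minimal sketch of the three indexing forms listed above, the following builds a small corpus and applies each one. The object names (corp, sub, txt, lng) and the lang document variable are hypothetical and chosen only for illustration; the behaviour shown assumes quanteda v2 or later.

```r
library(quanteda)

# A tiny illustrative corpus with one document variable (hypothetical data)
corp <- corpus(
  c(doc1 = "Alpha text.", doc2 = "Beta text."),
  docvars = data.frame(lang = c("en", "de"), stringsAsFactors = FALSE)
)

sub <- corp["doc2"]    # [  : a corpus containing only doc2
txt <- corp[["doc2"]]  # [[ : the text of doc2 as a character value
lng <- corp$lang       # $  : the lang document variable as a vector

class(sub)
class(txt)
lng
```

Choosing between the forms depends on what is needed downstream: [ keeps the document variables attached, while [[ and $ each extract a single component.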

Arguments

c1: the first corpus to be combined
c2: the second corpus to be combined
recursive: logical used by the c() method, always set to FALSE
x: a corpus object
i: document names or indices for documents to extract.
drop_docid: if TRUE, drop docid for documents removed as the result of extraction.

Details

The + operator combines two corpus objects. Any docvars() that do not match are kept, with NA values filled in for the corpus that lacks that field. Corpus-level metadata is concatenated, except for source and notes, which are stamped with information about the creation of the new joined corpus.
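
The handling of non-matching document variables can be illustrated with a short sketch. The corpora corp_a and corp_b, their texts, and the party and year variables are hypothetical; the example assumes quanteda v2 or later and document names that are unique across both corpora.

```r
library(quanteda)

# Two small corpora with different document variables (hypothetical data)
corp_a <- corpus(
  c(d1 = "First text.", d2 = "Second text."),
  docvars = data.frame(party = c("A", "B"), stringsAsFactors = FALSE)
)
corp_b <- corpus(
  c(d3 = "Third text."),
  docvars = data.frame(year = 2010L)
)

# Combine the two; each corpus lacks one of the two variables
combined <- corp_a + corp_b

ndoc(combined)
# Inspect where NA was filled in for the missing fields
docvars(combined)
```

Per the description above, party should be NA for d3 and year should be NA for d1 and d2.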

The c() method is also defined for corpus class objects, and provides an easy way to combine more than two corpus objects in a single call.

Future revisions of quanteda need to address how factors are used to store document variables and metadata. Currently most or all of these are not stored as factors, because the data.frame() calls that create and store the document-level information use stringsAsFactors = FALSE. That setting is used because the texts should always be stored as character vectors and never as factors.

Examples

# concatenate corpus objects
corp1 <- corpus(data_char_ukimmig2010[1:2])
corp2 <- corpus(data_char_ukimmig2010[3:4])
corp3 <- corpus(data_char_ukimmig2010[5:6])
summary(c(corp1, corp2, corp3))

# two ways to index corpus elements
data_corpus_inaugural["1793-Washington"]
data_corpus_inaugural[2]

# return the text itself
data_corpus_inaugural[["1793-Washington"]]