
quanteda (version 0.99.12)

corpus-class: base method extensions for corpus objects

Description

Extensions of base R functions for corpus objects.

Usage

# S3 method for corpus
print(x, ...)

is.corpus(x)

is.corpuszip(x)

# S3 method for summary.corpus
print(x, ...)

# S3 method for corpus
+(c1, c2)

# S3 method for corpus
c(..., recursive = FALSE)

# S3 method for corpus
[(x, i, j = NULL, ..., drop = TRUE)

# S3 method for corpus
[[(x, i, ...)

# S3 method for corpus
[[(x, i) <- value

# S3 method for corpus
str(object, ...)

Arguments

x

a corpus object

...

not used

c1

corpus one to be added

c2

corpus two to be added

recursive

logical used by `c()` method, always set to `FALSE`

i

index for documents or rows of document variables

j

index for column of document variables

drop

if TRUE, return a vector if extracting a single document variable; if FALSE, return it as a single-column data.frame. See drop for further details.

value

a vector that will form a new docvar

object

the corpus about which you want structural information

Value

is.corpus returns TRUE if the object is a corpus

is.corpuszip returns TRUE if the object is a compressed corpus
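
As a minimal sketch of the class check described above (the one-document corpus here is a hypothetical illustration, not taken from the package data), `is.corpus()` can be called on a corpus and on a plain character vector:

```r
library(quanteda)

# Hypothetical one-document corpus built from a named character vector
corp <- corpus(c(doc1 = "A short text."))

is.corpus(corp)             # the object is a corpus
is.corpus("A short text.")  # a plain character vector is not
```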

Details

The + operator combines two corpus objects. Where docvars or metadoc fields do not match, the values are set to NA for the corpus lacking that field. Corpus-level metadata is concatenated, except for source and notes, which are stamped with information about the creation of the new joined corpus.
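
The behaviour of + with non-matching docvars can be sketched as follows. The two corpora, their texts, and the docvar names `party` and `year` are made up for illustration; the exact layout of the printed docvars may differ between quanteda versions.

```r
library(quanteda)

# Two small hypothetical corpora with different document variables
corp_a <- corpus(c(a1 = "Taxes should fall.", a2 = "Spending should rise."),
                 docvars = data.frame(party = c("FF", "FG"),
                                      stringsAsFactors = FALSE))
corp_b <- corpus(c(b1 = "The budget is balanced."),
                 docvars = data.frame(year = 2010))

# Join them; each corpus lacks one of the two docvars
combined <- corp_a + corp_b

ndoc(combined)     # number of documents in the joined corpus
docvars(combined)  # inspect how the non-matching fields were filled
```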

A `c()` method is also defined for corpus class objects, and provides an easy way to combine multiple corpus objects.

Future revisions of quanteda need to address some issues concerning the use of factors to store document variables and metadata. Currently most or all of these are not stored as factors, because the data.frame calls that create and store the document-level information use stringsAsFactors = FALSE. This setting is used because the texts should always be stored as character vectors and never as factors.
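
The effect of stringsAsFactors can be seen with base R alone. This is only an illustration of the data.frame argument, with made-up texts and party labels; it does not reproduce quanteda's internal storage.

```r
# Hypothetical document-level data: texts plus one document variable
texts <- c(doc1 = "First text.", doc2 = "Second text.")
party <- c("FF", "Greens")

# With stringsAsFactors = FALSE, character columns stay character
docs_char <- data.frame(texts = texts, party = party,
                        stringsAsFactors = FALSE)

# With stringsAsFactors = TRUE, the same columns become factors
docs_factor <- data.frame(texts = texts, party = party,
                          stringsAsFactors = TRUE)

str(docs_char)
str(docs_factor)
```

Storing the texts as factors would replace each string with an integer code plus a level label, which is why the character representation is preferred for the texts themselves.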

See Also

summary.corpus

Examples

# NOT RUN {
# concatenate corpus objects
corpus1 <- corpus(data_char_ukimmig2010[1:2])
corpus2 <- corpus(data_char_ukimmig2010[3:4])
corpus3 <- corpus(data_char_ukimmig2010[5:6])
summary(c(corpus1, corpus2, corpus3))

# ways to index corpus elements
data_corpus_inaugural["1793-Washington"]    # 2nd Washington inaugural speech
data_corpus_inaugural[2]                    # same
# access the docvars from data_corpus_irishbudget2010
data_corpus_irishbudget2010[, "year"]
# same
data_corpus_irishbudget2010[["year"]]            

# create a new document variable
data_corpus_irishbudget2010[["govtopp"]] <- 
    ifelse(data_corpus_irishbudget2010[["party"]] %in% c("FF", "Greens"), 
           "Government", "Opposition")
docvars(data_corpus_irishbudget2010)
# }
