corpus: constructor for corpus objects

Description

Creates a corpus from a document source. The currently available document sources are:

a character vector (R class character) of texts;
a corpusSource-class object, constructed using textfile;
a tm VCorpus class corpus object, meaning that anything you can use to create a tm corpus, including all of the tm plugins plus the built-in functions of tm for importing pdf, Word, and XML documents, can be used to create a quanteda corpus.

Corpus-level meta-data can be specified at creation, containing (for example) citation information and notes, as can document-level variables and document-level meta-data.

Usage

corpus(x, ...)
"corpus"(x, docnames = NULL, docvars = NULL, source = NULL, notes = NULL, citation = NULL, ...)
"corpus"(x, ...)
"corpus"(x, ...)
"corpus"(x, textField, ...)
"corpus"(x, ...)
is.corpus(x)
"+"(c1, c2)
"c"(..., recursive = FALSE)
"["(x, i, j = NULL, ..., drop = TRUE)
"[["(x, i, ...)
"[["(x, i) <- value

Arguments

x

a source of texts to form the documents in the corpus, a character vector or a corpusSource-class object created using textfile.

...

additional arguments

docnames

Names to be assigned to the texts, defaults to the names of the character vector (if any), otherwise assigns "text1", "text2", etc.

docvars

A data frame of attributes that is associated with each text.

source

A string specifying the source of the texts, used for referencing.

notes

A string containing notes about who created the text, warnings, To Dos, etc.

citation

Information on how to cite the corpus.

textField

the name (as a character string) or integer index of the column in the source data.frame to be read in as text. This column must be of mode character.

c1

corpus one to be added

c2

corpus two to be added

recursive

logical used by `c()` method, always set to `FALSE`

i

index for documents or rows of document variables

j

index for column of document variables

drop

if TRUE, return a vector if extracting a single document variable; if FALSE, return it as a single-column data.frame. See drop for further details.

value

a vector that will form a new docvar

Value

A corpus class object containing the original texts, document-level variables, document-level metadata, corpus-level metadata, and default settings for subsequent processing of the corpus. A corpus currently consists of a specially classed S3 list of elements, but **you should not access these elements directly**. Use the extractor and replacement functions instead; otherwise your code will not only be uglier but is also likely to break should the internal structure of a corpus object change (as it inevitably will as we continue to develop the package, including moving corpus objects to the S4 class system).

is.corpus returns TRUE if the object is a corpus.

Details

The texts and document variables of corpus objects can also be accessed using index notation. Indexing a corpus object as a vector will return its text, equivalent to texts(x). Note that this is not the same as subsetting the entire corpus -- this should be done using the subset method for a corpus. Indexing a corpus using two indexes (integers or column names) will return the document variables, equivalent to docvars(x). Because a corpus is also a list, it is also possible to access, create, or replace docvars using list notation, e.g.

myCorpus[["newSerialDocvar"]] <- 
  paste0("tag", 1:ndoc(myCorpus))
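The drop argument matters when two-index notation returns a single document variable. The sketch below uses only a plain base R data.frame, with made-up document names and values, to show the difference between the two settings. It assumes that the corpus `[` method handles drop the way base R's data.frame indexing does; it does not call quanteda itself.

```r
# Hypothetical document variables held in an ordinary data.frame
# (names and values are invented for illustration).
dv <- data.frame(year = c(2009L, 2010L),
                 party = c("FF", "FG"),
                 row.names = c("doc1", "doc2"),
                 stringsAsFactors = FALSE)

# drop = TRUE: a single column comes back as a plain vector
as_vector <- dv[, "party", drop = TRUE]

# drop = FALSE: the same column comes back as a one-column data.frame,
# keeping the row names that identify each document
as_frame <- dv[, "party", drop = FALSE]

str(as_vector)
str(as_frame)
```

Keeping the data.frame form is useful when later code relies on the row names to match values back to documents.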

The + operator for a corpus object will combine two corpus objects, resolving any non-matching docvars or metadoc fields by making them into NA values for the corpus lacking that field. Corpus-level meta-data is concatenated, except for source and notes, which are stamped with information pertaining to the creation of the new joined corpus.

The `c()` operator is also defined for corpus class objects, and provides an easy way to combine multiple corpus objects.

There are some issues that need to be addressed in future revisions of quanteda concerning the use of factors to store document variables and meta-data. Currently most or all of these are not recorded as factors, because the data.frame calls that create and store the document-level information use stringsAsFactors=FALSE. This setting is used because the texts should always be stored as character vectors and never as factors.
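To make the NA-filling of non-matching document variables concrete, here is a minimal base R sketch of the idea. It is not quanteda's internal implementation: the helper fill_missing() and the example docvars are hypothetical, and the code only mimics how two sets of docvars with different fields could be aligned before being stacked.

```r
# Hypothetical docvars for two corpora with partly different fields.
dv1 <- data.frame(party = c("A", "B"), year = c(2010L, 2010L),
                  row.names = c("doc1", "doc2"), stringsAsFactors = FALSE)
dv2 <- data.frame(party = "C", speaker = "Smith",
                  row.names = "doc3", stringsAsFactors = FALSE)

# Hypothetical helper: add every column in `cols` that `df` lacks,
# filled with NA, and return the columns in a common order.
fill_missing <- function(df, cols) {
  for (nm in setdiff(cols, names(df))) {
    df[[nm]] <- NA
  }
  df[, cols, drop = FALSE]
}

all_cols <- union(names(dv1), names(dv2))
combined <- rbind(fill_missing(dv1, all_cols),
                  fill_missing(dv2, all_cols))

combined
```

Each document keeps its own values, and a field that existed in only one of the inputs is NA for the documents that came from the other. Because stringsAsFactors = FALSE is set, the character fields stay character vectors after stacking.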

Examples

# create a corpus from texts
corpus(inaugTexts)

# create a corpus from texts and assign document variables
ukimmigCorpus <- corpus(ukimmigTexts, 
                        docvars = data.frame(party = names(ukimmigTexts))) 

# create a corpus from the texts of an existing corpus
corpus(texts(ie2010Corpus))

## Not run:
# the fifth column of this csv file is the text field
# mytexts <- textfile("http://www.kenbenoit.net/files/text_example.csv", textField = 5)
# mycorp <- corpus(mytexts)
# mycorp2 <- corpus(textfile("http://www.kenbenoit.net/files/text_example.csv", textField = "Title"))
# identical(texts(mycorp), texts(mycorp2))
# identical(docvars(mycorp), docvars(mycorp2))
## End(Not run)

# import a tm VCorpus
if (requireNamespace("tm", quietly = TRUE)) {
    data(crude, package = "tm")    # load in a tm example VCorpus
    mytmCorpus <- corpus(crude)
    summary(mytmCorpus, showmeta=TRUE)
    
    data(acq, package = "tm")
    summary(corpus(acq), 5, showmeta=TRUE)
    
    tmCorp <- tm::VCorpus(tm::VectorSource(inaugTexts[49:57]))
    quantCorp <- corpus(tmCorp)
    summary(quantCorp)
}

# construct a corpus from a data.frame
mydf <- data.frame(letter_factor = factor(rep(letters[1:3], each = 2)),
                  some_ints = 1L:6L,
                  some_text = paste0("This is text number ", 1:6, "."),
                  stringsAsFactors = FALSE,
                  row.names = paste0("fromDf_", 1:6))
mydf
summary(corpus(mydf, textField = "some_text", source = "From a data.frame called mydf."))

# construct a corpus from a kwic object
mykwic <- kwic(inaugCorpus, "southern")
summary(corpus(mykwic))

# concatenate corpus objects
corpus1 <- corpus(inaugTexts[1:2])
corpus2 <- corpus(inaugTexts[3:4])
corpus3 <- subset(inaugCorpus, President == "Obama")
summary(c(corpus1, corpus2, corpus3))

# ways to index corpus elements
inaugCorpus["1793-Washington"]    # 2nd Washington inaugural speech
inaugCorpus[2]                    # same
ie2010Corpus[, "year"]            # access the docvars from ie2010Corpus
ie2010Corpus[["year"]]            # same

# create a new document variable
ie2010Corpus[["govtopp"]] <- ifelse(ie2010Corpus[["party"]] %in% c("FF", "Greens"), 
                                    "Government", "Opposition")
docvars(ie2010Corpus)
