corpus: construct a corpus object

Description

Creates a corpus object from available sources. The currently available sources are:

a character vector, consisting of one document per element; if the elements are named, these names will be used as document names.
a readtext object from the readtext package (a specially constructed data.frame).
a data.frame, in which the variable containing the documents is by default a character vector named text; a different variable can be selected using the text_field argument. Other variables are imported as document-level metadata.
a kwic object constructed by kwic.
a tm VCorpus class object, with the fixed metadata fields imported as document-level metadata. Corpus-level metadata is not currently imported.
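
As a minimal sketch of the first source type above, the following builds a corpus from a named character vector. The object names txts and corp and the two example sentences are made up for illustration; the calls assume the quanteda package is installed.

```r
library(quanteda)

# a named character vector: one document per element
txts <- c(first  = "This is the first document.",
          second = "Here is a second, slightly longer document.")

corp <- corpus(txts)

ndoc(corp)      # number of documents in the corpus
docnames(corp)  # document names, taken from the names of the vector
```

If the vector had no names, the document names would fall back to the defaults described under the docnames argument.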

Usage

corpus(x, docnames = NULL, docvars = NULL, text_field = "text",
  metacorpus = NULL, compress = FALSE, ...)

Arguments

x

a valid corpus source object

docnames

Names to be assigned to the texts, defaults to the names of the character vector (if any), otherwise assigns "text1", "text2", etc.

docvars

A data frame of attributes that is associated with each text.

text_field

the character name or numeric index of the source data.frame indicating the variable to be read in as text, which must be a character vector. All other variables in the data.frame will be imported as docvars. This argument is only used for data.frame objects (including those created by readtext).
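
The two ways of specifying text_field can be sketched as follows. The data.frame df and its columns year and body are hypothetical and exist only for this illustration; the remaining column is expected to arrive as a document variable.

```r
library(quanteda)

# hypothetical data.frame: the texts are in the second column, "body"
df <- data.frame(year = c(2001L, 2002L, 2003L),
                 body = c("One text.", "Another text.", "A third text."),
                 stringsAsFactors = FALSE)

# select the text variable by name ...
corp_by_name <- corpus(df, text_field = "body")

# ... or by its numeric column index
corp_by_index <- corpus(df, text_field = 2)

# the remaining variable is imported as a docvar
names(docvars(corp_by_name))
```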

metacorpus

a named list containing additional (character) information to be added to the corpus as corpus-level metadata. Special fields recognized by summary.corpus are:

source a description of the source of the texts, used for referencing;
citation information on how to cite the corpus; and
notes any additional information about who created the text, warnings, to do lists, etc.

compress

logical; if TRUE, compress the texts in memory using gzip compression. This significantly reduces the size of the corpus in memory, but will slow down operations that require the texts to be extracted.

...

not used directly

Value

A corpus-class object containing the original texts, document-level variables, document-level metadata, corpus-level metadata, and default settings for subsequent processing of the corpus.

A warning on accessing corpus elements

A corpus is currently a specially classed S3 list of elements, but you should not access these elements directly. Use the extractor and replacement functions instead. Otherwise your code will not only be uglier, but is also likely to break should the internal structure of a corpus object change (as it inevitably will as we continue to develop the package, including moving corpus objects to the S4 class system).
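
A short sketch of what using the extractor and replacement functions looks like in practice. It assumes the version of the package documented on this page, in which texts() is the text extractor; the corpus contents and the docvar name group are invented for the example.

```r
library(quanteda)

corp <- corpus(c(d1 = "First text.", d2 = "Second text."),
               docvars = data.frame(group = c("a", "b")))

# extractor functions: read from the corpus without touching its internals
texts(corp)     # the texts
docnames(corp)  # the document names
docvars(corp)   # the document-level variables

# replacement functions: modify the corpus through the same interface
docnames(corp) <- c("alpha", "beta")
docvars(corp, "group") <- c("x", "y")
```

Code written this way depends only on the accessor functions, not on how the list inside the corpus object happens to be laid out.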

Details

The texts and document variables of corpus objects can also be accessed using index notation. Indexing a corpus object as a vector will return its text, equivalent to texts(x). Note that this is not the same as subsetting the entire corpus -- this should be done using the subset method for a corpus.

Indexing a corpus using two indexes (integers or column names) will return the document variables, equivalent to docvars(x). Because a corpus is also a list, it is also possible to access, create, or replace docvars using list notation, e.g.

myCorpus[["newSerialDocvar"]] <- paste0("tag", 1:ndoc(myCorpus))

For details, see corpus-class.
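
The list notation described above can be tried on a small corpus. This is only a sketch: it assumes the indexing behaviour of the package version documented here, and the three example texts are made up.

```r
library(quanteda)

myCorpus <- corpus(c(d1 = "First text.",
                     d2 = "Second text.",
                     d3 = "Third text."))

# create a new docvar using list notation
myCorpus[["newSerialDocvar"]] <- paste0("tag", seq_len(ndoc(myCorpus)))

# read it back through the docvars() extractor
docvars(myCorpus, "newSerialDocvar")
```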

Examples


# create a corpus from texts
corpus(data_char_ukimmig2010)

# create a corpus from texts and assign document variables
summary(corpus(data_char_ukimmig2010, 
               docvars = data.frame(party = names(data_char_ukimmig2010))), 5) 

# create a corpus from the texts extracted from an existing corpus
corpus(texts(data_corpus_irishbudget2010))

# import a tm VCorpus
if (requireNamespace("tm", quietly = TRUE)) {
    data(crude, package = "tm")    # load in a tm example VCorpus
    mytmCorpus <- corpus(crude)
    summary(mytmCorpus, showmeta = TRUE)

    data(acq, package = "tm")      # a second tm example VCorpus
    summary(corpus(acq), 5, showmeta = TRUE)
    
    tmCorp <- tm::VCorpus(tm::VectorSource(data_char_ukimmig2010))
    quantCorp <- corpus(tmCorp)
    summary(quantCorp)
}

# construct a corpus from a data.frame
mydf <- data.frame(letter_factor = factor(rep(letters[1:3], each = 2)),
                  some_ints = 1L:6L,
                  some_text = paste0("This is text number ", 1:6, "."),
                  stringsAsFactors = FALSE,
                  row.names = paste0("fromDf_", 1:6))
mydf
summary(corpus(mydf, text_field = "some_text", 
               metacorpus = list(source = "From a data.frame called mydf.")))

# construct a corpus from a kwic object
mykwic <- kwic(data_corpus_inaugural, "southern")
summary(corpus(mykwic))
