SimpleCorpus: Simple Corpora

Description

Create simple corpora.

Usage

SimpleCorpus(x, control = list(language = "en"))

Arguments

x

a DirSource or VectorSource.

control

a named list of control parameters.

language: a character giving the language (preferably as IETF language tags, see language in package NLP). The default language is assumed to be English ("en").

Value

An object inheriting from SimpleCorpus and Corpus.
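As a minimal sketch of what this return value looks like in practice, the class of a freshly built corpus can be checked with inherits(). The two input strings are made-up illustration data, not part of the package.

```r
library(tm)

# Hypothetical toy input; any character vector works with VectorSource
docs <- c("first document", "second document")
corp <- SimpleCorpus(VectorSource(docs))

# The result should inherit from both classes named above
inherits(corp, "SimpleCorpus")
inherits(corp, "Corpus")

# One corpus element per input string
length(corp)
```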

Details

A simple corpus is kept fully in memory. Compared to a VCorpus, it is optimized for the most common usage scenario: importing plain texts from files in a directory or directly from a vector in R, preprocessing and transforming the texts, and finally exporting them to a term-document matrix. It adheres to the Corpus API. Internally, however, it takes various shortcuts to boost performance and minimize memory pressure; consequently, it operates only under the following constraints:

only DirSource and VectorSource are supported,
no custom readers, i.e., each document is read in and stored as plain text (as a string, i.e., a character vector of length one),
transformations applied via tm_map must be able to process character vectors and return character vectors (of the same length),
no lazy transformations in tm_map,
no meta data for individual documents (i.e., no "local" in meta).
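The usage scenario and constraints above can be sketched roughly as follows. The sample texts and the choice of transformations are assumptions made for illustration; the point is that each function passed to tm_map takes a character vector and returns a character vector of the same length.

```r
library(tm)

# Hypothetical sample texts, imported directly from a vector in R
texts <- c("The quick brown fox.", "Jumps over the lazy dog!")
corp <- SimpleCorpus(VectorSource(texts), control = list(language = "en"))

# Each transformation maps character vectors to character vectors
# of the same length, as a simple corpus requires
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)

# Export the preprocessed texts to a document-term matrix
dtm <- DocumentTermMatrix(corp)
inspect(dtm)
```

Because every document is stored as a plain string, content(corp) returns an ordinary character vector, which is why vectorized functions such as tolower fit this interface.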

Examples


# NOT RUN {
txt <- system.file("texts", "txt", package = "tm")
(ovid <- SimpleCorpus(DirSource(txt, encoding = "UTF-8"),
                      control = list(language = "lat")))
# }
