a character giving the language (preferably as
IETF language tags, see language in
package NLP).
The default language is assumed to be English ("en").
Value
An object inheriting from SimpleCorpus and Corpus.
Details
A simple corpus is fully kept in memory. Compared to a VCorpus,
it is optimized for the most common usage scenario: importing plain texts from
files in a directory or directly from a vector in R, preprocessing and
transforming the texts, and finally exporting them to a term-document matrix.
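
The workflow just described can be sketched as follows. This is a minimal illustration, not part of the package's own examples: it assumes the tm package is installed, the two example strings are made up, and the transformations are chosen only because they map a character vector to a character vector of the same length.

```r
library(tm)

# Hypothetical example texts; any character vector can be wrapped in a VectorSource.
texts <- c("The quick brown fox.", "Jumps over the lazy dog!")

# Build a simple corpus directly from a vector in R, stating the language explicitly.
corp <- SimpleCorpus(VectorSource(texts), control = list(language = "en"))

# Preprocess: each function takes a character vector and returns one of equal length.
corp <- tm_map(corp, tolower)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, stripWhitespace)

# Export the processed texts to a term-document matrix.
tdm <- TermDocumentMatrix(corp)
inspect(tdm)
```

Plain functions such as tolower are passed to tm_map directly here because a simple corpus stores its documents as one character vector.
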
It adheres to the Corpus API. However, it internally takes
various shortcuts to boost performance and minimize memory
pressure; consequently it operates only under the following constraints:
only DataframeSource, DirSource and VectorSource
are supported,
no custom readers can be used: each document is read in and stored as plain
text (a string, i.e., a character vector of length one),
transformations applied via tm_map must be able to
process character vectors and return character vectors (of the same
length),