Data structures and operators for distributed corpora.
DCorpus( x,
readerControl = list(reader = reader(x),
language = "en"),
storage = NULL, keep = TRUE, ... )
# S3 method for DCorpus
as.VCorpus(x)
as.DCorpus( x, storage = NULL, ... )
readerControl
A list with the named components reader, representing a reading function capable of handling the file format found in x, and language, giving the text's language (preferably as IETF language tags, see language in package NLP).
storage
The storage subsystem to use with the DCorpus. Two types of storage are currently supported: local disk storage using the Local File System (LFS) and the Hadoop Distributed File System (HDFS). Default: 'LFS'.
keep
Should revisions be used when operating on the DCorpus? Default: TRUE.
...
Optional arguments for the reader.
An object inheriting from DCorpus and Corpus.
When constructing a distributed corpus, the input source is extracted via the supplied reader and stored on the given file system (argument storage). While the data set resides on the corresponding storage (e.g., HDFS), only a symbolic representation is held in R (a so-called DList), which allows the corpus to be accessed via the corresponding DList methods. Since the memory available to the distributed corpus is restricted only by the available disk space in the given storage (and not by main memory, as in a standard tm corpus), by default we also store a set of so-called revisions, i.e., stages of the (processed) corpus. Revisions can be turned off later on using the keepRevisions() replacement function.
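As a minimal sketch of the last point, revisions can be switched off on an existing distributed corpus. The sketch assumes that the packages tm and tm.plugin.dc are installed and that the default local storage is used. The replacement-call form follows from keepRevisions() being a replacement function, and the object name dc is purely illustrative.

```r
library("tm")             # provides the 'crude' example corpus
library("tm.plugin.dc")   # assumed to be the package that provides DCorpus

data("crude")
dc <- as.DCorpus(crude)   # coerce to a distributed corpus

# Stop storing intermediate stages (revisions) of the corpus from here on
keepRevisions(dc) <- FALSE
```

Turning revisions off this way affects later processing steps only; whether to do so is a trade-off between disk usage and the ability to return to earlier stages.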
The constructed corpus object inherits from a tm Corpus and has several slots containing meta information:
meta
Corpus Meta Data contains corpus-specific meta data in the form of tag-value pairs.
dmeta
Document Meta Data of class data.frame contains document-specific meta data for the corpus. This is mainly available for compatibility with standard tm corpus definitions but is not yet actually used in the distributed scenario.
keep
A logical indicating whether revisions representing stages (e.g., in a preprocessing chain) should be kept or not.
Corpus for basic information on the corpus infrastructure employed by package tm.
# NOT RUN {
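## Similar to example in package 'tm'
library("tm")            # DirSource(), readReut21578XMLasPlain, 'crude' data
library("tm.plugin.dc")  # assumed to be the package that provides DCorpus()
reut21578 <- system.file("texts", "crude", package = "tm")
## Build a distributed corpus from the Reuters-21578 sample files
dc <- DCorpus(DirSource(reut21578),
              readerControl = list(reader = readReut21578XMLasPlain))
dc
## Coercion
data("crude")
as.DCorpus(crude)
as.VCorpus(dc)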
# }