Representing and computing on text documents.
Text documents are documents containing (natural language) text. In
packages that employ the infrastructure provided by package NLP, such
documents are represented via the virtual S3 class "TextDocument": such
packages then provide S3 text document classes extending the virtual
base class (such as the AnnotatedPlainTextDocument objects provided by
package NLP itself).
All extension classes must provide an as.character() method which
extracts the natural language text in documents of the respective
classes in a “suitable” (not necessarily structured) form, as well as
content() and meta() methods for accessing the (possibly raw) document
content and metadata.
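The requirements above can be illustrated with a minimal sketch of an extension class. The class name "SimpleTextDocument", its constructor, and its internal list layout are hypothetical and chosen only for illustration; the sketch assumes that package NLP is installed and exports the content() and meta() generics, with meta() taking an optional tag argument.

```r
library(NLP)

## Hypothetical toy class: a plain list holding the text and its metadata,
## tagged as extending the virtual base class "TextDocument".
SimpleTextDocument <- function(text, meta = list()) {
    doc <- list(content = text, meta = meta)
    class(doc) <- c("SimpleTextDocument", "TextDocument")
    doc
}

## The three methods every extension class is expected to provide.
as.character.SimpleTextDocument <- function(x, ...) x$content

content.SimpleTextDocument <- function(x) x$content

meta.SimpleTextDocument <- function(x, tag = NULL, ...) {
    ## Return all metadata, or a single entry when a tag is given.
    if (is.null(tag)) x$meta else x$meta[[tag]]
}

toy <- SimpleTextDocument("A toy document.", meta = list(author = "nobody"))
as.character(toy)
content(toy)
meta(toy, "author")
```

For this toy class the raw content and the extracted text coincide; a class wrapping markup or tagged input would typically return the raw form from content() and the plain text from as.character().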
In addition, the infrastructure features the generic functions words(),
sents(), etc., for which extension classes can provide methods giving a
structured view of the text contained in documents of these classes
(returning, e.g., a character vector with the word tokens in these
documents for words(), and a list of such character vectors, one per
sentence, for sents()).
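As a rough illustration of these structured views, the following sketch builds an AnnotatedPlainTextDocument from hand-written sentence and word annotations and queries it with words() and sents(). It assumes a recent version of package NLP in which AnnotatedPlainTextDocument() accepts a single Annotation object; the example text and the character offsets are made up for illustration, and in practice the annotations would normally come from an annotator pipeline.

```r
library(NLP)

## Example text; the offsets used below refer to character positions in it.
s <- String("  First sentence.  Second sentence.  ")

## Hand-written annotations: two sentences and four word tokens.
a_sent <- Annotation(1:2, rep.int("sentence", 2L),
                     c(3L, 20L), c(17L, 35L))
a_word <- Annotation(3:6, rep.int("word", 4L),
                     c(3L, 9L, 20L, 27L), c(7L, 16L, 25L, 34L))

adoc <- AnnotatedPlainTextDocument(s, c(a_sent, a_word))

## Flat view: the word tokens of the whole document.
words(adoc)

## Structured view: the word tokens grouped by sentence.
sents(adoc)
```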
See AnnotatedPlainTextDocument, CoNLLTextDocument, CoNLLUTextDocument,
TaggedTextDocument, and WordListDocument for the text document classes
provided by package NLP.