Return a function which reads in an XML document. The structure of the XML document is described with a specification.
readXML(spec, doc)
A function with the following formals:
elem
a named list with the component content, which must hold the document to be read in.
language
a string giving the language.
id
a character giving a unique identifier for the created text document.
The function returns doc augmented by the information parsed, as described by spec, from the XML document in elem$content. The arguments language and id serve as fallbacks: language is used if no corresponding metadata entry is found in elem$content, and id is used if no corresponding metadata entry is found in elem$content and elem$uri is NULL.
spec
A named list of lists, each containing two components. The constructed reader maps each list entry to the content or a metadatum of the text document, as specified by the entry's name. Valid names include content, to access the document's content, and character strings, which are mapped to metadata entries.
Each list entry must consist of two components: the first must be a string describing the type of the second component, and the second is the specification entry. Valid combinations are:
type = "node", spec = "XPathExpression"
The XPath (1.0) expression spec extracts information from an XML node.
type = "function", spec = function(doc) ...
The function spec is called, passing over the XML document (as delivered by read_xml from package xml2) as first argument.
type = "unevaluated", spec = "String"
The character vector spec is returned without modification.
doc
An (empty) document of some subclass of TextDocument.
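The three specification types described above can be combined in a single spec. The following is a minimal sketch, not a definitive recipe: the <item> layout, the XPath expressions, the sample XML string, and the name read_item are made up for illustration, and the sketch assumes that the reader accepts a character string in elem$content.

```r
library(tm)    # provides readXML(), PlainTextDocument(), content(), meta()
library(xml2)  # provides xml_text() and xml_find_first()

# Hypothetical specification for a made-up <item> document layout.
spec <- list(
  # "node": an XPath 1.0 expression evaluated against the document.
  content = list("node", "/item/description"),
  heading = list("node", "/item/title"),
  # "function": called with the parsed XML document as first argument.
  author  = list("function",
                 function(doc) xml_text(xml_find_first(doc, "/item/creator"))),
  # "unevaluated": stored as given, without modification.
  origin  = list("unevaluated", "Example feed")
)

# Build the reader; an empty PlainTextDocument serves as the template.
read_item <- readXML(spec = spec, doc = PlainTextDocument())

# Made-up input; elem is a named list whose content component holds the XML.
xml <- paste0("<item><title>Hello</title><creator>Jane</creator>",
              "<description>Some text.</description></item>")
doc <- read_item(elem = list(content = xml), language = "en", id = "item-1")

content(doc)           # text selected by /item/description
meta(doc, "heading")   # text selected by /item/title
meta(doc, "language")  # fallback argument, since spec has no language entry
```

Because spec contains neither a language nor an id entry, both values come from the fallback arguments passed to the reader.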
Formally this function is a function generator, i.e., it returns a function (which reads in a text document) with a well-defined signature, but can access passed over arguments (e.g., the specification) via lexical scoping.
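The function-generator pattern can be sketched in base R. This is an illustration of lexical scoping only, not the implementation used by tm; make_reader and the shape of its return value are hypothetical.

```r
# make_reader() is a hypothetical, simplified stand-in for a generator
# such as readXML(); it does no XML parsing at all.
make_reader <- function(spec) {
  # The returned function has the fixed reader signature. spec is not an
  # argument of that function; it is found in the enclosing environment.
  function(elem, language, id) {
    list(content = elem$content,
         language = language,
         id = id,
         fields = names(spec))
  }
}

reader <- make_reader(list(content = list("node", "/item/description")))

names(formals(reader))   # "elem" "language" "id"
res <- reader(list(content = "<item/>"), language = "en", id = "1")
res$fields               # "content"
```

The caller of reader only ever supplies elem, language, and id, which is why every reader can share one signature regardless of how it was configured.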
Reader for basic information on the reader infrastructure employed by package tm.
Vignette 'Extensions: How to Handle Custom File Formats', and XMLSource.