readXML: Read In an XML Document

Description

Returns a function which reads in an XML document. The structure of the XML document can be described with a so-called specification.

Usage

readXML(spec, doc, ...)

Arguments

spec

a named list of lists each containing two character vectors. The constructed reader will map each list entry to a slot or meta datum corresponding to the named list entry. Valid names include .Data.

doc

an (empty) document of some subclass of TextDocument

...

arguments for the generator function.

Value

A function with the signature elem, language, load, id:

elem

A list with the two named elements content and uri. The first element must hold the document to be read in, the second element must hold a call to extract this document. The call is evaluated upon a request for load on demand.

load

A logical value indicating whether the document corpus should be immediately loaded into memory.

language

A character vector giving the text's language.

id

A character vector representing a unique identification string for the returned text document.

The function returns doc augmented by the information parsed from the XML file as described by spec.

Details

Formally, this function is a function generator: it returns a function (which reads in a text document) with a well-defined signature, and the returned function can still access the arguments passed to the generator (e.g., the specification) via lexical scoping.
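
The generator pattern can be illustrated with a minimal base R sketch. The names make_reader, reader and fields are hypothetical and are not part of the package; the sketch only shows how a returned function keeps access to the generator's arguments, not how readXML is implemented.

```r
# Hypothetical generator (not part of tm): it captures `spec` and `doc`
# and returns a reader with a fixed signature.
make_reader <- function(spec, doc) {
  function(elem, language, id) {
    # `spec` and `doc` are not arguments of the reader; they are looked up
    # in the enclosing environment created by the make_reader() call.
    doc$content <- elem$content
    doc$language <- language
    doc$id <- id
    doc$fields <- names(spec)
    doc
  }
}

reader <- make_reader(spec = list(Heading = "/REUTERS/TEXT/TITLE"),
                      doc = list())
result <- reader(elem = list(content = "<REUTERS/>", uri = NULL),
                 language = "en", id = "1")
str(result)
```

Each call to make_reader() creates a fresh environment, so readers built from different specifications do not interfere with each other.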

Examples

readReut21578XML <- readXML(
  spec = list(Author = list("node", "/REUTERS/TEXT/AUTHOR"),
              DateTimeStamp = list("function", function(node)
                strptime(sapply(XML::getNodeSet(node, "/REUTERS/DATE"), XML::xmlValue),
                         # day-month-year time, e.g. "26-FEB-1987 15:01:01.79"
                         format = "%d-%B-%Y %H:%M:%S",
                         tz = "GMT")),
              Description = list("unevaluated", ""),
              Heading = list("node", "/REUTERS/TEXT/TITLE"),
              ID = list("attribute", "/REUTERS/@NEWID"),
              Origin = list("unevaluated", "Reuters-21578 XML"),
              Topics = list("node", "/REUTERS/TOPICS/D")),
  doc = new("Reuters21578Document"))
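
To see what XPath expressions like the ones in the specification above select, they can be tried directly with the XML package. This is only a sketch under stated assumptions: the XML snippet is a made-up, heavily shortened stand-in for a Reuters-21578 record, and the code does not show how the constructed reader evaluates "node" or "attribute" entries internally.

```r
library(XML)

# Made-up, shortened stand-in for a single Reuters-21578 record.
xml_text <- paste0(
  '<REUTERS NEWID="1">',
  '<TEXT><TITLE>BAHIA COCOA REVIEW</TITLE></TEXT>',
  '</REUTERS>')

node <- xmlParse(xml_text, asText = TRUE)

# Text content of the nodes matched by an XPath expression.
heading <- sapply(getNodeSet(node, "/REUTERS/TEXT/TITLE"), xmlValue)

# Value of the NEWID attribute on the root element.
newid <- xpathSApply(node, "/REUTERS", xmlGetAttr, "NEWID")

print(heading)
print(newid)
```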