tm (version 0.3-1)

readXML: Read In an XML Document

Description

Returns a function which reads in an XML document. The structure of the XML document can be described with a so-called specification.

Usage

readXML(spec, doc, ...)

Arguments

spec
a named list of lists each containing two character vectors. The constructed reader will map each list entry to a slot or meta datum corresponding to the named list entry. Valid names include .Data<
doc
an (empty) document of some subclass of TextDocument
...
arguments for the generator function.

Value

A function with the signature elem, language, load, id:

  • elem: A list with the two named elements content and uri. The first element must hold the document to be read in, the second element must hold a call to extract this document. The call is evaluated upon a request for load on demand.
  • load: A logical value indicating whether the document corpus should be immediately loaded into memory.
  • language: A character vector giving the text's language.
  • id: A character vector representing a unique identification string for the returned text document.

The function returns doc augmented with the information parsed from the XML file as described by spec.
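
The load-on-demand role of the uri element can be pictured as a stored, unevaluated call. The following is a base R sketch under stated assumptions, not tm code: read_source and fetch_content are hypothetical helpers, and the XML string stands in for whatever source the call would actually read.

```r
# Hypothetical source reader; stands in for reading a file or URL.
read_source <- function() "<REUTERS NEWID=\"1\"></REUTERS>"

# Hypothetical element: no content yet, only a call that can produce it.
elem <- list(
  content = NULL,
  uri = quote(read_source())
)

# Hypothetical helper: evaluate the stored call only when loading is requested.
fetch_content <- function(elem, load) {
  if (load && is.null(elem$content)) {
    elem$content <- eval(elem$uri)
  }
  elem
}

lazy  <- fetch_content(elem, load = FALSE)  # the call is left unevaluated
eager <- fetch_content(elem, load = TRUE)   # the call is evaluated now
```

Storing the call with quote() rather than its result is what allows the evaluation to be postponed until the content is actually needed.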

Details

Formally, this function is a function generator: it returns a function (which reads in a text document) with a well-defined signature, and the returned function can still access the arguments passed to the generator (e.g., the specification) via lexical scoping.
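
The generator pattern itself can be illustrated without tm. The following is a minimal base R sketch, not tm's implementation: make_reader, its toy specification format (a named list of extractor functions), and the plain-list document are hypothetical stand-ins chosen for illustration.

```r
# Hypothetical stand-in for a reader generator; not part of tm.
make_reader <- function(spec, doc) {
  # The returned function has a fixed signature but still sees
  # `spec` and `doc` through lexical scoping.
  function(elem, language, load, id) {
    out <- doc
    for (name in names(spec)) {
      extract <- spec[[name]]
      # Each toy spec entry is a function applied to the raw content.
      out[[name]] <- extract(elem$content)
    }
    out$language <- language
    out$id <- id
    out
  }
}

toy_spec <- list(
  Heading = function(x) x[1],          # first line as the heading
  Length  = function(x) sum(nchar(x))  # total number of characters
)

reader <- make_reader(toy_spec, doc = list())
result <- reader(elem = list(content = c("Title", "Body text"), uri = NULL),
                 language = "en", load = TRUE, id = "doc-1")
str(result)
```

Neither spec nor doc appears in the signature of reader, yet both are available inside it because the inner function was created in the environment of the make_reader call.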

See Also

Vignette 'Extensions: How to Handle Custom File Formats'.

Use getReaders to list available reader functions.

Examples

readReut21578XML <- readXML(
  spec = list(Author = list("node", "/REUTERS/TEXT/AUTHOR"),
              DateTimeStamp = list("function", function(node)
                strptime(sapply(XML::getNodeSet(node, "/REUTERS/DATE"), XML::xmlValue),
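                         # Format string lost in extraction; assumed from
                         # Reuters-21578 dates such as 26-FEB-1987 15:01:01.79
                         format = "%d-%b-%Y %H:%M:%S",
                         tz = "GMT")),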
              Description = list("unevaluated", ""),
              Heading = list("node", "/REUTERS/TEXT/TITLE"),
              ID = list("attribute", "/REUTERS/@NEWID"),
              Origin = list("unevaluated", "Reuters-21578 XML"),
              Topics = list("node", "/REUTERS/TOPICS/D")),
  doc = new("Reuters21578Document"))