Use htmlTreeParse
when the content is known
to be (potentially malformed) HTML.
xmlTreeParse(file, ignoreBlanks=TRUE, handlers=NULL, replaceEntities=FALSE,
             asText=FALSE, trim=TRUE, validate=FALSE, getDTD=TRUE,
             isURL=FALSE, asTree = FALSE, addAttributeNamespaces = FALSE,
             useInternalNodes = FALSE, isSchema = FALSE,
             fullNamespaceInfo = FALSE, encoding = character(),
             useDotNames = length(grep("^\\.", names(handlers))) > 0)
htmlTreeParse(file, ignoreBlanks = TRUE, handlers = NULL,
              replaceEntities = FALSE, asText = FALSE, trim = TRUE,
              isURL = FALSE, asTree = FALSE,
              useInternalNodes = FALSE, encoding = character(),
              useDotNames = length(grep("^\\.", names(handlers))) > 0)
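As a minimal sketch of the two entry points, assuming the XML package is installed, the following parses a small XML string and a deliberately sloppy HTML fragment. Both strings are made up for this illustration and are not part of the package.

```r
library(XML)

# Hypothetical inputs, invented for this sketch.
xml_txt  <- "<greeting lang='en'>hello</greeting>"
html_txt <- "<p>First paragraph<p>Second paragraph with <b>bold</b> text"

# asText = TRUE tells the parser that the first argument is the content
# itself rather than the name of a file.
xdoc <- xmlTreeParse(xml_txt, asText = TRUE)
hdoc <- htmlTreeParse(html_txt, asText = TRUE)

# The top-level element of each parsed document.
xml_root_name  <- xmlName(xmlRoot(xdoc))
html_root_name <- xmlName(xmlRoot(hdoc))
print(xml_root_name)
print(html_root_name)
```

The HTML parser is tolerant of the unclosed tags, which is the reason to prefer htmlTreeParse for such input.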
See isURL.
Additionally, the file can be compressed (gzip)
and is read directly without the user having to decompress it.
isURL indicates whether the file
argument refers to a URL
(accessible via ftp or http) or a regular file on the system.
If asText
is TRUE, this should not be specified.
The function attempts to determine whether the
data source is a URL.
asTree applies when a value is passed for the handlers
argument and is used then to determine
whether the DOM tree should be returned or the handlers
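object.
If addAttributeNamespaces is TRUE, an attribute such as
xsi:type="xsd:string"
is reported with the name
xsi:type.
If useInternalNodes is TRUE, the handler functions are called with objects of class
XMLInternalNode
rather than XMLNode.
This should make things faster as we do not convert the
contents of the internal nodes to R objects.
isSchema indicates whether the document is an XML schema (TRUE)
and should be parsed as such using
the built-in schema parser in libxml.
For fullNamespaceInfo, FALSE is
currently the default as that was the original way the
package behaved. However, using
TRUE is more informative.
If useDotNames is
FALSE, then the older format
text, comment, startElement is used for the handler names.
The handlers
argument is used similarly
to those specified in xmlEventParse.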
When an XML tag (element) is processed,
we look for a function in this collection
with the same name as the tag's name.
If this is not found, we look for one named
startElement. If this is not found, we use the default
built-in converter.
The same works for comments, entity references, cdata, processing instructions,
etc.
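The lookup order described above can be sketched as follows, assuming the XML package is installed. The input string and the helper make_handlers() are hypothetical and exist only for this illustration; the sketch records which handler each element reaches.

```r
library(XML)

# Hypothetical input, invented for this sketch.
txt <- "<doc><a>1</a><b>2</b><a>3</a></doc>"

# make_handlers() is an illustrative helper, not part of the XML package.
make_handlers <- function() {
  specific <- character(0)
  generic  <- character(0)
  list(
    # Matched by tag name: reached for <a> elements.
    a = function(node, ...) {
      specific <<- c(specific, xmlName(node))
      node
    },
    # Fallback: reached for elements with no handler named after their tag.
    startElement = function(node, ...) {
      generic <<- c(generic, xmlName(node))
      node
    },
    getSpecific = function() specific,
    getGeneric  = function() generic
  )
}

h <- make_handlers()
invisible(xmlTreeParse(txt, asText = TRUE, handlers = h))

print(h$getSpecific())
print(h$getGeneric())
```

Keeping the state inside a closure mirrors the style of the package's own examples and avoids global variables.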
The default entries should be named
comment, startElement, externalEntity, processingInstruction,
text, cdata and namespace.
All but the last should take the XMLNode as their first argument.
In the future, other information may be passed via ...,
for example, the depth in the tree, etc.
Specifically, the second argument will be the parent node into which they
are being added, but this is not currently implemented,
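so it should have a default value (NULL).
The namespace
function is called with a single argument which
is an object of class XMLNameSpace. This contains
the namespace identifier (id), the URI identifying the namespace (uri)
and a logical value indicating whether the definition is local to the
document being parsed (local).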
One should note that the namespace
handler is called before the
node in which the namespace definition occurs and its children are
processed. This is different than the other handlers which are called
after the child nodes have been processed.
Each of these functions can return arbitrary values that are then
entered into the tree in place of the default node passed to the
function as the first argument. This allows the caller to generate
the nodes of the resulting document tree exactly as they wish. If the
function returns NULL, the node is dropped from the resulting
tree. This is a convenient way to discard nodes having processed their
contents.
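A small sketch of dropping nodes by returning NULL, assuming the XML package is installed. The input string is made up for this illustration; the handler names follow the older, non-dotted convention used in this page.

```r
library(XML)

# Hypothetical input, invented for this sketch.
txt <- "<doc><!-- internal note --><item>alpha</item><item>beta</item></doc>"

handlers <- list(
  # Returning NULL removes every comment node from the resulting tree.
  comment = function(x, ...) NULL
)

# asTree = TRUE asks for the tree back rather than the handlers object.
doc  <- xmlTreeParse(txt, asText = TRUE, handlers = handlers, asTree = TRUE)
root <- xmlRoot(doc)

remaining <- names(xmlChildren(root))
print(remaining)
```

A handler could equally return a modified copy of its node, in which case that copy takes the node's place in the tree.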
The document object has three fields:
file, version and children.
file
version
children
  a list of nodes, each of class XMLNode.
These are made up of 4 fields.
name
attributes
children
value
Specializations of XMLNode, such as
XMLComment, XMLProcessingInstruction and
XMLEntityRef, are used.
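The node structure can be explored with the package's accessor functions rather than by reaching into the fields directly. This is a minimal sketch, assuming the XML package is installed; the input string is hypothetical.

```r
library(XML)

# Hypothetical input, invented for this sketch.
txt <- "<doc version='1'><!-- a comment --><item>alpha</item></doc>"

doc  <- xmlTreeParse(txt, asText = TRUE)
root <- xmlRoot(doc)

root_name  <- xmlName(root)               # name of the element
root_attrs <- xmlAttrs(root)              # named character vector of attributes
kids       <- xmlChildren(root)           # list of sub-nodes
item_text  <- xmlValue(root[["item"]])    # text content of the <item> child

print(root_name)
print(root_attrs)
# First class of each child shows which specialization of XMLNode was used.
print(sapply(kids, function(k) class(k)[1]))
print(item_text)
```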
If the value of the argument getDTD is TRUE, the return value is a
list of length 2. The first element is the document as described
above. The second element is a list containing the external and
internal DTDs. Each of these contains 2 lists: one for elements
and another for entities. See parseDTD.
free, for releasing the memory when
an XMLInternalDocument object is returned.
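A short sketch of working with an internal document and releasing it afterwards, assuming the XML package is installed. The input string is made up for this illustration.

```r
library(XML)

# Hypothetical input, invented for this sketch.
txt <- "<data><variable>mpg</variable><variable>cyl</variable></data>"

# With useInternalNodes = TRUE the document stays in libxml's own
# representation instead of being converted to R-level nodes.
doc <- xmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)
is_internal <- inherits(doc, "XMLInternalDocument")

# XPath queries are available on internal documents.
vars <- sapply(getNodeSet(doc, "//variable"), xmlValue)
print(vars)

# Release the memory once the document is no longer needed;
# doc must not be used after this call.
free(doc)
```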
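# fileName is assumed to name an XML file; here, one shipped with the package.
fileName <- system.file("exampleData", "test.xml", package="XML")
xmlTreeParse(fileName)
# parse the document, discarding comments.
xmlTreeParse(fileName, handlers=list("comment"=function(x,...){NULL}), asTree = TRUE)
# print the entities
invisible(xmlTreeParse(fileName,
            handlers=list(entity=function(x) {
                            cat("In entity", x$name, x$value, "\n")
                            x
                          })))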
# Parse some XML text.
# Read the text from the file
xmlText <- paste(readLines(fileName), "\n", collapse="")
print(xmlText)
xmlTreeParse(xmlText, asText=TRUE)
# with version 1.4.2 we can pass the contents of an XML
# stream without pasting them.
xmlTreeParse(readLines(fileName), asText=TRUE)
# Read a MathML document and convert each node
# so that the primary class is
#
# In this example, we extract _just_ the names of the
# variables in the mtcars.xml file.
# The names are the contents of the <variable> tags.
# Note that we define the closure function in the call and then
# create an instance of it by calling it directly as
# (function() {...})()
# Note that we can get the names by parsing
# the entire document in the usual manner and then executing
# xmlSApply(xmlRoot(doc)[[1]], function(x) xmlValue(x[[1]]))
# which is simpler but is more costly in terms of memory.
fileName <- system.file("exampleData", "mtcars.xml", package="XML")
doc <- xmlTreeParse(fileName, handlers = (function() {
         vars <- character(0)
         list(variable = function(x, attrs) {
                vars <<- c(vars, xmlValue(x[[1]]))
                NULL
              },
              startElement = function(x, attr) {
                NULL
              },
              names = function() {
                vars
              })
       })())
# Here we just print the variable names to the console
# with a special handler.
doc <- xmlTreeParse(fileName, handlers = list(
         variable = function(x, attrs) {
           print(xmlValue(x[[1]]))
           TRUE
         }), asTree=TRUE)
# This should raise an error.
try(xmlTreeParse(
      system.file("exampleData", "TestInvalid.xml", package="XML"),
      validate=TRUE))
# Parse an XML document directly from a URL.
# Requires Internet access.
# The first argument is a URL, not XML text, so asText is not set here.
xmlTreeParse("http://www.omegahat.org/Scripts/Data/mtcars.xml", isURL=TRUE)
counter = function() {
  counts = integer(0)
  list(startElement = function(node) {
         name = xmlName(node)
         if(name %in% names(counts))
           counts[name] <<- counts[name] + 1
         else
           counts[name] <<- 1
       },
       counts = function() counts)
}
h = counter()
xmlTreeParse(system.file("exampleData", "mtcars.xml", package="XML"),
             handlers = h, useInternalNodes = TRUE)
h$counts()
f = system.file("examples", "index.html", package = "XML")
htmlTreeParse(readLines(f), asText = TRUE)
htmlTreeParse(readLines(f))
# Same as
htmlTreeParse(paste(readLines(f), collapse = "\n"), asText = TRUE)
getLinks = function() {
  links = character()
  list(a = function(node, ...) {
         links <<- c(links, xmlGetAttr(node, "href"))
         node
       },
       links = function() links)
}
h1 = getLinks()
htmlTreeParse(system.file("examples", "index.html", package = "XML"),
              handlers = h1)
h1$links()
h2 = getLinks()
htmlTreeParse(system.file("examples", "index.html", package = "XML"),
              handlers = h2, useInternalNodes = TRUE)
all(h1$links() == h2$links())
# Using flat trees
tt = xmlHashTree()
f = system.file("exampleData", "mtcars.xml", package="XML")
xmlTreeParse(f, handlers = tt[[".addNode"]])
xmlRoot(tt)
doc = xmlTreeParse(f, useInternalNodes = TRUE)
sapply(getNodeSet(doc, "//variable"), xmlValue)
free(doc)
# character set encoding for HTML
f = system.file("exampleData", "9003.html", package = "XML")
# we specify the encoding
d = htmlTreeParse(f, encoding = "UTF-8")
# get a different result if we do not specify any encoding
d.no = htmlTreeParse(f)
# document with its encoding in the HEAD of the document.
d.self = htmlTreeParse(system.file("exampleData", "9003-en.html", package = "XML"))
# XXX want to do a test here to see the similarities between d and
# d.self and differences between d.no