getHTMLLinks: Get links or names of external files in HTML document
Description
These functions retrieve either the links within an HTML
document or the names of the external files that the
document references.
The external files include images, JavaScript and CSS documents.
getHTMLLinks returns a character vector of the links.
getHTMLExternalFiles returns a character vector of the external file names or, if asNodes is TRUE, the nodes that reference them.
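To illustrate the difference between the two functions, here is a minimal sketch that avoids network access by parsing a small in-memory document. The HTML content and the file names in it (page.html, logo.png, style.css) are made up for illustration; htmlParse with asText = TRUE is used to build the parsed document that both functions accept.

```r
library(XML)

# A small, made-up HTML page: one hyperlink plus an image and a stylesheet.
html <- '<html><head>
  <link rel="stylesheet" href="style.css">
</head><body>
  <a href="page.html">A page</a>
  <img src="logo.png">
</body></html>'

# Parse the string itself rather than treating it as a file name or URL.
doc <- htmlParse(html, asText = TRUE)

# Hyperlinks (the href values of <a> elements).
links <- getHTMLLinks(doc)

# Files the page pulls in, such as images and stylesheets.
files <- getHTMLExternalFiles(doc)

print(links)
print(files)
```

With this input, links should contain the hyperlink target and files should contain the image and stylesheet names, but not the other way around.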
Arguments
doc
the HTML document as a URL, local file name, parsed
document or an XML/HTML node
externalOnly
a logical value that indicates whether we should
only return links to external documents and not references to
internal anchors/nodes within this document, i.e. those of the
form #foo.
xpQuery
a vector of XPath expressions that match the elements of interest
baseURL
the URL of the container document. This is used
to resolve relative references/links.
relative
a logical value indicating whether to leave the
references as relative to the base URL or to expand them to their full paths.
asNodes
a logical value that indicates whether we want the actual
HTML/XML nodes in the document that reference external documents
or just the names of the external documents.
recursive
a logical value that controls whether we recursively
process the external documents we find in the top-level document,
examining them in turn for their own external files.
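The effect of externalOnly and xpQuery can be sketched on an in-memory document. This is an illustrative example rather than part of the package's own examples: the HTML content, the class name "ext" and the file name app.js are hypothetical, and the XPath expressions are ones chosen here, not necessarily the functions' defaults.

```r
library(XML)

# Made-up page with an internal anchor, two ordinary links and a script.
html <- '<html><head>
  <script src="app.js"></script>
</head><body>
  <a href="#top">Back to top</a>
  <a href="page.html">A page</a>
  <a class="ext" href="https://www.omegahat.net/RSXML">RSXML</a>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

# externalOnly = TRUE: internal anchors of the form #foo are left out.
external <- getHTMLLinks(doc, externalOnly = TRUE)

# externalOnly = FALSE: internal anchors are returned as well.
all_links <- getHTMLLinks(doc, externalOnly = FALSE)

# A custom XPath query restricting the result to links with class "ext".
ext_class <- getHTMLLinks(doc, xpQuery = "//a[@class='ext']/@href")

# A custom XPath query selecting only the src attribute of script elements.
scripts <- getHTMLExternalFiles(doc, xpQuery = "//script/@src")

print(external)
print(all_links)
print(ext_class)
print(scripts)
```

Supplying xpQuery explicitly is a simple way to narrow either function to a particular kind of element or attribute when the full set of references is not needed.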
Examples
library(XML)
# The site is flaky, so each call is wrapped in try().
try(getHTMLLinks("https://www.omegahat.net"))
try(getHTMLLinks("https://www.omegahat.net/RSXML"))
try(unique(getHTMLExternalFiles("https://www.omegahat.net")))