html_nodes: Select nodes from an HTML document

Description

More easily extract pieces out of HTML documents using XPath and CSS selectors. CSS selectors are particularly useful in conjunction with http://selectorgadget.com/, which makes it easy to find exactly which selector you should be using. If you haven't used CSS selectors before, work your way through the fun tutorial at http://flukeout.github.io/.

Usage

html_nodes(x, css, xpath)
html_node(x, css, xpath)

Arguments

x

Either a document, a node set, or a single node.

css, xpath

Nodes to select. Supply one of css or xpath, depending on whether you want to use a CSS or an XPath selector.
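
As a minimal sketch of the two arguments, the snippet below selects the same nodes once with css and once with xpath. The inline HTML string and the variable names are made up for illustration; they are not part of the package.

```r
library(rvest)

# Hypothetical inline document, used only for illustration
doc <- read_html("<div><p>one</p><p>two</p><span>three</span></div>")

# Select every <p> element with a CSS selector ...
by_css <- html_nodes(doc, css = "p")

# ... and the same elements with an XPath expression
by_xpath <- html_nodes(doc, xpath = "//p")

html_text(by_css)
html_text(by_xpath)
```

Passing the selector by name, as here, makes it explicit which kind of selector is meant; an unnamed second argument is treated as css.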

html_node vs html_nodes

html_node is like [[: it always extracts exactly one element. When given a list of nodes, html_node will always return a list of the same length; the result of html_nodes might be longer or shorter.
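
The difference in lengths can be sketched on a small inline document. The HTML below is hypothetical, and every list item is given at least one match so the example does not depend on how a particular rvest version handles items without a match.

```r
library(rvest)

# Hypothetical inline document: two list items, three <b> elements in total
doc <- read_html("<ul><li><b>a</b><b>b</b></li><li><b>c</b></li></ul>")

items <- html_nodes(doc, "li")

# html_nodes() gathers every match from every item into one node set
all_bold <- html_nodes(items, "b")
length(all_bold)

# html_node() keeps one result per item: the first match in each
first_bold <- html_node(items, "b")
length(first_bold)
html_text(first_bold)
```

Here html_nodes() returns more nodes than there are items, while html_node() returns exactly one per item.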

CSS selector support

CSS selectors are translated to XPath selectors by the selectr package, which is a port of the Python cssselect library, https://pythonhosted.org/cssselect/.

It implements the majority of CSS3 selectors, as described in http://www.w3.org/TR/2011/REC-css3-selectors-20110929/. The exceptions are listed below:

Pseudo selectors that require interactivity are ignored: :hover, :active, :focus, :target, :visited
The following pseudo classes don't work with the wild card element, *: *:first-of-type, *:last-of-type, *:nth-of-type, *:nth-last-of-type, *:only-of-type
It supports :contains(text)
You can use !=, [foo!=bar] is the same as :not([foo=bar])
:not() accepts a sequence of simple selectors, not just a single simple selector.
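
A rough sketch of the extensions listed above, run against a hypothetical inline document. The class names and text are invented for illustration, and the results depend on the installed selectr version behaving as described in this section.

```r
library(rvest)

# Hypothetical inline document with three paragraphs
doc <- read_html(
  '<div><p class="a">alpha</p><p class="b">beta</p><p>gamma</p></div>'
)

# != is an extension: [class!=a] should select the same nodes as :not([class=a])
not_equal <- html_nodes(doc, "p[class!=a]")
negated   <- html_nodes(doc, "p:not([class=a])")
html_text(not_equal)
html_text(negated)

# :contains(text) matches elements whose text contains the given string
with_beta <- html_nodes(doc, "p:contains(beta)")
html_text(with_beta)

# The XPath that selectr generates for a selector can be inspected directly
selectr::css_to_xpath("p:not([class=a])")
```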

Examples

# CSS selectors ----------------------------------------------
library(rvest)
ateam <- read_html("http://www.boxofficemojo.com/movies/?id=ateam.htm")
html_nodes(ateam, "center")
html_nodes(ateam, "center font")
html_nodes(ateam, "center font b")

# But html_node is best used in conjunction with %>% from magrittr
# You can chain subsetting:
ateam %>% html_nodes("center") %>% html_nodes("td")
ateam %>% html_nodes("center") %>% html_nodes("font")

# When applied to a list of nodes, html_nodes() collapses output
# html_node() throws an error
td <- ateam %>% html_nodes("center") %>% html_nodes("td")
td %>% html_nodes("font")
td %>% html_node("font")

# To pick out an element at specified position, use magrittr::extract2
# which is an alias for [[
library(magrittr)
ateam %>% html_nodes("table") %>% extract2(1) %>% html_nodes("img")
ateam %>% html_nodes("table") %>% `[[`(1) %>% html_nodes("img")

# Find all images contained in the first two tables
ateam %>% html_nodes("table") %>% `[`(1:2) %>% html_nodes("img")
ateam %>% html_nodes("table") %>% extract(1:2) %>% html_nodes("img")

# XPath selectors ---------------------------------------------
# Chaining with XPath is a little trickier - you may need to vary
# the prefix you're using - // always selects from the root node
# regardless of where you currently are in the doc
ateam %>%
  html_nodes(xpath = "//center//font//b") %>%
  html_nodes(xpath = "//b")
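
The caveat about // can be sketched on a small inline document. The HTML and the id value below are hypothetical; the point is that a leading . makes the expression relative to the current node, whereas // on its own searches the whole document again.

```r
library(rvest)

# Hypothetical inline document: one <b> inside the box, one outside it
doc <- read_html("<div id='box'><b>inside</b></div><b>outside</b>")

box <- html_nodes(doc, xpath = "//div[@id='box']")

# // starts from the root node, so it also finds the <b> outside the box
from_root <- html_nodes(box, xpath = "//b")
html_text(from_root)

# .// starts from the current node, so only descendants of the box match
from_here <- html_nodes(box, xpath = ".//b")
html_text(from_here)
```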
