html_element: Select elements from an HTML document

Description

html_element() and html_elements() find HTML element using CSS selectors or XPath expressions. CSS selectors are particularly useful in conjunction with https://selectorgadget.com/, which makes it very easy to discover the selector you need.

Usage

html_element(x, css, xpath)
html_elements(x, css, xpath)

Value

html_element() returns a nodeset the same length as the input. html_elements() flattens the output so there's no direct way to map the output to the input.

Arguments

x: Either a document, a node set or a single node.
css, xpath: Elements to select. Supply one of css or xpath depending on whether you want to use a CSS selector or XPath 1.0 expression.

CSS selector support

CSS selectors are translated to XPath selectors by the selectr package, which is a port of the python cssselect library, https://pythonhosted.org/cssselect/.

It implements the majority of CSS3 selectors, as described in https://www.w3.org/TR/2011/REC-css3-selectors-20110929/. The exceptions are listed below:

Pseudo selectors that require interactivity are ignored: :hover, :active, :focus, :target, :visited.
The following pseudo classes don't work with the wild card element, *: *:first-of-type, *:last-of-type, *:nth-of-type, *:nth-last-of-type, *:only-of-type
It supports :contains(text)
You can use !=, [foo!=bar] is the same as :not([foo=bar])
:not() accepts a sequence of simple selectors, not just a single simple selector.

Examples

Run this code

html <- minimal_html("
  This is a heading
  This is a paragraph
  This is an important paragraph
")

html %>% html_element("h1")
html %>% html_elements("p")
html %>% html_elements(".important")
html %>% html_elements("#first")

# html_element() vs html_elements() --------------------------------------
html <- minimal_html("
  
    C-3PO is a droid that weighs 167 kg
    R2-D2 is a droid that weighs 96 kg
    Yoda weighs 66 kg
    R4-P17 is a droid
  
")
li <- html %>% html_elements("li")

# When applied to a node set, html_elements() returns all matching elements
# beneath any of the inputs, flattening results into a new node set.
li %>% html_elements("i")

# When applied to a node set, html_element() always returns a vector the
# same length as the input, using a "missing" element where needed.
li %>% html_element("i")
# and html_text() and html_attr() will return NA
li %>% html_element("i") %>% html_text2()
li %>% html_element("span") %>% html_attr("class")

Run the code above in your browser using DataLab