read_html_live: Live web scraping (with chromote)

Description

read_html() operates on the HTML source code downloaded from the server. This works for most websites but can fail if the site uses javascript to generate the HTML. read_html_live() provides an alternative interface that runs a live web browser (Chrome) in the background. This allows you to access elements of the HTML page that are generated dynamically by javascript and to interact with the live page by clicking on buttons or typing in forms.

Behind the scenes, this function uses the chromote package, which requires that you have a copy of Google Chrome installed on your machine.

Usage

read_html_live(url)

Value

read_html_live() returns an R6 LiveHTML object. You can interact with this object using the usual rvest functions, or call its methods, like $click(), $scroll_to(), and $type() to interact with the live page like a human would.

Arguments

url: Website url to read from.

Examples

Run this code

if (FALSE) {
# When we retrieve the raw HTML for this site, it doesn't contain the
# data we're interested in:
static <- read_html("https://www.forbes.com/top-colleges/")
static %>% html_elements(".TopColleges2023_tableRow__BYOSU")

# Instead, we need to run the site in a real web browser, causing it to
# download a JSON file and then dynamically generate the html:

sess <- read_html_live("https://www.forbes.com/top-colleges/")
sess$view()
rows <- sess %>% html_elements(".TopColleges2023_tableRow__BYOSU")
rows %>% html_element(".TopColleges2023_organizationName__J1lEV") %>% html_text()
rows %>% html_element(".grant-aid") %>% html_text()
}

Run the code above in your browser using DataLab