xscrape: Extract information from webpages to a data.frame, using XPath or CSS queries.

Description

This function transforms an HTML/XML page (or list of pages) into a data.frame, extracting the nodes specified by XPath queries or CSS selectors.

Usage

xscrape(pages, 
        col.xpath = ".", row.xpath = "/html", 
        col.css = NULL, row.css = NULL, 
        collapse = " | ", encoding = NULL, 
        page.name = TRUE, nice.text = TRUE, 
        parallel = 0, 
        engine = c("auto", "XML", "xml2"))

Arguments

pages

an object of class XMLInternalDocument or xml_document (as returned by XML::htmlParse, xml2::read_html or rvest::read_html), or a list of such objects. Alternatively, a character vector containing the URLs or local paths of the webpages to be parsed. These are the webpages that information is extracted from. If the provided list or vector is named, its names are used to indicate data provenance when page.name is TRUE.

col.xpath

a character vector of XPath queries used for creating the result columns. If the vector is named, these names are given to the columns. The default "." takes the text from the whole of each page or intermediary node (specified by row.xpath or row.css).

row.xpath

a character string, containing an XPath query for creating the result rows. The result of this query (on each page) becomes a row in the resulting data.frame. If not specified (default), the intermediary nodes are whole html pages, so that each page becomes a row in the result.

col.css

same as col.xpath, but with CSS selectors instead of XPath queries. If col.xpath was also given, the XPath columns will be placed before the CSS columns in the result.

row.css

same as row.xpath, but with a CSS selector instead of an XPath query. If given, this will be used instead of row.xpath.

collapse

a character string, containing the separator that will be used in case a col.xpath query yields multiple results within a given intermediary node. The default is " | ".

encoding

a character string (e.g. "UTF-8" or "ISO-8859-1"), containing the encoding parameter that will be used by htmlParse or read_html if pages is a vector of URLs or local file names.

page.name

a logical. If TRUE, the result will contain a column indicating the name of the page each row was extracted from. If pages has no names, the pages are numbered from 1 to length(pages).

nice.text

a logical. If TRUE (only possible with engine xml2), the rvest::html_text2 function is used to extract text into the result, often making the text much cleaner. If FALSE, the function runs faster, but the text might be less clean.

parallel

a numeric, indicating the number of cores to use for parallel computation. The default 0 takes all available cores. The parallelization is done on the pages if their number is greater than the number of provided cores, otherwise it is done on the intermediary nodes. Note that parallelization relies on parallel::mclapply, and is thus not supported on Windows systems.

engine

a character string, indicating the engine to use for data extraction: either "XML", "xml2", or "auto" (default). The default adapts the engine to the type of pages, or uses "xml2" if pages are URLs or file names. Note: CSS selectors and nice.text are only available with "xml2", but XPath queries and the "XML" engine tend to be much faster.
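
As a rough illustration of what CSS selectors and nice.text refer to, the sketch below uses rvest directly rather than xscrape. The toy HTML and the variable names are assumptions made for this example, and the comments on how each call relates to nice.text are an interpretation of the argument descriptions above, not a statement about the internals of xscrape.

```r
library(rvest)

# Toy page (hypothetical content, for illustration only)
page <- read_html('<html><body>
  <ul>
    <li class="item"><p>Hello<br>world</p></li>
    <li class="item"><p>Second   entry</p></li>
  </ul>
</body></html>')

# A CSS selector playing the role of row.css: one node per future row
rows <- html_elements(page, css = "li.item")

# A CSS selector playing the role of one col.css entry, evaluated inside each row
cells <- html_element(rows, css = "p")

# Plain text extraction: fast, but <br> and repeated spaces are not cleaned up
raw_text <- html_text(cells)

# Browser-like text extraction, the function that nice.text = TRUE is documented to use
nice_text <- html_text2(cells)

print(raw_text)
print(nice_text)
```

Comparing the two printed vectors shows the trade-off described for nice.text: html_text2 does more work per node in exchange for cleaner text.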

Value

A data.frame, where each row corresponds to an intermediary node (either a full page or an XML node within a page, specified by row.xpath or row.css), and each column corresponds to the text of a col.xpath or col.css query.

Details

If a col.xpath or col.css query designates a full node, only its text is extracted. If it designates an attribute (e.g. ends with '/@href' for weblinks), only the attribute's value is extracted.

If a col.xpath or col.css query matches no elements in a page, the returned value is NA. If it matches multiple elements, they are concatenated into a single character string, separated by collapse.
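
The extraction rules above can be sketched with xml2 alone. This is not the implementation of xscrape, only an illustration under stated assumptions: the toy HTML, the helper extract_cell and the column names are all hypothetical.

```r
library(xml2)

# Toy page: two "row" nodes; the second has no <a> and two <span> elements
page <- read_html('<html><body>
  <div class="item"><a href="http://example.org/a">First</a><span>x</span></div>
  <div class="item"><span>y</span><span>z</span></div>
</body></html>')

# Intermediary nodes, playing the role of row.xpath
rows <- xml_find_all(page, "//div[@class='item']")

# Hypothetical helper: evaluate one column query inside one row node.
# No match gives NA; several matches are pasted together with `collapse`.
extract_cell <- function(node, xpath, collapse = " | ") {
  hits <- xml_find_all(node, xpath)
  if (length(hits) == 0) return(NA_character_)
  # xml_text() returns the text of element nodes and the value of attribute nodes
  paste(xml_text(hits), collapse = collapse)
}

# Named queries, playing the role of col.xpath; names become column names
cols <- c(title = "./a", link = "./a/@href", tags = "./span")

result <- as.data.frame(
  lapply(cols, function(q) {
    vapply(seq_along(rows),
           function(i) extract_cell(rows[[i]], q),
           character(1))
  }),
  stringsAsFactors = FALSE
)

print(result)
```

In this sketch the second row has NA in the title and link columns, since it contains no a element, and its two span elements are joined with the separator in the tags column.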

Examples

# NOT RUN {
## Extract all external links and their titles from a wikipedia page
data(wiki)
wiki.parse <- XML::htmlParse(wiki)
links <- xscrape(wiki.parse, 
                 row.xpath= "//a[starts-with(./@href, 'http')]", 
                 col.xpath= c(title= ".", link= "./@href"), 
                 parallel = 1)

# }
# NOT RUN {
## Convert results from a search for 'R' on duckduckgo.com
## First download the search page
duck <- XML::htmlParse("http://duckduckgo.com/html/?q=R")
## Then run xscrape on the downloaded and parsed page
results <- xscrape(duck, 
                   row.xpath= "//div[contains(@class, 'result__body')]",
                   col.xpath= c(title= "./h2", 
                                snippet= ".//*[@class='result__snippet']", 
                                url= ".//a[@class='result__url']/@href"))
# }
# NOT RUN {
## Convert results from a search for 'R' and 'Julia' on duckduckgo.com
## Directly provide the URLs to xscrape
results <- xscrape(c("http://duckduckgo.com/html/?q=R", 
                     "http://duckduckgo.com/html/?q=julia"), 
                   row.xpath= "//div[contains(@class, 'result__body')]",
                   col.xpath= c(title= "./h2", 
                                snippet= ".//*[@class='result__snippet']", 
                                url= ".//a[@class='result__url']/@href"))
# }