scrapeR (version 0.1.8)

scrapeR_in_batches: Batch Web Page Content Scraper

Description

The scrapeR_in_batches function processes a dataframe in batches, scraping web content from URLs in a specified column and writing the scraped content to a column in df.

Usage

scrapeR_in_batches(df, url_column, extract_contacts)

Value

The scraped text is written to a content column in df. If extract_contacts is TRUE, extracted emails and phone numbers are also written to email and phone_number columns.

Arguments

df

A dataframe containing the URLs to be scraped.

url_column

The name of the column in df that contains the URLs.

extract_contacts

A logical flag. If TRUE, the scraped content is searched for emails and phone numbers. Defaults to FALSE.
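
As a rough illustration of what searching scraped text for contacts can look like, here is a minimal sketch using base R regular expressions. The helper extract_contacts_sketch and both patterns are assumptions made for this example; they are not taken from the scrapeR source, and the phone pattern only covers simple ten-digit formats.

```r
# Hypothetical helper, not part of scrapeR: pulls emails and phone-like
# strings out of a block of text with base R regular expressions.
extract_contacts_sketch <- function(text) {
  # Assumed patterns, chosen for illustration only
  email_pattern <- "[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"
  phone_pattern <- "\\(?[0-9]{3}\\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}"
  list(
    email = unique(regmatches(text, gregexpr(email_pattern, text))[[1]]),
    phone_number = unique(regmatches(text, gregexpr(phone_pattern, text))[[1]])
  )
}

found <- extract_contacts_sketch("Contact info@example.com or call 555-123-4567.")
```

The result is a list with email and phone_number elements, mirroring the column names described under Value.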

Author

Mathieu Dubeau Ph.D

Details

This function divides the input dataframe into batches of a fixed size (default: 100). For each batch, it extracts the combined text content from the web pages at the URLs in the specified column and adds the results to df. The function pauses between batches to throttle requests, reducing the load on the servers being scraped.
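
The batching and throttling described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the package's implementation: process_in_batches, fetch_fun, batch_size and pause_seconds are hypothetical names, and a mock fetcher replaces real web requests.

```r
# Hypothetical sketch of batch processing with a pause between batches.
# fetch_fun stands in for the real scraper; batch_size and pause_seconds
# are illustrative parameters, not arguments of scrapeR_in_batches.
process_in_batches <- function(df, url_column, fetch_fun,
                               batch_size = 100, pause_seconds = 1) {
  n <- nrow(df)
  df$content <- rep(NA_character_, n)
  if (n == 0) return(df)
  for (start in seq(1, n, by = batch_size)) {
    # Row indices of the current batch (the last batch may be shorter)
    rows <- start:min(start + batch_size - 1, n)
    df$content[rows] <- vapply(df[[url_column]][rows], fetch_fun,
                               character(1), USE.NAMES = FALSE)
    # Throttle: wait before the next batch, but not after the last one
    if (max(rows) < n) Sys.sleep(pause_seconds)
  }
  df
}

# Mock fetcher so the sketch runs without network access
mock_fetch <- function(url) paste("Scraped content from", url)
urls <- data.frame(url = c("http://site1.com", "http://site2.com", "http://site3.com"),
                   stringsAsFactors = FALSE)
result <- process_in_batches(urls, "url", mock_fetch, batch_size = 2, pause_seconds = 0)
```

With batch_size = 2, the three URLs are handled in two batches, and result gains a content column with one entry per row.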

References

Refer to rvest package documentation and httr package documentation for underlying web scraping methods.

Examples

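  # Mock scraper that returns placeholder text instead of making a web request.
  # It is defined for illustration only and is not called below.
  mock_scrapeR <- function(url) {
    return(paste("Scraped content from", url))
  }

  # Sample dataframe with one URL per row
  df <- data.frame(url = c("http://site1.com", "http://site2.com"), stringsAsFactors = FALSE)

  # Wrapped in if (FALSE) so the example does not make live web requests when run
  if (FALSE) {
    scrapeR_in_batches(df, url_column = "url", extract_contacts = FALSE)
  }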