######### Crawl, index, and store all pages of a website using 4 cores and 4 parallel requests
#
Rcrawler(Website ="http://glofile.com/", no_cores = 4, no_conn = 4)
######### Crawl and index the website using 8 cores and 8 parallel requests, respecting
# robots.txt rules and using a Mozilla string as the user agent.
Rcrawler(Website = "http://www.example.com/", no_cores=8, no_conn=8, Obeyrobots = TRUE,
Useragent="Mozilla 3.11")
######### Crawl the website using the default configuration and scrape specific data from
# the website; in this case we want all posts (articles and titles) matching two XPath patterns.
# We know that all blog posts have dates in their URLs, like 2017/09/08, so to avoid
# collecting category or other pages we can tell the crawler that the desired pages' URLs
# look like 4-digit/2-digit/2-digit/ using a regular expression.
# Note that you can use the exclude-pattern parameter to exclude a node from being
# extracted, e.g., in the case that a desired node includes (is a parent of) an
# undesired "child" node (an article containing inner ads or a menu).
Rcrawler(Website = "http://www.glofile.com/", dataUrlfilter = "/[0-9]{4}/[0-9]{2}/",
ExtractXpathPat = c("//*/article","//*/h1"), PatternsNames = c("content","title"))
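# A sketch of that exclusion (assuming your Rcrawler version provides the
# ExcludeXpathPat parameter; the "ads" class below is a hypothetical example of an
# unwanted child node inside the article):
Rcrawler(Website = "http://www.glofile.com/", dataUrlfilter = "/[0-9]{4}/[0-9]{2}/",
ExtractXpathPat = c("//*/article","//*/h1"), PatternsNames = c("content","title"),
ExcludeXpathPat = c("//*[@class='ads']"))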
######### Crawl the website and collect pages having URLs matching this regular expression
# pattern (/[0-9]{4}/[0-9]{2}/). Collected pages will be stored in a local repository
# named "myrepo", and the crawler stops after reaching the third level of website depth.
Rcrawler(Website = "http://www.example.com/", no_cores = 4, no_conn = 4,
dataUrlfilter = "/[0-9]{4}/[0-9]{2}/", DIR = "./myrepo", MaxDepth=3)
######### Crawl the website and collect/scrape only web pages related to a topic
# Crawl the website and collect pages containing keyword1 or keyword2 or both.
# To crawl a website and collect/scrape only web pages related to a specific topic,
# like gathering posts related to Donald Trump from a news website, the Rcrawler function
# has two useful parameters: KeywordsFilter and KeywordsAccuracy.
#
# KeywordsFilter: a character vector; here you should provide the keywords/terms of the
# topic you are looking for. Rcrawler will calculate an accuracy score based on the matched
# keywords and their occurrences on the page, then it collects or scrapes only web pages with
# at least a score of 1%, which means at least one keyword is found once on the page.
# This parameter must be a vector with at least one keyword, like c("mykeyword").
#
# KeywordsAccuracy: an integer value between 0 and 100, used only in combination with the
# KeywordsFilter parameter to determine the minimum accuracy of web pages to be collected
# /scraped. You can use one or more search terms; the accuracy is calculated based on
# how many of the provided keywords are found on the page plus their occurrence rate.
# For example, if only one keyword is provided, c("keyword"), 50% means one occurrence of
# "keyword" on the page and 100% means five occurrences of "keyword" on the page.
Rcrawler(Website = "http://www.example.com/", KeywordsFilter = c("keyword1", "keyword2"))
# Crawl the website and collect web pages that have an accuracy score higher than 50%
# for matching keyword1 and keyword2.
Rcrawler(Website = "http://www.example.com/", KeywordsFilter = c("keyword1", "keyword2"),
KeywordsAccuracy = 50)
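# As a sketch, the keyword filter can be combined with the extraction patterns used
# elsewhere in these examples, so that only topical pages are scraped rather than merely
# collected:
Rcrawler(Website = "http://www.example.com/", KeywordsFilter = c("keyword1", "keyword2"),
KeywordsAccuracy = 50, ExtractXpathPat = c("//*/article","//*/h1"),
PatternsNames = c("content","title"))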
######### Crawl a website's search results
# When scraping web pages specific to a topic of interest, the methods above have some
# disadvantages, namely complexity and time consumption, as the whole website needs to be
# crawled and each page analyzed to find the desired pages.
# As a result, you may want to make use of the website's search box and then directly
# crawl only search result pages. To do so, you may use the \code{crawlUrlfilter} and
# \code{dataUrlfilter} arguments, or \code{crawlZoneCSSPat}/\code{CrawlZoneXPath} with
# \code{dataUrlfilter}.
#- \code{crawlUrlfilter}: which URLs should be crawled (followed).
#- \code{dataUrlfilter}: which URLs should be collected (HTML downloaded or data extracted).
#- \code{crawlZoneCSSPat} or \code{CrawlZoneXPath}: the page section where the links to be
# crawled are located.
# Example 1
# The command below will crawl all result pages, knowing that result pages look like:
# http://glofile.com/?s=sur
# http://glofile.com/page/2/?s=sur
# so they all have "s=sur" in common.
# Post pages should be crawled as well; post URLs look like
# http://glofile.com/2017/06/08/placements-quelles-solutions-pour-dper/
# http://glofile.com/2017/06/08/taux-nette-detente/
# which contain a date and match the regex "[0-9]{4}/[0-9]{2}/[0-9]{2}".
Rcrawler(Website = "http://glofile.com/?s=sur", no_cores = 4, no_conn = 4,
crawlUrlfilter = c("[0-9]{4}/[0-9]{2}/[0-9]d{2}","s=sur"))
# In addition, by using dataUrlfilter we specify that:
# 1- only post pages should be collected/scraped, not all crawled result pages
# 2- additional URLs should not be retrieved from post pages
# (like post URLs listed in 'related topic' or 'see more' sections)
Rcrawler(Website = "http://glofile.com/?s=sur", no_cores = 4, no_conn = 4,
crawlUrlfilter = c("[0-9]{4}/[0-9]{2}/[0-9]d{2}","s=sur"),
dataUrlfilter = "[0-9]{4}/[0-9]{2}/[0-9]{2}")
# Example 2
# Collect job pages from the Indeed search results for "data analyst"
Rcrawler(Website = "https://www.indeed.com/jobs?q=data+analyst&l=Tampa,+FL",
no_cores = 4 , no_conn = 4,
crawlUrlfilter = c("/rc/","start="), dataUrlfilter = "/rc/")
# To also include the related job posts linked from each collected post, remove
# dataUrlfilter, as sketched below.
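# A sketch of that variant (the same call as above, simply without dataUrlfilter):
Rcrawler(Website = "https://www.indeed.com/jobs?q=data+analyst&l=Tampa,+FL",
no_cores = 4, no_conn = 4,
crawlUrlfilter = c("/rc/","start="))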
# Example 3
# Another way to control the crawler's behaviour and avoid fetching unnecessary
# links is to indicate to the crawler the page zone of interest
# (a page section from which links should be grabbed and crawled).
# The following example is similar to the last one, except that this time we provide
# the XPath pattern of the search results section whose links should all be crawled.
Rcrawler(Website = "https://www.indeed.com/jobs?q=data+analyst&l=Tampa,+FL",
no_cores = 4 , no_conn = 4,MaxDepth = 3,
crawlZoneXPath = c("//*[@id='resultsCol']"), dataUrlfilter = "/rc/")
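# The same zone restriction can also be expressed with a CSS selector via crawlZoneCSSPat
# (a sketch; the selector is assumed to match the same results column):
Rcrawler(Website = "https://www.indeed.com/jobs?q=data+analyst&l=Tampa,+FL",
no_cores = 4, no_conn = 4, MaxDepth = 3,
crawlZoneCSSPat = c("#resultsCol"), dataUrlfilter = "/rc/")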
######### Crawl and scrape a forum's posts and replies; each page has a title and
# a list of replies, ExtractCSSPat = c("head>title","div[class=\"post\"]").
# All replies share the same pattern, therefore we set ManyPerPattern to TRUE
# to extract all of them.
Rcrawler(Website = "https://bitcointalk.org/", ManyPerPattern = TRUE,
ExtractCSSPat = c("head>title","div[class=\"post\"]"),
no_cores = 4, no_conn = 4, PatternsName = c("Title","Replies"))
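# After the crawl finishes, Rcrawler leaves its results in the global environment
# (an INDEX data frame of crawled URLs and, when extraction patterns are used, a DATA
# list of scraped values), so a quick inspection could look like:
head(INDEX)
str(DATA[[1]])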
######### Scrape data / collect pages meeting your custom criteria.
# This is useful when filtering by keywords or URLs does not fulfill your needs, for example
# if you want to detect target pages with a classification/prediction model, or simply by
# checking a specific text value/field in the web page; you can create a custom filter
# function for page selection as follows.
# First we create our function and test it on one page.
pageinfo<-LinkExtractor(url="http://glofile.com/index.php/2017/06/08/sondage-quel-budget/",
ExternalLInks = TRUE)
Customfilterfunc<-function(pageinfo){
decision<-FALSE
# put your conditions here, e.g. check whether the page source contains a given string
# (the keyword below is only an illustration)
if(grepl("sondage", pageinfo$Info$Source_page, ignore.case = TRUE)) decision<-TRUE
# then return a boolean value: TRUE = should be collected, FALSE = should be skipped
return(decision)
}
# Finally, you just pass it to the Rcrawler function through FUNPageFilter; the crawler
# will then evaluate each page using your set of rules.
Rcrawler(Website = "http://glofile.com", no_cores=2, FUNPageFilter= Customfilterfunc )
######### Website Network
# Crawl the entire website and create network edge data (NetwEdges) of internal links.
# Using igraph, for example, you can plot the network with the following commands.
Rcrawler(Website = "http://glofile.com/" , no_cores = 4, no_conn = 4, NetworkData = TRUE)
library(igraph)
network<-graph_from_data_frame(NetwEdges, directed = TRUE)
plot(network)
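# As a further sketch, the same edge list can be used to rank pages by incoming internal
# links with igraph's degree(); the numeric vertex names are expected to index into the
# NetwIndex vector of URLs.
indeg <- degree(network, mode = "in")
head(sort(indeg, decreasing = TRUE))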
# Crawl the entire website and create network edge data of internal and external links.
Rcrawler(Website = "http://glofile.com/" , no_cores = 4, no_conn = 4, NetworkData = TRUE,
NetwExtLinks = TRUE)
###### Crawl a website using a web driver (virtual browser)
###########################################################################
## In some cases you may need to retrieve content from a web page which
## requires authentication via a login page, like private forums or platforms.
## In this case you need to run the \link{LoginSession} function to establish an
## authenticated browser session, then use \link{LinkExtractor} to fetch
## the URL using the authenticated session.
## In the example below we will try to fetch a private blog post which
## requires authentication.
# If you retrieve the page using the regular LinkExtractor function (or your browser):
page<-LinkExtractor("http://glofile.com/index.php/2017/06/08/jcdecaux/")
# the post is not visible because it is private.
# Now we will try to log in to access this post using the following credentials:
# username: demo and password: rc@pass@r
#1 Download and install the phantomjs headless browser (skip if already installed)
install_browser()
#2 Start the browser process
br <-run_browser()
#3 Create an authenticated session
# see \link{LoginSession} for more details
LS<-LoginSession(Browser = br, LoginURL = 'http://glofile.com/wp-login.php',
LoginCredentials = c('demo','rc@pass@r'),
cssLoginFields =c('#user_login', '#user_pass'),
cssLoginButton='#wp-submit' )
# Check if the login was successful
LS$session$getTitle()
#Or
LS$session$getUrl()
#Or
LS$session$takeScreenshot(file = 'sc.png')
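# With the session authenticated, the private post can now be fetched by passing it to
# LinkExtractor (a sketch; the returned page object feeds ContentScraper as usual), and
# the headless browser started above can be released afterwards:
page<-LinkExtractor(url="http://glofile.com/index.php/2017/06/08/jcdecaux/",
LoggedSession = LS)
stop_browser(br)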
# A second example: log in to a Submittable account (credentials are placeholders)
LS<-run_browser()
LS<-LoginSession(Browser = LS, LoginURL = 'https://manager.submittable.com/login',
LoginCredentials = c('your email','your password'),
cssLoginFields =c('#email', '#password'),
XpathLoginButton ='//*[@type="submit"]' )
# page<-LinkExtractor(url='https://manager.submittable.com/beta/discover/119087',
# LoggedSession = LS)
# cont<-ContentScraper(HTmlText = page$Info$Source_page,
# XpathPatterns = c("//*[@id=\"submitter-app\"]/div/div[2]/div/div/div/div/div[3]",
# "//*[@id=\"submitter-app\"]/div/div[2]/div/div/div/div/div[2]/div[1]/div[1]" ),
# PatternsName = c("Article","Title"), astext = TRUE )