robotstxt (version 0.7.15)

A 'robots.txt' Parser and 'Webbot'/'Spider'/'Crawler' Permissions Checker

Description

Provides functions to download and parse 'robots.txt' files. The package makes it easy to check whether bots (spiders, crawlers, scrapers, ...) are allowed to access specific resources on a domain.
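For example, a permission check can be done in one call (a minimal sketch; the domain and paths are illustrative, and the call needs network access to fetch the robots.txt file):

```r
library(robotstxt)

# Ask whether a generic bot ("*") may fetch specific paths on a domain.
# paths_allowed() downloads and parses the domain's robots.txt, then
# returns a logical vector with one element per path.
paths_allowed(
  paths  = c("/api/", "/images/"),
  domain = "example.com",
  bot    = "*"
)
```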

Install

install.packages('robotstxt')

Monthly Downloads

2,066

Version

0.7.15

License

MIT + file LICENSE

Maintainer

Pedro Baltazar

Last Published

August 29th, 2024

Functions in robotstxt (0.7.15)

paths_allowed

check if a bot has permissions to access page(s)
%>%

re-export magrittr pipe operator
http_subdomain_changed

check if an HTTP redirect changed the subdomain
parse_robotstxt

function parsing robots.txt
get_robotstxt

downloading robots.txt file
rt_last_http

storage for http request response objects
is_suspect_robotstxt

check whether a retrieved file looks suspect, i.e. likely not an actual robots.txt file
get_robotstxts

function to get multiple robotstxt files
is_valid_robotstxt

function that checks if a file is a valid / parsable robots.txt file
guess_domain

function guessing domain from path
list_merge

Merge a number of named lists in sequential order
print.robotstxt_text

printing robotstxt_text
remove_domain

function to remove domain from path
named_list

make automatically named list
rt_get_rtxt

load robots.txt files saved along with the package
null_to_defeault

replace NULL with a default value
paths_allowed_worker_spiderbar

paths_allowed_worker spiderbar flavor
parse_url

parse a URL into its components
request_handler_handler

helper for applying request handlers to HTTP events
rt_cache

get_robotstxt() cache
print.robotstxt

printing robotstxt
rt_get_useragent

extracting HTTP user agents from robots.txt
robotstxt

Generate a representation of a robots.txt file
rt_get_comments

extracting comments from robots.txt
sanitize_path

making paths uniform
rt_get_fields

extracting permissions from robots.txt
rt_get_fields_worker

extracting robotstxt fields
rt_list_rtxt

list robots.txt files saved along with the package
rt_request_handler

handle robots.txt HTTP requests and their events (redirects, errors, suspect content)
fix_url

fix a URL by adding a missing protocol prefix
http_domain_changed

check if an HTTP redirect changed the domain
http_was_redirected

check if an HTTP request was redirected
as.list.robotstxt_text

Method as.list() for class robotstxt_text
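As a usage sketch, the robotstxt() constructor listed above ties several of these functions together: it downloads, validates, and parses a robots.txt file into one object (the domain is illustrative, and the call needs network access):

```r
library(robotstxt)

# Download and parse a domain's robots.txt into a robotstxt object
rtxt <- robotstxt(domain = "example.com")

# Inspect the parsed content
rtxt$bots         # user agents mentioned in the file
rtxt$permissions  # data frame of Allow/Disallow rules

# Check path permissions via the object's check() method,
# without re-downloading the file
rtxt$check(paths = c("/", "/private/"), bot = "*")
```

Working from the object rather than calling paths_allowed() repeatedly avoids re-fetching the same robots.txt for every check.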