boilerpipeR (version 1.3.2)

Interface to the Boilerpipe Java Library

Description

Generic extraction of the main text content from HTML files, removing ads, sidebars, and headers with the boilerpipe Java library. The boilerpipe extraction heuristics perform robustly across a wide range of website templates.
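A minimal usage sketch, assuming the package and a working Java runtime are installed. The HTML string below is a made-up page used only for illustration, and `ArticleExtractor` (one of the functions listed for this package) is assumed here to accept the page source as a character string and return the extracted text.

```r
# Hypothetical page source: the navigation bar and footer are boilerplate,
# the paragraphs inside <div id="article"> are the main content.
html <- paste0(
  "<html><head><title>Sample page</title></head><body>",
  "<div id='nav'><a href='/'>Home</a> | <a href='/about'>About</a></div>",
  "<div id='article'><h1>Quarterly results published</h1>",
  "<p>The company reported its quarterly results on Tuesday, noting that ",
  "revenue grew compared with the same period of the previous year and ",
  "that operating costs remained stable throughout the quarter.</p>",
  "<p>Analysts had expected slower growth, and several of them revised ",
  "their forecasts for the rest of the year after the announcement.</p>",
  "</div>",
  "<div id='footer'>Copyright | Imprint | Contact</div>",
  "</body></html>"
)

# Only call into the package when it can actually be loaded.
if (requireNamespace("boilerpipeR", quietly = TRUE)) {
  # Assumption: the extractor takes the HTML as a character string.
  main_text <- boilerpipeR::ArticleExtractor(html)
  cat(main_text)
} else {
  message("boilerpipeR is not available; skipping extraction")
}
```

On very short pages the heuristics may keep little or no text, so real pages fetched from the web are a better test of the extractors.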

Install

install.packages('boilerpipeR')

Monthly Downloads

289

Version

1.3.2

License

Apache License (== 2.0)

Last Published

May 19th, 2021

Functions in boilerpipeR (1.3.2)

KeepEverythingExtractor

Marks everything as content.

CanolaExtractor

A full-text extractor trained on the 'krdwrd' Canola corpus (see https://krdwrd.org/trac/attachment/wiki/Corpora/Canola/CANOLA.pdf).

DefaultExtractor

A quite generic full-text extractor.

ArticleExtractor

A full-text extractor tuned towards news articles.

boilerpipeR-package

Extract the main content from HTML files.

content

Extractor

Generic extraction function which calls the boilerpipe extractors.

LargestContentExtractor

A full-text extractor which extracts the largest text component of a page.

ArticleSentencesExtractor

A full-text extractor tuned towards extracting sentences from news articles.

NumWordsRulesExtractor

A quite generic full-text extractor based solely on the number of words per block (the current, the previous, and the next block).
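Because the extractors differ in how aggressively they discard text, it can help to run several of them on the same page and compare how much each one keeps. The sketch below is illustrative only: the HTML string is invented, and each listed function is assumed to take the page source as a character string and return the extracted text.

```r
# Hypothetical page source with navigation, an article body, and a footer.
html <- paste0(
  "<html><body>",
  "<div id='nav'><a href='/'>Home</a> | <a href='/news'>News</a></div>",
  "<div id='article'><h1>City council approves new park</h1>",
  "<p>The city council voted on Monday to approve the construction of a ",
  "new public park on the site of the former rail yard, ending a debate ",
  "that had lasted for several years.</p>",
  "<p>Construction is expected to begin next spring, and residents will ",
  "be invited to comment on the final design over the coming months.</p>",
  "</div>",
  "<div id='footer'>Copyright | Imprint | Contact</div>",
  "</body></html>"
)

# Names taken from the function list above.
extractor_names <- c(
  "KeepEverythingExtractor",
  "DefaultExtractor",
  "ArticleExtractor",
  "LargestContentExtractor"
)

if (requireNamespace("boilerpipeR", quietly = TRUE)) {
  # Look up each exported function by name and apply it to the same page.
  results <- lapply(extractor_names, function(name) {
    extractor <- getExportedValue("boilerpipeR", name)
    extractor(html)  # assumption: HTML is passed as a character string
  })
  names(results) <- extractor_names

  # Number of characters kept by each extractor.
  kept <- vapply(results, function(x) sum(nchar(x)), integer(1))
  print(kept)
} else {
  message("boilerpipeR is not available; skipping comparison")
}
```

KeepEverythingExtractor marks everything as content, so its output can serve as a baseline against which the other extractors are judged.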