boilerpipeR (version 1.3.2)

Interface to the Boilerpipe Java Library

Description

Extracts the main text content from HTML files, removing ads, sidebars, and headers, using the boilerpipe Java library. The boilerpipe extraction heuristics perform robustly across a wide range of website templates.

Install

install.packages('boilerpipeR')

Monthly Downloads

308

Version

1.3.2

License

Apache License (== 2.0)

Repository

https://github.com/mannau/boilerpipeR

Maintainer

Mario Annau

Last Published

May 19th, 2021

Functions in boilerpipeR (1.3.2)

KeepEverythingExtractor

Marks everything as content.

CanolaExtractor

A full-text extractor trained on the 'krdwrd' Canola corpus (see https://krdwrd.org/trac/attachment/wiki/Corpora/Canola/CANOLA.pdf).

DefaultExtractor

A quite generic full-text extractor.

ArticleExtractor

A full-text extractor which is tuned towards news articles.

boilerpipeR-package

Extract the main content from HTML files

content

WordPress-generated web page (retrieved from the Quantivity blog, https://quantivity.wordpress.com). The content is stored as a character string and is ready to be extracted.

Extractor

Generic extraction function that calls the boilerpipe extractors.

LargestContentExtractor

A full-text extractor which extracts the largest text component of a page.

ArticleSentencesExtractor

A full-text extractor which is tuned towards extracting sentences from news articles.

NumWordsRulesExtractor

A quite generic full-text extractor solely based upon the number of words per block (the current, the previous and the next block).
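
Example

The functions listed above can be combined roughly as follows. This is a minimal sketch, not an authoritative reference: it assumes that boilerpipeR and a working Java/rJava setup are installed, that the bundled `content` dataset is loaded with `data(content)`, and that `Extractor()` takes the extractor name as its first argument and the HTML text as its second. The variable names `article_text` and `largest_text` are illustrative only.

```r
# Assumes boilerpipeR (and its Java/rJava dependency) is already installed.
library(boilerpipeR)

# Load the bundled example page; `content` is HTML stored as a character string.
data(content)

# Call one extractor directly (tuned towards news articles) ...
article_text <- ArticleExtractor(content)

# ... or select an extractor by name through the generic Extractor() function
# (assumed argument order: extractor name, then the HTML text).
largest_text <- Extractor("LargestContentExtractor", content)

# Compare how much text each heuristic keeps relative to the raw HTML.
nchar(content)
nchar(article_text)
nchar(largest_text)
```

Running several extractors on the same page and comparing the results is a quick way to see which heuristic suits a given site template.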