boilerpipeR (version 1.3.2)

boilerpipeR-package: Extract the main content from HTML files

Description

boilerpipeR provides an R interface to the boilerpipe Java library, created by Christian Kohlschutter (https://github.com/kohlschutter/boilerpipe). The library implements robust heuristics to extract the main content from HTML files, removing unnecessary elements such as ads, banners, and headers/footers.

See Also

Extractor, DefaultExtractor, ArticleExtractor

Examples

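# NOT RUN {
library(boilerpipeR)
# Load the example HTML content shipped with the package
data(content)
# Extract the main text with the default extractor
extract <- DefaultExtractor(content)
# Print the extracted text
cat(extract)
# }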
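
ArticleExtractor, listed under See Also, can be called in the same way. The following is a minimal sketch, assuming that ArticleExtractor accepts the HTML as a character string in its first argument, as DefaultExtractor does with the content data set. The sample HTML string and the variable names html and main_text are made up for illustration, and a working Java installation is assumed.

```r
library(boilerpipeR)

# Hypothetical sample page: navigation and footer wrapped around one article
html <- paste0(
  "<html><head><title>Sample page</title></head><body>",
  "<div id='nav'><a href='/'>Home</a> | <a href='/about'>About</a></div>",
  "<div id='article'><h1>A sample headline</h1><p>",
  paste(rep("This sentence stands in for the body of a longer article.", 10),
        collapse = " "),
  "</p></div>",
  "<div id='footer'>Copyright notice and footer links</div>",
  "</body></html>"
)

# Assumption: ArticleExtractor() takes the HTML as a character string
main_text <- ArticleExtractor(html)

# Print whatever the extractor kept as main content
cat(main_text)
```

Which parts of the page are kept depends on the heuristics of the chosen extractor, so it can be worth comparing the output of DefaultExtractor and ArticleExtractor on the same input.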
