Extractor

Character string specifying the extractor to use.
It can take one of the following values:<ul>
<li><code><a rd-options="" href="/link/ArticleExtractor?package=boilerpipeR&version=1.3.2" data-mini-rdoc="boilerpipeR::ArticleExtractor">ArticleExtractor</a></code>: A full-text extractor tuned towards news articles.</li>
<li><code><a rd-options="" href="/link/ArticleSentencesExtractor?package=boilerpipeR&version=1.3.2" data-mini-rdoc="boilerpipeR::ArticleSentencesExtractor">ArticleSentencesExtractor</a></code>: A full-text extractor tuned towards extracting sentences from news articles.</li>
<li><code><a rd-options="" href="/link/CanolaExtractor?package=boilerpipeR&version=1.3.2" data-mini-rdoc="boilerpipeR::CanolaExtractor">CanolaExtractor</a></code>: A full-text extractor trained on the 'krdwrd' Canola corpus.</li>
<li><code><a rd-options="" href="/link/DefaultExtractor?package=boilerpipeR&version=1.3.2" data-mini-rdoc="boilerpipeR::DefaultExtractor">DefaultExtractor</a></code>: A fairly generic full-text extractor.</li>
<li><code><a rd-options="" href="/link/KeepEverythingExtractor?package=boilerpipeR&version=1.3.2" data-mini-rdoc="boilerpipeR::KeepEverythingExtractor">KeepEverythingExtractor</a></code>: Marks everything as content.</li>
<li><code><a rd-options="" href="/link/LargestContentExtractor?package=boilerpipeR&version=1.3.2" data-mini-rdoc="boilerpipeR::LargestContentExtractor">LargestContentExtractor</a></code>: A full-text extractor that extracts the largest text component of a page.</li>
<li><code><a rd-options="" href="/link/NumWordsRulesExtractor?package=boilerpipeR&version=1.3.2" data-mini-rdoc="boilerpipeR::NumWordsRulesExtractor">NumWordsRulesExtractor</a></code>: A fairly generic full-text extractor based solely on the number of words per block.</li>
</ul>

exname

content

Logical; should the specified content be treated as the actual text to extract from (TRUE), or as a URL from which the HTML document is first downloaded and then extracted (FALSE)? Defaults to TRUE.

asText

This function is the actual workhorse that directly calls the boilerpipe Java library. It is typically called through
the wrapper functions listed for parameter <code>exname</code>.
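
A minimal sketch of how the generic function and one of its wrappers might be called. It assumes the boilerpipeR package is installed with a working Java setup, and that the call signature is <code>Extractor(exname, content, asText = TRUE, ...)</code>, as the parameter descriptions on this page suggest. The HTML string and the example URL are made up for illustration.

```r
# Sketch only: assumes boilerpipeR and a working Java/rJava installation.
library(boilerpipeR)

# Hypothetical HTML snippet: a navigation bar followed by the main text.
html <- paste0(
  "<html><body>",
  "<div>Home | News | About | Contact</div>",
  "<p>This paragraph stands in for the main article text. It is long enough ",
  "that a word-count based heuristic has something to work with, and it ",
  "contains several complete sentences. Boilerplate such as the menu above ",
  "should ideally be dropped from the result.</p>",
  "</body></html>"
)

# Generic call: the extractor is selected by name through 'exname'.
# asText = TRUE (the default) means 'content' is the HTML itself.
text_generic <- Extractor("DefaultExtractor", html, asText = TRUE)

# The same extractor called through its wrapper function.
text_wrapper <- DefaultExtractor(html)

# With asText = FALSE, 'content' is treated as a URL to download first.
# Left commented out because it needs network access; the URL is a placeholder.
# text_url <- Extractor("ArticleExtractor", "https://example.com/article.html",
#                       asText = FALSE)
```

Calling the wrapper is usually the more readable choice; the generic form can be handy when the extractor name is chosen at run time, for example when comparing several extractors on the same page.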

Generic extraction of main text content from HTML files; removes
ads, sidebars and headers using the boilerpipe
<https://github.com/kohlschutter/boilerpipe> Java library. The
extraction heuristics from boilerpipe perform robustly across a wide
range of web site templates.

Mario Annau

boilerpipeR

Interface to the Boilerpipe Java Library

Extractor function


Extractor: Generic extraction function which calls boilerpipe extractors
