This is the actual workhorse, which calls the boilerpipe Java library directly. It is typically called through the functions listed under parameter exname.
Extractor(exname, content, asText = TRUE, ...)
exname: character specifying the extractor to be used. It can take one of the following values:
ArticleExtractor
A full-text extractor which is tuned towards news articles.
ArticleSentencesExtractor
A full-text extractor which is tuned towards extracting sentences from news articles.
CanolaExtractor
A full-text extractor trained on the 'krdwrd' Canola corpus.
DefaultExtractor
A quite generic full-text extractor.
KeepEverythingExtractor
Marks everything as content.
LargestContentExtractor
A full-text extractor which extracts the largest text component of a page.
NumWordsRulesExtractor
A quite generic full-text extractor solely based upon the number of words per block.
content: text content or URL, as character
asText: logical; should content be treated as the actual text to extract from (TRUE), or as a URL from which the HTML document is first downloaded and then extracted (FALSE)? Defaults to TRUE.
...: additional parameters
Value: the extracted text, as character
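The arguments above can be illustrated with a minimal sketch. It assumes the boilerpipeR package and a working Java/rJava setup are installed; the HTML string and the URL in the comment are made up for illustration, and the exact text returned depends on the extractor chosen.

```r
# Hypothetical HTML document: a navigation bar plus one long paragraph.
html <- paste0(
  "<html><head><title>Example</title></head><body>",
  "<div id='nav'>Home | About | Contact</div>",
  "<p>",
  paste(rep("This is the main article text of the page.", 20), collapse = " "),
  "</p>",
  "</body></html>"
)

# Only run the extraction if boilerpipeR is available (assumption: it may not be).
if (requireNamespace("boilerpipeR", quietly = TRUE)) {
  # Direct call: the extractor is selected by name through exname,
  # and asText = TRUE tells Extractor that 'html' is the content itself.
  txt <- boilerpipeR::Extractor("ArticleExtractor", html, asText = TRUE)

  # Same extractor, called through the function named after the exname value.
  txt_wrapped <- boilerpipeR::ArticleExtractor(html)

  # The return value is the extracted text as character.
  stopifnot(is.character(txt))

  # To extract from a URL instead, pass asText = FALSE so the page is
  # downloaded first (hypothetical URL, not run here):
  # boilerpipeR::Extractor("DefaultExtractor", "http://www.example.com/a.html",
  #                        asText = FALSE)
}
```

In everyday use the named functions are usually the more readable choice; calling Extractor directly is mainly handy when the extractor name is held in a variable, for example when comparing several extractors on the same page.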