readHTML: Read In a Simple HTML Document

Description

Returns a function that reads in a simple HTML document, extracting both its text and its metadata. The reader treats h1 headings as structure information, whereas text and tags between headings are treated as textual information. Metadata is extracted from the meta tags in the HTML head.

Usage

readHTML(...)

Arguments

...

arguments for the generator function.

Value

A function with the signature elem, language, load, id:

elem

A list with the two named elements content and uri. The first element must hold the document to be read in; the second element must hold a call to extract this document. The call is evaluated upon a request for load on demand.

language

A character vector giving the text's language.

load

A logical value indicating whether the document corpus should be immediately loaded into memory.

id

A character vector representing a unique identification string for the returned text document.

The function returns a StructuredTextDocument representing content.

Details

Formally this function is a function generator, i.e., it returns a function (which reads in a text document) with a well-defined signature, but can access passed over arguments via lexical scoping. This is especially useful for reader functions for complex data structures which need a lot of configuration options.
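The generator pattern described above can be sketched in base R. This is a minimal illustration, not the implementation used by readHTML: the names makeLineReader and skip are hypothetical, and the returned value is a plain list rather than a StructuredTextDocument.

```r
# Hypothetical generator, not part of tm: makeLineReader and skip are
# illustrative names chosen for this sketch.
makeLineReader <- function(skip = 0, ...) {
  force(skip)  # evaluate the configuration option when the reader is created
  # The returned function has the fixed reader signature, yet it can
  # still see `skip` through lexical scoping.
  function(elem, language, load, id) {
    lines <- elem$content
    if (skip > 0) lines <- lines[-seq_len(skip)]  # drop the first `skip` lines
    list(id = id, language = language, content = lines)
  }
}

# Configure once, then call the reader with the standard arguments.
reader <- makeLineReader(skip = 1)
doc <- reader(list(content = c("header", "first line", "second line"), uri = NULL),
              language = "en", load = TRUE, id = "doc1")
doc$content
```

The configuration (here skip) is supplied once to the generator, so code that later calls the reader only has to pass elem, language, load, and id.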

Examples

library(tm)
# Directory of sample HTML files shipped with the tm package
html <- system.file("texts", "html", package = "tm")
(Corpus(DirSource(html), readerControl = list(reader = readHTML, load = TRUE)))