Returns a function that reads a simple HTML document and extracts
both its text and its metadata. The reader treats h1 headings as
structure information, whereas text and tags between headings are
treated as textual content. Metadata is extracted from the meta tags
in the HTML head.
Usage
readHTML(...)
Arguments
...
arguments for the generator function.
Value
A function with the signature elem, language, load, id:
elem
A list with the two named elements content and uri. The first
element must hold the document to be read in; the second element
must hold a call that extracts this document. The call is evaluated
when the document is loaded on demand.
language
A character vector giving the text's language.
load
A logical value indicating whether the document corpus should be
loaded into memory immediately.
id
A character vector giving a unique identification string for the
returned text document.
The function returns a StructuredTextDocument representing
content.
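As an illustration of the elem argument described above, the following base R sketch builds such a list by hand. It is not taken from the package: the temporary file, its contents, and the use of file() as the stored call are assumptions made for the example.

```r
# Hypothetical HTML file standing in for a real document.
path <- tempfile(fileext = ".html")
writeLines(c(
  "<html><head><meta name=\"author\" content=\"A. Writer\"></head>",
  "<body><h1>Intro</h1><p>Some text.</p></body></html>"
), path)

# elem holds the document itself and an unevaluated call that can
# fetch it again later (assumption: a file() connection to the path).
elem <- list(
  content = readLines(path),
  uri     = substitute(file(p), list(p = path))
)

# Load on demand: evaluate the stored call only when it is needed.
con <- eval(elem$uri)
reloaded <- readLines(con)
close(con)
```

Storing the call rather than its result is what allows loading to be deferred: nothing is opened until eval() is applied to elem$uri.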
Details
Formally, this function is a function generator, i.e., it returns a
function (which reads in a text document) with a well-defined
signature. The returned function can still access the arguments
passed to the generator via lexical scoping. This is especially
useful for reader functions for complex data structures that need
many configuration options.
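The generator pattern can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: makeHeadingReader, its heading argument, and the plain list it returns are hypothetical, and the regular-expression extraction stands in for real HTML parsing.

```r
# Hypothetical generator: 'heading' is a configuration option captured
# by the returned reader through lexical scoping.
makeHeadingReader <- function(heading = "h1") {
  pattern <- sprintf("<%s[^>]*>(.*?)</%s>", heading, heading)

  # The returned function has the fixed signature elem, language, load, id.
  # 'load' is accepted for signature compatibility but unused in this sketch.
  function(elem, language, load, id) {
    html <- paste(elem$content, collapse = "\n")
    matches <- regmatches(html, gregexpr(pattern, html, perl = TRUE))[[1]]
    headings <- gsub("<[^>]+>", "", matches)  # strip the surrounding tags
    list(id = id, language = language, headings = headings)
  }
}

reader <- makeHeadingReader("h1")
doc <- reader(
  list(content = "<h1>Intro</h1><p>Text</p><h1>Methods</h1>", uri = NULL),
  language = "en", load = TRUE, id = "doc-1"
)
doc$headings
```

The caller only ever sees the four-argument reader. Any configuration stays in the generator's environment, so readers with different options can share one calling convention.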
See Also
Use getReaders to list available reader functions.