tm (version 0.5-9.1)

readPDF: Read In a PDF Document

Description

Return a function which reads in a Portable Document Format (PDF) document, extracting both its text and its meta data.

Usage

readPDF(PdftotextOptions = "", ...)

Arguments

PdftotextOptions
Options passed over to pdftotext.
...
Arguments for the generator function.

Value

A function with the signature elem, language, id:

  • elem: A list with the named element uri of type character, which must hold a valid file name.
  • language: A character vector giving the text's language.
  • id: A character vector representing a unique identification string for the returned text document.

The function returns a PlainTextDocument representing the text and meta data in content.

Details

Formally, this function is a function generator: it returns a function with a well-defined signature that reads in a text document. The returned function can still access the arguments passed to the generator (e.g., options to pdftotext) via lexical scoping.
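
The generator pattern can be illustrated without tm. The following is a minimal sketch in base R, not tm's implementation: make_reader and its prefix argument are hypothetical names chosen for illustration. The outer function captures its argument, and the returned function, which has the fixed signature elem, language, id, still sees that argument through lexical scoping.

```r
# Hypothetical generator, for illustration only (not part of tm).
make_reader <- function(prefix = "") {
    # Evaluate the argument now so the closure captures its current value.
    force(prefix)
    # The returned function has a fixed signature but can still read
    # 'prefix' from the enclosing environment.
    function(elem, language, id) {
        list(content = paste0(prefix, elem$content),
             language = language,
             id = id)
    }
}

reader <- make_reader(prefix = "> ")
doc <- reader(elem = list(content = "some text"),
              language = "en",
              id = "id1")
doc$content
```

Here the caller of reader never passes prefix, yet the result depends on it. In the same way, the function returned by readPDF can use the pdftotext options without them appearing in its signature.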

Note that this PDF reader requires both pdfinfo and pdftotext to be installed and accessible on your system. Both are command line utilities from the Poppler PDF rendering library (see http://poppler.freedesktop.org/).
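
One way to check this requirement before calling the reader is sketched below, using only base R. The variable names tools and not_found are illustrative. The sketch only tests whether the two programs can be located on the search path, not whether they run correctly.

```r
# Look up both command line utilities on the search path.
tools <- Sys.which(c("pdfinfo", "pdftotext"))

# Sys.which() typically returns "" for a program it cannot find.
not_found <- names(tools)[!nzchar(tools)]

if (length(not_found) > 0) {
    message("Not found on the search path: ",
            paste(not_found, collapse = ", "))
} else {
    message("Both pdfinfo and pdftotext were found.")
}
```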

See Also

getReaders to list available reader functions.

Examples

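# Run only if both Poppler utilities can be found on this system.
if(all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) {
    # Path to the PDF vignette shipped with the tm package.
    uri <- system.file(file.path("doc", "tm.pdf"), package = "tm")
    # Generate the reader, then call it immediately on the file.
    pdf <- readPDF(PdftotextOptions = "-layout")(elem = list(uri = uri),
                                                 language = "en",
                                                 id = "id1")
    # Show the first 13 elements of the extracted text.
    pdf[1:13]
}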