readPDF(engine = c("xpdf", "Rpoppler", "ghostscript", "Rcampdf", "custom"), control = list(info = NULL, text = NULL))
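engine: a character string giving the preferred PDF extraction engine (see Details).
control: a list of control options for the engine, with the named components info and text (see Details).

readPDF returns a function with the following formals:
elem: a named list with the component uri, which must hold a valid file name.
language: a string giving the language.
id: not used.
The returned function yields a PlainTextDocument representing the text and metadata extracted from elem$uri.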

Formally, readPDF is a function generator: the function it returns has a well-defined signature but can still access the arguments passed to readPDF (the engine and control options) via lexical scoping.

Available PDF extraction engines are as follows.
"xpdf": the command line pdfinfo and pdftotext executables, which must be installed and accessible on your system. Suitable utilities are provided by the Xpdf (http://www.foolabs.com/xpdf/) PDF viewer or by the Poppler (http://poppler.freedesktop.org/) PDF rendering library.
"Rpoppler": the functions PDF_info and PDF_text in package Rpoppler.
"ghostscript": Ghostscript.
"Rcampdf": the functions pdf_info and pdf_text in package Rcampdf, available from the repository at http://datacube.wu.ac.at.
"custom": a custom, user-provided extraction engine.
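
The function-generator behaviour described above can be illustrated in plain R. The sketch below is not tm's implementation: make_reader is a hypothetical stand-in for readPDF, and the list it returns is a placeholder for a real document object. It only shows how a returned function with fixed formals can still see the engine and control options via lexical scoping.

```r
# Hypothetical stand-in for readPDF(): returns a reader function that
# remembers `engine` and `control` through lexical scoping.
make_reader <- function(engine = c("xpdf", "custom"),
                        control = list(info = NULL, text = NULL)) {
  engine <- match.arg(engine)
  # The returned function has the fixed signature (elem, language, id),
  # yet `engine` and `control` remain visible from the enclosing environment.
  function(elem, language, id) {
    list(uri = elem$uri,
         language = language,
         engine = engine,
         text_options = control$text)
  }
}

reader <- make_reader(engine = "xpdf", control = list(text = "-layout"))
result <- reader(elem = list(uri = "file:///tmp/example.pdf"),
                 language = "en", id = "id1")
str(result)
```

Because the options are captured when the reader is created, the same reader can be applied to many documents without passing the engine settings again.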

Control parameters for engine "xpdf" are as follows.
info: a character vector of options passed to the pdfinfo executable.
text: a character vector of options passed to the pdftotext executable.
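
To make the role of these options concrete, the following sketch shows one way option strings could be placed on a pdftotext command line. It is an illustration under stated assumptions, not tm's actual code: pdftotext_args is a hypothetical helper and report.pdf is a placeholder file name. The only facts relied on are that pdftotext takes its options before the input file and writes to standard output when the output file is given as "-".

```r
# Hypothetical helper: assemble the argument vector for a pdftotext call.
pdftotext_args <- function(file, options = NULL) {
  # Options come first, then the quoted input file; "-" sends text to stdout.
  c(options, shQuote(file), "-")
}

args <- pdftotext_args("report.pdf", options = "-layout")
print(args)

# Only attempt the call when the executable and the file are both available.
if (nzchar(Sys.which("pdftotext")) && file.exists("report.pdf")) {
  text <- system2("pdftotext", args, stdout = TRUE)
}
```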
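
Control parameters for engine "custom" are as follows.
info: a function extracting metadata from a PDF. The function must accept a file path as its first argument and must return a named list with the components Author (as character string), CreationDate (of class POSIXlt), Subject (as character string), Title (as character string), and Creator (as character string).
text: a function extracting the text content from a PDF. The function must accept a file path as its first argument and must return a character vector.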
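
A minimal sketch of a custom engine follows. The names my_info and my_text are hypothetical, the metadata values are placeholders rather than values parsed from the PDF, and it is assumed that both functions receive a file path as their first argument and that the text function returns a character vector. The text extractor here simply shells out to pdftotext, which must be installed for it to work.

```r
# Hypothetical metadata extractor: returns a named list with the five
# components listed above. All values are placeholders.
my_info <- function(file) {
  list(Author = "unknown",
       # Stand-in for the PDF creation date: the file modification time.
       CreationDate = as.POSIXlt(file.mtime(file)),
       Subject = "",
       Title = basename(file),
       Creator = "unknown")
}

# Hypothetical text extractor: one character string per line of output.
my_text <- function(file) {
  system2("pdftotext", c(shQuote(file), "-"), stdout = TRUE)
}

info <- my_info("example.pdf")
names(info)

# Create the reader only if tm is installed; nothing is read at this point.
if (requireNamespace("tm", quietly = TRUE)) {
  reader <- tm::readPDF(engine = "custom",
                        control = list(info = my_info, text = my_text))
}
```

Keeping the two extractors as separate functions means either one can be swapped out, for example to use a different tool for metadata than for text.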

See also Reader for basic information on the reader infrastructure employed by package tm.
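
library(tm)

# Locate the PDF vignette shipped with tm and express it as a file URI.
uri <- sprintf("file://%s", system.file(file.path("doc", "tm.pdf"), package = "tm"))

# Run the xpdf example only if both command line tools can be found.
if (all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) {
    pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
                                                     language = "en",
                                                     id = "id1")
    content(pdf)[1:13]
}

# Build a corpus from the same file using the Ghostscript engine.
VCorpus(URISource(uri, mode = ""),
        readerControl = list(reader = readPDF(engine = "ghostscript")))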