getPDF: Extract text from PDF files and return a word-occurrence data.frame.
Description
getPDF returns a word-occurrence data.frame from PDF files. It requires XPDF (http://www.foolabs.com/xpdf/download.html) and uses the parallel package to perform parallel computation.
Usage
getPDF(
myPDFs,
minword = 1,
maxword = 20,
minFreqWord = 1,
pathToPdftotext = ""
)
Value
A list of lists, each containing a word-occurrence data.frame and a file name.
Arguments
- myPDFs
A character vector containing PDF file names.
- minword
An integer specifying the minimum number of letters per word in the returned data.frame.
- maxword
An integer specifying the maximum number of letters per word in the returned data.frame.
- minFreqWord
An integer specifying the minimum word frequency in the returned data.frame.
- pathToPdftotext
A character string giving an alternative path to the XPDF pdftotext program; see the Details section.
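The three filtering arguments can be pictured with a short base R sketch. This is not the package's code: filter_words is a hypothetical helper, the tokenization rule (lower-casing and splitting on non-letters) is an assumption, and so is the choice to treat minword and maxword as inclusive bounds.

```r
# Hypothetical helper, not part of the package: it only illustrates
# what minword, maxword and minFreqWord could mean on a plain string.
filter_words <- function(text, minword = 1, maxword = 20, minFreqWord = 1) {
  # Assumed tokenization: lower-case, split on non-letters, drop empty tokens.
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  words <- words[nzchar(words)]
  # Keep words whose length lies between minword and maxword (inclusive here).
  n <- nchar(words)
  words <- words[n >= minword & n <= maxword]
  # Count occurrences and keep words seen at least minFreqWord times.
  counts <- table(words)
  counts <- counts[counts >= minFreqWord]
  data.frame(
    word = names(counts),
    freq = as.integer(counts),
    stringsAsFactors = FALSE
  )
}

res <- filter_words("the cat and the hat and the bat",
                    minword = 3, minFreqWord = 2)
print(res)
```

With these settings only words of at least three letters that occur at least twice remain in the result.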
Details
getPDF uses the XPDF pdftotext program to extract the content of PDF files into a TXT file. If pdftotext is not in the PATH, an alternative is to provide the full path of the program in the pathToPdftotext parameter.
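The lookup described above can be sketched roughly as follows. The helper pdf_to_txt is hypothetical and is not the package's internal implementation; it assumes that pathToPdftotext holds the full path of the pdftotext executable and that the output file sits next to the PDF with a .txt extension.

```r
# Hypothetical helper, not the package's internal code.
pdf_to_txt <- function(pdf, pathToPdftotext = "") {
  # Use the explicit path when given, otherwise search the PATH.
  exe <- if (nzchar(pathToPdftotext)) {
    pathToPdftotext
  } else {
    unname(Sys.which("pdftotext"))
  }
  if (!nzchar(exe) || !file.exists(exe)) {
    stop("pdftotext not found; install XPDF or set pathToPdftotext")
  }
  # Assumed naming rule: same file name with a .txt extension.
  txt <- sub("\\.pdf$", ".txt", pdf, ignore.case = TRUE)
  # pdftotext takes the input PDF and the output text file as arguments.
  status <- system2(exe, args = shQuote(c(pdf, txt)))
  if (status != 0) stop("pdftotext failed on ", pdf)
  txt
}
```

Calling pdf_to_txt("mypdf.pdf") would rely on the PATH, while passing a second argument points the sketch at a specific installation.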
Examples
if (FALSE) {
getPDF(myPDFs = "mypdf.pdf")
}