pdf_ocr_text

pdf_ocr_data

Perform OCR text extraction. This requires that the <code>tesseract</code> package is installed.

Utilities based on 'libpoppler' for extracting text, fonts, attachments and
metadata from a PDF file. Also supports high quality rendering of PDF documents into
PNG, JPEG, TIFF format, or into raw bitmap vectors for further processing in R.

Jeroen Ooms

pdftools

Text Extraction, Rendering and Converting of PDF Documents

pdf_ocr_text function

Arguments

<dl><dt>pdf</dt>
<dd>file path or raw vector with pdf data</dd>
<dt>pages</dt>
<dd>which pages of the pdf file to extract</dd>
<dt>opw</dt>
<dd>string with owner password to open pdf</dd>
<dt>upw</dt>
<dd>string with user password to open pdf</dd>
<dt>dpi</dt>
<dd>resolution at which each page is rendered to an image; passed to <code>pdf_convert</code></dd>
<dt>language</dt>
<dd>passed to tesseract to specify the language of the engine</dd>
<dt>options</dt>
<dd>passed to tesseract to specify OCR parameters</dd></dl>
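The arguments listed here can be combined as in the following minimal sketch. It is an illustration under stated assumptions, not the package's official example: the input PDF is a throwaway file generated with base R graphics (the name `pdf_file` and the "Hello OCR" text are hypothetical), `dpi = 300` is an arbitrary choice, and the OCR calls only run when the `tesseract` package is available.

```r
library(pdftools)

# Hypothetical input: draw one page of large text with base R graphics
# so that the example does not depend on any external file.
pdf_file <- tempfile(fileext = ".pdf")
grDevices::pdf(pdf_file)
plot.new()
text(0.5, 0.5, "Hello OCR", cex = 3)
invisible(dev.off())

# OCR needs the tesseract package, so guard the calls.
if (requireNamespace("tesseract", quietly = TRUE)) {
  # Render page 1 at 300 dpi and run the English OCR engine on it;
  # the result is a character vector of recognized text.
  txt <- pdf_ocr_text(pdf_file, pages = 1, dpi = 300, language = "eng")
  cat(txt, sep = "\n")

  # pdf_ocr_data takes the same arguments and is expected to return
  # the OCR results as data rather than plain text; inspect with str().
  dat <- pdf_ocr_data(pdf_file, pages = 1, dpi = 300, language = "eng")
  str(dat)
}
```

A higher `dpi` generally gives the OCR engine more detail to work with, at the cost of slower rendering, so it can be worth trying a lower value first on large documents.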

OCR text extraction — pdf_ocr_text

