tesseract
Extract text from an image. Requires training data for the language you are reading. Works best for images with high contrast, little noise, and horizontal text.
Hello World
Simple example
library(tesseract)
text <- ocr("http://jeroenooms.github.io/images/testocr.png")
cat(text)
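By default `ocr()` uses the English engine. You can also pass an explicit engine object created with `tesseract()`, for example to select a language. A minimal sketch, assuming English training data is installed:

```r
library(tesseract)
# Create an OCR engine for a specific language ("eng" must be installed)
eng <- tesseract("eng")
text <- ocr("http://jeroenooms.github.io/images/testocr.png", engine = eng)
cat(text)
```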
Roundtrip test: render PDF to image and OCR it back to text
library(pdftools)
library(tiff)
# A PDF file with some text
setwd(tempdir())
news <- file.path(Sys.getenv("R_DOC_DIR"), "NEWS.pdf")
orig <- pdf_text(news)[1]
# Render pdf to a bitmap and save it as a tiff image
bitmap <- pdf_render_page(news, dpi = 300)
tiff::writeTIFF(bitmap, "page.tiff")
# Extract text from images
out <- ocr("page.tiff")
cat(out)
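To sanity-check the roundtrip, one can loosely compare the OCR output against the original text extracted with `pdf_text()`. A rough sketch (OCR output will not match byte-for-byte, so compare word by word):

```r
# Split both texts into words; whitespace and hyphenation may differ,
# so an exact string comparison would be too strict
words_orig <- strsplit(orig, "\\s+")[[1]]
words_ocr  <- strsplit(out, "\\s+")[[1]]

# Fraction of original words recovered by the OCR pass
mean(words_orig %in% words_ocr)
```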
Installation
On Windows and MacOS the binary package can be installed from CRAN:
install.packages("tesseract")
Installation from source on Linux or OSX requires the Tesseract library (see below).
Install from source
On Debian or Ubuntu install libtesseract-dev and libleptonica-dev. Also install tesseract-ocr-eng to run the English examples.
sudo apt-get install -y libtesseract-dev libleptonica-dev tesseract-ocr-eng
On Fedora we need tesseract-devel and leptonica-devel
sudo yum install tesseract-devel leptonica-devel
On RHEL and CentOS we need tesseract-devel and leptonica-devel from EPEL
sudo yum install epel-release
sudo yum install tesseract-devel leptonica-devel
On OS-X use tesseract from Homebrew:
brew install tesseract
Tesseract uses training data to perform OCR. Most systems default to English training data. To improve OCR performance for other languages you can install the training data from your distribution. For example, to install the Spanish training data:
- tesseract-ocr-spa (Debian, Ubuntu)
- tesseract-langpack-spa (Fedora, EPEL)
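After installing a language pack, pass the language code to `tesseract()`. A sketch using the package's `tesseract_download()` helper to fetch Spanish training data and run OCR with it (the input image name below is a hypothetical placeholder):

```r
library(tesseract)
# Download Spanish training data into the default tessdata location
tesseract_download("spa")

# Create an engine that uses the Spanish data
spanish <- tesseract("spa")
text <- ocr("spanish_document.png", engine = spanish)  # hypothetical image
cat(text)
```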
On other platforms you can manually download training data from GitHub and store it in a path on disk that you pass in the datapath parameter. Alternatively you can set a default path via the TESSDATA_PREFIX environment variable.
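A sketch of both options, assuming the training data lives in a custom directory (the path below is a hypothetical example):

```r
library(tesseract)
# Point one engine at a directory containing eng.traineddata
# ("~/custom-tessdata" is a hypothetical path)
eng <- tesseract("eng", datapath = "~/custom-tessdata")
text <- ocr("page.png", engine = eng)

# Or set a default location for all engines via the environment variable
Sys.setenv(TESSDATA_PREFIX = "~/custom-tessdata")
```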