Add an OCR layer to a PDF with Tesseract and OCRmyPDF

Last updated: Nov 11, 2021

Author: Ricardo Sanz

If you have a scanned PDF and want to be able to search and copy text from it, in this tutorial I will show you how to add an OCR text layer to it.

First, install Tesseract (v4.0 or higher), an OCR engine. You also need the language packages for the languages used in your document, so the program can properly recognize the text.
- In Debian/Ubuntu, these packages are called tesseract-ocr-<language> (e.g.: tesseract-ocr-eng).
- In Arch Linux, look for tesseract-data-<language> (e.g.: tesseract-data-eng).
- In Amazon Linux, look for tesseract-langpack-<language>.
Then, install ocrmypdf (https://github.com/ocrmypdf/OCRmyPDF). It’s available in most distros (on Arch it’s in the AUR), but you can also install it with pip.
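Before running the OCR, it can help to confirm that the language pack you installed is actually visible to Tesseract. The sketch below is one way to do that: `lang_in_list` is a hypothetical helper name (not part of Tesseract or OCRmyPDF) that reads a list of language codes on standard input and succeeds only if the requested code is on a line of its own.

```shell
# Hypothetical helper (not part of Tesseract): succeeds if the language
# code given as $1 appears as a whole line in the list read from stdin.
lang_in_list() {
  # -q: quiet, -x: match the whole line, -F: treat the code as a fixed string
  grep -qxF -- "$1"
}

# Intended use (left commented out so the snippet has no side effects):
#   tesseract --list-langs 2>&1 | lang_in_list eng && echo "eng is installed"
# 2>&1 is there because some Tesseract versions print the list to stderr.
```

If the check fails, the most likely cause is a missing language package from the list above.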

Run:

ocrmypdf -l <language> <input PDF or image> <output PDF>
# Example: ocrmypdf -l eng test.pdf test-ocr.pdf
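If you have several scanned files, the same command can be wrapped in a small loop. This is only a sketch under a few assumptions: `ocr_output_name` and `ocr_all` are hypothetical helper names, the "-ocr" suffix simply mirrors the test.pdf / test-ocr.pdf example above, and files already ending in "-ocr.pdf" are skipped so a second run does not process its own output.

```shell
# Hypothetical helper: build the output name by replacing the extension
# with "-ocr.pdf" (test.pdf -> test-ocr.pdf).
ocr_output_name() {
  printf '%s-ocr.pdf\n' "${1%.*}"
}

# Hypothetical helper: run ocrmypdf on every PDF in directory $1,
# using language code $2.
ocr_all() {
  for f in "$1"/*.pdf; do
    [ -e "$f" ] || continue                    # no PDFs: the glob stays literal
    case "$f" in *-ocr.pdf) continue ;; esac   # skip files produced earlier
    ocrmypdf -l "$2" "$f" "$(ocr_output_name "$f")"
  done
}
```

For example, `ocr_all . eng` would process every PDF in the current directory with the English language pack.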

If you have any suggestions, feel free to contact me via social media or email.
