Add an OCR layer to a PDF with Tesseract and OCRmyPDF
If you have a scanned PDF and want to be able to search and copy text from it, this tutorial shows you how to add an OCR text layer to it.
- First, install tesseract (v4.0 or higher), an OCR engine. You also need to install the language packages for the languages used in your document, so the program can properly detect the text.
  - In Debian/Ubuntu, these packages are called tesseract-ocr-<language> (e.g., tesseract-ocr-eng).
  - In Arch Linux, look for tesseract-data-<language> (e.g., tesseract-data-eng).
  - In Amazon Linux, look for tesseract-langpack-<language>.
- Then, install ocrmypdf (https://github.com/ocrmypdf/OCRmyPDF). It’s available in most distros (on Arch it’s in the AUR), but you can also install it with pip.
- Run:
    ocrmypdf -l <language> <input PDF or image> <output PDF>
    # Example: ocrmypdf -l eng test.pdf test-ocr.pdf
- You can run tesseract --list-langs to see which languages are installed.
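If you would rather script these steps, here is a minimal Python sketch that checks the installed languages before calling ocrmypdf. It is not part of either tool. The helper names (parse_list_langs, installed_languages, build_command, add_ocr_layer) are made up for this example, and it assumes that both tesseract and ocrmypdf are on your PATH and that tesseract --list-langs prints one header line ending in a colon followed by one language code per line.

```python
from __future__ import annotations

import subprocess


def parse_list_langs(output: str) -> set[str]:
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header line, which is assumed to end
        # with a colon (e.g. "List of available languages (3):").
        if not line or line.endswith(":"):
            continue
        langs.add(line)
    return langs


def installed_languages() -> set[str]:
    """Ask tesseract which language data files are installed."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Fall back to stderr in case a build prints the list there instead.
    return parse_list_langs(result.stdout or result.stderr)


def build_command(language: str, src: str, dst: str) -> list[str]:
    """Build the same command line shown in the steps above."""
    return ["ocrmypdf", "-l", language, src, dst]


def add_ocr_layer(language: str, src: str, dst: str) -> None:
    """Run ocrmypdf, but stop early if a requested language is missing."""
    # Several languages can be combined with "+", e.g. "eng+spa".
    missing = set(language.split("+")) - installed_languages()
    if missing:
        raise SystemExit(
            "Missing Tesseract language data: " + ", ".join(sorted(missing))
        )
    subprocess.run(build_command(language, src, dst), check=True)
```

Calling add_ocr_layer("eng", "test.pdf", "test-ocr.pdf") would then do the same as the example command above, with a clearer error message if the language package is not installed.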
If you have any suggestions, feel free to contact me via social media or email.