Add an OCR layer to a PDF with Tesseract and OCRmyPDF
If you have a scanned PDF and want to be able to search and copy text from it, this tutorial shows you how to do it.
- First, install tesseract (v4.0 or higher), an OCR engine. You also need to install the language packages for the languages used in your documents, so the program can properly recognize the text.
  - In Debian/Ubuntu, these packages are called tesseract-ocr-<language> (e.g. tesseract-ocr-eng).
  - In Arch Linux, look for tesseract-data-<language> (e.g. tesseract-data-eng).
  - In Amazon Linux, look for tesseract-langpack-<language>.
- Then, install ocrmypdf (https://github.com/ocrmypdf/OCRmyPDF). It’s available in most distros (on Arch it’s in the AUR), but you can also install it with pip.
- Run:
  ocrmypdf -l <language> <input PDF or image> <output PDF>
  # e.g. ocrmypdf -l eng test.pdf test-ocr.pdf
  - You can run tesseract --list-langs to see which languages are installed.
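The steps above can be combined into a small wrapper that checks the installed languages before running the OCR. This is only a sketch: the function names missing_langs and ocr_pdf are made up for illustration, and it assumes that tesseract --list-langs prints one language code per line after a header line starting with "List of available languages".

```shell
#!/bin/sh
# Hypothetical helper (not part of tesseract or ocrmypdf):
# print every language from "$1" (e.g. "eng+spa") that does not appear
# in the newline-separated list of installed languages given in "$2".
missing_langs() {
    wanted=$1
    installed=$2
    for code in $(printf '%s' "$wanted" | tr '+' ' '); do
        # -x matches whole lines only, -F treats the code as a literal string.
        printf '%s\n' "$installed" | grep -qxF "$code" || printf '%s\n' "$code"
    done
}

# Hypothetical wrapper: run ocrmypdf only if all requested languages are installed.
# Usage: ocr_pdf <language> <input PDF or image> <output PDF>
ocr_pdf() {
    requested=$1
    input=$2
    output=$3
    # Assumption: the header line contains "List of available languages";
    # everything else is treated as a language code.
    installed=$(tesseract --list-langs 2>&1 | grep -v 'List of available languages')
    missing=$(missing_langs "$requested" "$installed")
    if [ -n "$missing" ]; then
        echo "Missing Tesseract language data: $missing" >&2
        return 1
    fi
    ocrmypdf -l "$requested" "$input" "$output"
}
```

After loading these functions in your shell, ocr_pdf eng test.pdf test-ocr.pdf behaves like the command above, but stops with a message if the eng language package is not installed. Several languages can be passed joined with a plus sign, for example eng+spa.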
If you have any suggestions, feel free to contact me via social media or email.