If you have a scanned PDF and want to be able to search and copy its text, this tutorial shows you how to do it.

  1. First, install tesseract (v4.0 or higher), an OCR engine. You also need the language packages for the languages used in your document, so the program can properly recognize the text.
    • In Debian/Ubuntu, these packages are called tesseract-ocr-<language> (e.g.: tesseract-ocr-eng).
    • In Arch Linux, look for tesseract-data-<language> (e.g.: tesseract-data-eng).
    • In Amazon Linux, look for tesseract-langpack-<language>.
  2. Then, install ocrmypdf (https://github.com/ocrmypdf/OCRmyPDF). It’s available in most distros (on Arch, it’s in the AUR), but you can also install it with pip.
  3. Run:
    ocrmypdf -l <language> <input PDF or image> <output PDF>
    # Example: ocrmypdf -l eng test.pdf test-ocr.pdf
    
    • You can run tesseract --list-langs to see which languages are installed.
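
As a rough sketch, the steps above could be wrapped in a small shell script like the one below. The function names (lang_package, lang_installed, ocr_pdf) are my own and purely hypothetical; they are not part of tesseract or ocrmypdf. The script also assumes that tesseract --list-langs prints one language code per line, so treat it as a starting point rather than a finished tool.

```shell
#!/bin/sh

# Hypothetical helper: print the language package name for a distro,
# following the naming patterns listed in step 1.
lang_package() {
    distro=$1
    lang=$2
    case "$distro" in
        debian|ubuntu) echo "tesseract-ocr-$lang" ;;
        arch)          echo "tesseract-data-$lang" ;;
        amazon)        echo "tesseract-langpack-$lang" ;;
        *)             echo "unknown distro: $distro" >&2; return 1 ;;
    esac
}

# Hypothetical helper: succeed only if tesseract lists the language.
# Assumption: --list-langs prints one language code per line.
# stderr is merged in case the list is printed there.
lang_installed() {
    tesseract --list-langs 2>&1 | grep -Fqx -- "$1"
}

# Hypothetical wrapper around step 3: refuse to run if the language
# data is missing, otherwise call ocrmypdf as shown above.
ocr_pdf() {
    lang=$1
    input=$2
    output=$3
    if ! lang_installed "$lang"; then
        echo "language '$lang' is not installed for tesseract" >&2
        return 1
    fi
    ocrmypdf -l "$lang" "$input" "$output"
}
```

After sourcing the script, ocr_pdf eng test.pdf test-ocr.pdf does the same as the command in step 3, but stops early with a message if the language data is missing. lang_package debian eng prints the package name to install in that case.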

If you have any suggestions, feel free to contact me via social media or email.