This tutorial shows how to add a text layer to a scanned PDF so that you can search it and copy text from it.

  1. First, install tesseract, an OCR engine. You also need the language package for each language in your document, so the program can recognize the text properly.
    • On Debian/Ubuntu, these packages are called tesseract-ocr-<language> (e.g. tesseract-ocr-eng).
    • On Arch Linux, look for tesseract-data-<language> (e.g. tesseract-data-eng).
  2. Then install ocrmypdf. Most distros package it, but on Arch it’s not in the official repositories (it’s in the AUR).

  3. Run:
    ocrmypdf -l <language> <input PDF> <output PDF>
    # Example: ocrmypdf -l eng test.pdf test-ocr.pdf
    
    • You can run tesseract --list-langs to see which languages are installed.
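
If you do this often, step 3 and the language check can be combined in a small script. The following is only a sketch in POSIX shell: the function names has_lang and ocr_pdf are hypothetical helpers made up for this example, and it assumes a single language code is passed. Only the tesseract --list-langs and ocrmypdf -l invocations come from the steps above.

```shell
#!/bin/sh
# Sketch only: has_lang and ocr_pdf are hypothetical helper names,
# not part of tesseract or ocrmypdf.

# has_lang LANG
# Reads a language list on stdin (as printed by `tesseract --list-langs`)
# and succeeds only if LANG appears as a complete line.
has_lang() {
    # -x: match whole lines, so the header line and partial codes do not match
    # -F: treat the code as a literal string, not a regular expression
    grep -qxF -- "$1"
}

# ocr_pdf LANG INPUT OUTPUT
# Runs ocrmypdf only if the language data for LANG is installed.
# Assumes LANG is a single code such as "eng".
ocr_pdf() {
    lang=$1
    input=$2
    output=$3

    # 2>&1 because some tesseract versions print the list on stderr
    if ! tesseract --list-langs 2>&1 | has_lang "$lang"; then
        echo "Language '$lang' is not installed; install its tesseract package first." >&2
        return 1
    fi

    ocrmypdf -l "$lang" "$input" "$output"
}
```

After sourcing the file (or pasting the functions into your shell), ocr_pdf eng test.pdf test-ocr.pdf does the same thing as the command in step 3, but stops with a message when the language data is missing instead of handing an unknown language to ocrmypdf.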