Tesseract is an easy-to-use OCR (Optical Character Recognition) tool, with language data available for more than 100 languages.

Installation

Install the tesseract package with your package manager. English language data is usually installed along with it. To install other languages, look for packages whose names start with tesseract-ocr, tesseract-data or tesseract-langpack (the prefix depends on the Linux distribution).

# For Debian/Ubuntu: install the Spanish language data
sudo apt install tesseract-ocr-spa

You can also download the language data manually from https://github.com/tesseract-ocr/tessdata_best and copy the .traineddata files to Tesseract's ‘tessdata’ folder (the path varies, but is usually /usr/share/tessdata or /usr/share/tesseract/tessdata).
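The manual route can be sketched as two small shell functions. This is only a sketch under stated assumptions: find_tessdata_dir and install_lang are hypothetical helper names, the download URL assumes the file is served from the repository's "main" branch, and copying into a system folder is assumed to need sudo.

```shell
# Print the first existing tessdata folder from a list of candidate paths.
# With no arguments, the two common locations mentioned above are tried.
find_tessdata_dir() {
    if [ "$#" -eq 0 ]; then
        set -- /usr/share/tessdata /usr/share/tesseract/tessdata
    fi
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Download one language file (e.g. "spa") and copy it into the detected folder.
# Assumption: the raw file URL below (branch "main") is valid for the repository.
install_lang() {
    lang=$1
    dest=$(find_tessdata_dir) || { echo "tessdata folder not found" >&2; return 1; }
    curl -fL -o "/tmp/$lang.traineddata" \
        "https://github.com/tesseract-ocr/tessdata_best/raw/main/$lang.traineddata" &&
        sudo cp "/tmp/$lang.traineddata" "$dest/"
}
```

Running install_lang spa would then fetch spa.traineddata and place it next to the other language files. If neither candidate folder exists on your system, pass the correct path to find_tessdata_dir or copy the file by hand.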

Usage

Usage is simple: run tesseract with the input image, the basename of the output file (Tesseract adds the extension) and the language of the text (with -l). If you don’t specify a language, Tesseract assumes the text is in English.

tesseract image.jpg output-text -l spa
  • Replace the output file basename with - to write the recognized text to standard output.
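Because the command takes one image and one output basename, processing a whole folder is a matter of looping. The following is a minimal sketch, not part of Tesseract itself: output_basename and ocr_folder are hypothetical helper names, and the loop assumes the images use the .jpg extension.

```shell
# Derive the output basename from an image path: "scans/page1.jpg" -> "scans/page1".
output_basename() {
    printf '%s\n' "${1%.*}"
}

# OCR every .jpg file in a folder; the language defaults to English ("eng").
ocr_folder() {
    folder=$1
    lang=${2:-eng}
    for image in "$folder"/*.jpg; do
        [ -e "$image" ] || continue   # skip when the glob matches nothing
        tesseract "$image" "$(output_basename "$image")" -l "$lang"
    done
}
```

For example, ocr_folder scans spa would write scans/page1.txt next to scans/page1.jpg, and so on for each image in the folder.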

You can check which languages are installed with tesseract --list-langs. The language data files are usually located in /usr/share/tessdata or /usr/share/tesseract/tessdata.
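In a script, it can be handy to confirm that a language is installed before running OCR. A minimal sketch follows, assuming the language list prints one code per line (possibly after a header line); lang_listed is a hypothetical helper name.

```shell
# Succeed if the language code $1 appears as a whole line on standard input.
# -x matches the entire line, so "spa" does not match "spa_old";
# -F treats the code as a fixed string rather than a regular expression.
lang_listed() {
    grep -qxF -- "$1"
}
```

A possible use is tesseract --list-langs 2>&1 | lang_listed spa && echo "spa is installed". The 2>&1 is there because some older versions are reported to print the list on standard error.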

By default tesseract creates a text file, but you can create a searchable PDF from the image:

tesseract image.jpg outputfile -l spa pdf

For more info about tesseract usage, check its man page.

If you have any suggestions, feel free to contact me via social media or email.