Tesseract: extract text from images
Tesseract is an easy-to-use OCR (Optical Character Recognition) tool with support for many languages.
Install the tesseract package with your package manager. English language data will probably be installed automatically. To install other languages, look for tesseract-langpack (the exact package name depends on the Linux system).
# For Debian/Ubuntu, installing Spanish language data
sudo apt install tesseract-ocr-spa
Usage is very simple: just type tesseract, specifying the input image, the basename of the output file (tesseract adds the .txt extension itself) and the text language (with -l). If you don't specify a language, it will search for English text.
tesseract image.jpg output-text -l spa
- You can replace the output file basename with - to use standard output.
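Since - sends the recognized text to standard output, you can pipe it straight into other tools. Here is a minimal sketch: ocr_grep is a hypothetical helper name, and image.jpg and the search word are placeholders.

```shell
# Hypothetical helper: OCR an image to standard output and keep only
# the lines that contain a given word (case-insensitive).
ocr_grep() {
    image="$1"   # input image, e.g. image.jpg
    lang="$2"    # language code passed to -l, e.g. spa
    word="$3"    # word to search for in the recognized text
    tesseract "$image" - -l "$lang" | grep -i -- "$word"
}

# Example call (placeholders):
# ocr_grep image.jpg spa total
```

Nothing is written to disk here, which is handy when you only need a quick look at the text.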
You can check which languages are installed with tesseract --list-langs. Language data are usually located inside
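In a script, it can be useful to check for a language before running the OCR. A rough sketch, assuming --list-langs prints one language code per line after a header line; has_lang is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the given language code is listed
# by tesseract --list-langs.
has_lang() {
    # 2>&1 is a precaution in case a given version prints the list
    # on standard error instead of standard output.
    tesseract --list-langs 2>&1 | grep -qx -- "$1"
}

# Example: fall back to English when Spanish data is missing
# if has_lang spa; then lang=spa; else lang=eng; fi
```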
By default, tesseract creates a text file, but you can also create a searchable PDF from the image:
tesseract image.jpg outputfile -l spa pdf
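The same command works well in a loop when you have many scans. Here is a minimal sketch that creates one searchable PDF per .jpg file in a directory; ocr_dir_to_pdf is a hypothetical helper name and the directory is a placeholder.

```shell
# Hypothetical helper: create one searchable PDF next to each .jpg
# file found in a directory.
ocr_dir_to_pdf() {
    dir="$1"            # directory containing the images
    lang="${2:-eng}"    # language code, English if omitted
    for img in "$dir"/*.jpg; do
        [ -e "$img" ] || continue   # skip when no .jpg files match
        # Output basename is the image path without .jpg;
        # tesseract adds the .pdf extension itself.
        tesseract "$img" "${img%.jpg}" -l "$lang" pdf
    done
}

# Example call (placeholder directory):
# ocr_dir_to_pdf ./scans spa
```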
For more info about tesseract usage, check its man page.
If you have any suggestions, feel free to contact me via social media or email.