Tesseract: extract text from images
Tesseract is an easy-to-use OCR (Optical Character Recognition) tool with support for many languages.
Installation
Install the tesseract package with your package manager. English language data will probably be installed automatically. To install other languages, look for packages whose names start with tesseract-ocr, tesseract-data or tesseract-langpack (the prefix depends on the Linux distribution).
# For Debian/Ubuntu: install Spanish language data
sudo apt install tesseract-ocr-spa
You can also download the language data manually from https://github.com/tesseract-ocr/tessdata_best and copy the .traineddata files to Tesseract's tessdata folder (the path may vary, but is usually /usr/share/tessdata or /usr/share/tesseract/tessdata).
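Since the tessdata path varies between systems, a small helper can locate it before you copy anything. This is only a sketch: find_tessdata_dir is a hypothetical function name, and it checks nothing beyond the two usual locations mentioned above unless you pass other candidates.

```shell
# Print the first existing directory among the candidates given as arguments.
# With no arguments, try the two usual tessdata locations.
find_tessdata_dir() {
    if [ "$#" -eq 0 ]; then
        set -- /usr/share/tessdata /usr/share/tesseract/tessdata
    fi
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    # None of the candidates exists.
    return 1
}

# Example (not run here): copy a manually downloaded Spanish model into place.
# Assumes spa.traineddata is in the current directory.
#   sudo cp spa.traineddata "$(find_tessdata_dir)/"
```

If the function prints nothing and returns a non-zero status, your distribution probably keeps the data somewhere else.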
Usage
Usage is simple: run tesseract with the input image, the base name of the output file and the text language (with -l). Tesseract adds the .txt extension to the base name itself. If you don't specify a language, it assumes English (eng).
tesseract image.jpg output-text -l spa
- You can replace the output file base name with - to write to standard output.
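Writing to standard output makes it easy to pipe the recognized text into other tools. A minimal sketch, where ocr_stdout is a hypothetical wrapper name and the language defaults to eng when omitted:

```shell
# Print the text recognized in an image on standard output.
# Usage: ocr_stdout IMAGE [LANG]
ocr_stdout() {
    image=$1
    lang=${2:-eng}
    # "-" as the output base name makes tesseract write to stdout.
    tesseract "$image" - -l "$lang"
}

# Example (not run here): count the words recognized in a Spanish scan.
#   ocr_stdout image.jpg spa | wc -w
```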
You can check which languages are installed with tesseract --list-langs. Language data is usually located in /usr/share/tessdata.
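In a script, you may want to check for a language before running OCR. The sketch below assumes that --list-langs prints one language code per line after a header line; has_lang is a hypothetical helper name.

```shell
# Succeed if the given language code appears in tesseract's language list.
# Usage: has_lang LANG
has_lang() {
    # 2>&1 in case a release prints the list on stderr (assumption).
    # -x matches whole lines only, -F treats the code as a literal string.
    tesseract --list-langs 2>&1 | grep -qxF -- "$1"
}

# Example (not run here):
#   has_lang spa || echo "Spanish language data is not installed"
```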
By default, tesseract creates a text file, but you can also create a searchable PDF from the image (here, outputfile.pdf):
tesseract image.jpg outputfile -l spa pdf
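The same command can be repeated for several images in a loop. This is a rough sketch: images_to_pdf is a hypothetical name, and it assumes every file name has an extension, which is stripped to build the output base name.

```shell
# Create one searchable PDF per image, next to the original file.
# Usage: images_to_pdf LANG IMAGE...
images_to_pdf() {
    lang=$1
    shift
    for image in "$@"; do
        # scan.jpg -> base name "scan", so tesseract writes scan.pdf
        tesseract "$image" "${image%.*}" -l "$lang" pdf
    done
}

# Example (not run here): convert all JPEG files in the current directory.
#   images_to_pdf spa *.jpg
```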
For more info about tesseract usage, check its man page.
If you have any suggestions, feel free to contact me via social media or email.