Tesseract: extract text from images

Last updated: Nov 30, 2023

Author: Ricardo Sanz

Tesseract is an easy-to-use OCR (Optical Character Recognition) tool with support for many languages.

Installation
Usage

Installation

Install the tesseract package with your package manager. English language data is usually installed along with it. For other languages, look for packages whose names start with tesseract-ocr, tesseract-data or tesseract-langpack (the naming depends on the Linux distribution).

# For Debian/Ubuntu, installing Spanish language data
sudo apt install tesseract-ocr-spa

You can also download the language data manually from https://github.com/tesseract-ocr/tessdata_best and copy the files to Tesseract's 'tessdata' folder (the path may vary, but it is usually one of these: /usr/share/tessdata or /usr/share/tesseract/tessdata).
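Since the folder path varies between systems, a small helper can check the candidates before you copy anything. This is only a sketch: find_tessdata is a hypothetical function name, and the default candidates are just the two usual locations mentioned above, so pass your own paths if your system uses a different one.

```shell
# Hypothetical helper: print the first existing tessdata directory.
# With no arguments, try the two common locations named in this article.
find_tessdata() {
    if [ "$#" -eq 0 ]; then
        set -- /usr/share/tessdata /usr/share/tesseract/tessdata
    fi
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    # None of the candidates exists.
    return 1
}
```

With a downloaded file such as spa.traineddata, the copy could then look like: sudo cp spa.traineddata "$(find_tessdata)"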

Usage

Usage is very simple: run tesseract with the input image, the basename of the output file and the text language (with -l). If you don’t specify a language, it assumes the text is in English.

tesseract image.jpg output-text -l spa
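The same command can be repeated over a whole folder of images. The following is a minimal sketch, assuming all images are .jpg files; ocr_dir is a hypothetical function name, not something that ships with Tesseract.

```shell
# Hypothetical helper: OCR every .jpg in a directory.
# Usage: ocr_dir DIRECTORY [LANGUAGE]   (LANGUAGE defaults to eng)
ocr_dir() {
    dir=$1
    lang=${2:-eng}
    for img in "$dir"/*.jpg; do
        # Skip the unexpanded pattern when the directory has no .jpg files.
        [ -e "$img" ] || continue
        # Strip the extension to build the output basename;
        # tesseract appends .txt to it by itself.
        tesseract "$img" "${img%.jpg}" -l "$lang"
    done
}
```

For example, ocr_dir ~/scans spa would write one .txt file next to each .jpg found in ~/scans.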

You can replace the output file basename with - to send the text to standard output.
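Writing to standard output makes it easy to pipe the recognized text into other tools. As an illustration, this sketch checks whether an image mentions a given word; image_mentions is a hypothetical name and the grep-based matching is just one possible approach.

```shell
# Hypothetical helper: succeed if the text recognized in IMAGE
# contains WORD (case-insensitive, matched as a fixed string).
# Usage: image_mentions IMAGE WORD [LANGUAGE]
image_mentions() {
    # "-" as the output basename sends the text to standard output;
    # diagnostic messages on stderr are discarded.
    tesseract "$1" - -l "${3:-eng}" 2>/dev/null | grep -qiF -- "$2"
}
```

It can be used in a condition, for example: image_mentions image.jpg factura spa && echo "found"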

You can check which languages are installed with tesseract --list-langs. Language data are usually located inside /usr/share/tessdata.
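In a script, the language list can be checked before running the OCR. This is a sketch under the assumption that --list-langs prints one language code per line after a header line; has_lang is a hypothetical function name.

```shell
# Hypothetical helper: succeed if the given language code is installed.
has_lang() {
    # -x matches whole lines only, so "spa" does not match "spa_old";
    # 2>&1 also captures the list if a given version prints it to stderr.
    tesseract --list-langs 2>&1 | grep -qxF -- "$1"
}
```

For example: has_lang spa && tesseract image.jpg output-text -l spa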

By default tesseract creates a text file, but you can create a searchable PDF from the image instead by adding pdf at the end of the command:

tesseract image.jpg outputfile -l spa pdf

For more info about tesseract usage, check its man page.

If you have any suggestion, feel free to contact me via social media or email.
