Tesseract (software): amazing command-line OCR tool


Posted by jpluimers on 2022/05/13

A Twitter post blew me away by showing Tesseract (software) – Wikipedia doing perfect OCR on an image from another tweet:

[Wayback/Archive] Harrie Baken on Twitter: “Fantastic!… “ responding to

[Wayback/Archive] digiforpw on Twitter: “@nixcraft curl -s 'pbs.twimg.com/media/E9T96Q9XIAcs8xJ?format=jpg&name=large' -o - | tesseract stdin stdout | grep --color 609“

Note that the second tweet with the image is gone, but since the image is archived in the Wayback Machine, this still works:

curl -s 'https://web.archive.org/web/20210822124834if_/https://pbs.twimg.com/media/E9T96Q9XIAcs8xJ?format=jpg&name=large' -o - | tesseract stdin stdout | grep --color 609

It instantly solved this puzzle:

[Wayback/Archive.is] Dave Royal 🎧 on Twitter: “Only people with great eyesight can find the intruder GO ON!!!! 🧐… “
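For readers who prefer a script over a shell one-liner, the same three steps (download, OCR, filter) can be sketched in Python with only the standard library. This is a rough sketch, not the method from the tweets: it assumes the `tesseract` binary is on the PATH, and the helper names `fetch_image`, `ocr_bytes` and `matching_lines` are made up for illustration.

```python
import subprocess
import urllib.request
from typing import List

# Same archived image URL as in the curl command above.
IMAGE_URL = (
    "https://web.archive.org/web/20210822124834if_/"
    "https://pbs.twimg.com/media/E9T96Q9XIAcs8xJ?format=jpg&name=large"
)


def fetch_image(url: str) -> bytes:
    """Download the image into memory, like `curl -s URL -o -`."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def ocr_bytes(image: bytes) -> str:
    """Pipe the image through `tesseract stdin stdout` and return the text."""
    completed = subprocess.run(
        ["tesseract", "stdin", "stdout"],
        input=image,
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return completed.stdout.decode("utf-8", errors="replace")


def matching_lines(text: str, needle: str) -> List[str]:
    """Return the lines that contain needle, like `grep needle`."""
    return [line for line in text.splitlines() if needle in line]
```

Calling `matching_lines(ocr_bytes(fetch_image(IMAGE_URL)), "609")` should then give the lines that hold the intruder, provided the network and Tesseract are both available.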

Earlier, I quoted a bit of the SikuliX documentation that already mentioned Tesseract in RaiMan’s SikuliX: Automate what you see on a computer monitor, but I overlooked it.

It is amazing, and it has been around for so long that I feel like I have been living under a rock!

Anyway: it’s available on many platforms, and you can find the source at [Archive.is] tesseract-ocr/tesseract: Tesseract Open Source OCR Engine (main repository)

This package contains an OCR engine – libtesseract and a command line program – tesseract. Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused on line recognition, but also still supports the legacy Tesseract OCR engine of Tesseract 3 which works by recognizing character patterns. Compatibility with Tesseract 3 is enabled by using the Legacy OCR Engine mode (--oem 0). It also needs traineddata files which support the legacy engine, for example those from the tessdata repository.
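To make the engine mode switch a bit more concrete, here is a small sketch that builds a `tesseract` command line for a chosen `--oem` value. The numbering follows what `tesseract --help-oem` prints; the function name `tesseract_command`, the constant names and the example paths are my own assumptions, not part of Tesseract.

```python
from typing import List, Optional

# OCR engine modes, as listed by `tesseract --help-oem`.
OEM_LEGACY = 0           # legacy engine only (needs legacy-capable traineddata)
OEM_LSTM = 1             # neural net (LSTM) engine only
OEM_LEGACY_AND_LSTM = 2  # legacy and LSTM engines combined
OEM_DEFAULT = 3          # default, based on what is available


def tesseract_command(
    image: str,
    oem: int = OEM_DEFAULT,
    lang: str = "eng",
    tessdata_dir: Optional[str] = None,
) -> List[str]:
    """Build an argument list that prints the OCR result to stdout."""
    if oem not in (OEM_LEGACY, OEM_LSTM, OEM_LEGACY_AND_LSTM, OEM_DEFAULT):
        raise ValueError(f"unknown engine mode: {oem}")
    command = ["tesseract", image, "stdout", "--oem", str(oem), "-l", lang]
    if tessdata_dir is not None:
        # Point Tesseract at a directory with matching traineddata files.
        command += ["--tessdata-dir", tessdata_dir]
    return command
```

The resulting list can be handed to `subprocess.run`. For the legacy engine, something like `tesseract_command("scan.png", oem=OEM_LEGACY, tessdata_dir="/path/to/tessdata")` would be the starting point, where both paths are placeholders.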

At the time of writing, there were various 5.x test releases [Archive.is].

There are wrappers and ports for many programming languages, some of which offer a less basic user experience, such as a GUI.

[Wayback/Archive.is] Tesseract.js | Pure Javascript OCR for 100+ Languages! with source code at [Wayback/Archive.is] GitHub – naptha/tesseract.js: Pure Javascript OCR for 100+ Languages 📖🎉🖥

Tesseract.js wraps an emscripten port of the Tesseract OCR Engine. It works in the browser using webpack or plain script tags with a CDN and on the server with Node.js.

[Wayback/Archive.is] pytesseract · PyPI with source code at [Wayback/Archive.is] GitHub – madmaze/pytesseract: A Python wrapper for Google Tesseract.

Examples of both:

[Archive.is] Cengiz Can on Twitter: “I truly hate JS but I had to do this: jsitor.com/08Xempd1J_… “

[Wayback] JavaScript Tesseract.recognize example

(function () {
  // Element that shows progress first and the recognized text afterwards.
  let result = document.getElementById("result");
  Tesseract.recognize(
    'https://web.archive.org/web/20210822193147/https://pbs.twimg.com/media/E9T96Q9XIAcs8xJ?format=jpg&name=large',
    'eng',
    { logger: m => result.innerHTML = "Working..." }
  ).then(({ data: { text } }) => {
    // Replace every occurrence of 609 in the recognized text.
    result.innerHTML = text.replace(/(609)/g, "...");
  });
})();

[Wayback/Archive.is] Riccardo Pietri on Twitter: “Or people with pytesseract ;-)… “
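A pytesseract version of the JavaScript example could look roughly like the sketch below. It is not the code from the tweet: it assumes `pytesseract` and Pillow are installed next to the Tesseract binary, and the helper names `ocr_file` and `mark`, as well as the bracket markers, are made up for illustration.

```python
import re


def ocr_file(path: str, lang: str = "eng") -> str:
    """Run Tesseract on an image file through pytesseract."""
    # Imported lazily so `mark` below works without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with Image.open(path) as image:
        return pytesseract.image_to_string(image, lang=lang)


def mark(text: str, needle: str, left: str = "[", right: str = "]") -> str:
    """Wrap every occurrence of needle in markers so it stands out."""
    return re.sub(
        re.escape(needle),
        lambda match: f"{left}{match.group(0)}{right}",
        text,
    )
```

With the puzzle image saved locally under a name of your choice, `print(mark(ocr_file("puzzle.jpg"), "609"))` would print the recognized text with the intruder bracketed.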

–jeroen

Via:

I found like 3 intruders. Good exercise for brain and eye too i guess. https://t.co/eqfpehBxMh

— nixCraft 🐧 (@nixcraft) August 22, 2021

This entry was posted on 2022/05/13 at 12:00 and is filed under C++, Color (software development), Development, OCR, Power User, Software Development, Tesseract.
