Tesseract (software): amazing command-line OCR tool
Posted by jpluimers on 2022/05/13
A Twitter post blew me away by showing Tesseract (software) – Wikipedia doing perfect OCR on an image from another Twitter post:
[Wayback/Archive] Harrie Baken on Twitter: “Fantastic!… “ responding to
[Wayback/Archive] digiforpw on Twitter: “@nixcraft
curl -s 'pbs.twimg.com/media/E9T96Q9XIAcs8xJ?format=jpg&name=large' -o - | tesseract stdin stdout | grep --color 609
”
Note that the second tweet, the one with the image, is gone, but since the image is in the Wayback Machine, this still works:
curl -s 'https://web.archive.org/web/20210822124834if_/https://pbs.twimg.com/media/E9T96Q9XIAcs8xJ?format=jpg&name=large' -o - | tesseract stdin stdout | grep --color 609
It instantly solved this puzzle:
[Wayback/Archive.is] Dave Royal 🎧 on Twitter: “Only people with great eyesight can find the intruder GO ON!!!! 🧐… “
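The one-liner above can be wrapped in a small shell function. This is only a minimal sketch, assuming curl, tesseract and grep are on the PATH; the name ocr_grep is a hypothetical helper of mine, not part of Tesseract. Unlike the one-liner, it saves the image to a temporary file first, so a failed download is reported instead of feeding an error page to the OCR engine.

```shell
#!/bin/sh
# ocr_grep URL PATTERN
# Download an image, OCR it with tesseract, and print the lines matching PATTERN.
# ocr_grep is a hypothetical helper name, not something Tesseract ships.
# Plain POSIX sh has no local variables, so the variables below are global.
ocr_grep() {
    url=$1
    pattern=$2
    img=$(mktemp) || return 1

    # -f: fail on HTTP errors, -s: silent, -L: follow redirects
    if ! curl -fsL "$url" -o "$img"; then
        echo "download failed: $url" >&2
        rm -f "$img"
        return 1
    fi

    # "stdin" and "stdout" make tesseract read the image from standard input
    # and write the recognized text to standard output.
    # Diagnostics that tesseract prints on stderr are discarded.
    tesseract stdin stdout < "$img" 2>/dev/null | grep -- "$pattern"
    status=$?   # exit status of grep: 0 when at least one line matched

    rm -f "$img"
    return $status
}

# Example (needs network access):
# ocr_grep 'https://web.archive.org/web/20210822124834if_/https://pbs.twimg.com/media/E9T96Q9XIAcs8xJ?format=jpg&name=large' 609
```

The function returns the exit status of grep, so it can be used directly in an if statement to check whether the text was found.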
Earlier, I quoted a bit of the SikuliX documentation in RaiMan’s SikuliX: Automate what you see on a computer monitor, which already mentioned Tesseract, but I overlooked it.
It is amazing, and it has been around for so long that I feel like I have been living under a rock!
Anyway: it’s available on many platforms, and you can find the source at [Archive.is] tesseract-ocr/tesseract: Tesseract Open Source OCR Engine (main repository)
This package contains an OCR engine - libtesseract and a command line program - tesseract. Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused on line recognition, but also still supports the legacy Tesseract OCR engine of Tesseract 3 which works by recognizing character patterns. Compatibility with Tesseract 3 is enabled by using the Legacy OCR Engine mode (--oem 0). It also needs traineddata files which support the legacy engine, for example those from the tessdata repository.
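To make the legacy mode from the quote concrete, here is a small sketch of how it could be invoked from the command line. The function name ocr_legacy and the directory in the usage example are hypothetical; the sketch assumes that directory holds traineddata files that support the legacy engine (for example from the tessdata repository, as the quote says).

```shell
#!/bin/sh
# ocr_legacy IMAGE TESSDATA_DIR [LANG]
# Run tesseract with the legacy (Tesseract 3 style) engine only.
# ocr_legacy is a hypothetical helper name; TESSDATA_DIR is assumed to contain
# LANG.traineddata files that include legacy engine data.
ocr_legacy() {
    image=$1
    tessdata_dir=$2
    lang=${3:-eng}   # default to English when no language is given

    # --oem 0         selects the Legacy OCR Engine mode
    # --tessdata-dir  points tesseract at the directory with traineddata files
    # -l              selects the language
    # "stdout" as output base writes the recognized text to standard output
    tesseract "$image" stdout --oem 0 --tessdata-dir "$tessdata_dir" -l "$lang"
}

# Example (the file name and path are placeholders):
# ocr_legacy puzzle.jpg "$HOME/tessdata"
```

If the traineddata files in that directory only support the LSTM engine, I would expect tesseract to fail with an error rather than fall back silently, so it is worth checking which files are installed.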
At the time of writing, there were various 5.x test releases [Archive.is].
There are wrappers and ports for it in many programming languages, some of which offer a less bare-bones user experience, for instance a GUI.
- [Wayback/Archive.is] Tesseract.js | Pure Javascript OCR for 100+ Languages! with source code at [Wayback/Archive.is] GitHub – naptha/tesseract.js: Pure Javascript OCR for 100+ Languages 📖🎉🖥
Tesseract.js wraps an emscripten port of the Tesseract OCR Engine. It works in the browser using webpack or plain script tags with a CDN and on the server with Node.js.
- [Wayback/Archive.is] pytesseract · PyPI with source code at [Wayback/Archive.is] GitHub – madmaze/pytesseract: A Python wrapper for Google Tesseract.
Examples of both:
- [Archive.is] Cengiz Can on Twitter: “I truly hate JS but I had to do this:
jsitor.com/08Xempd1J_
… “
[Wayback] JavaScript Tesseract.recognize example
(function () {
  let result = document.getElementById("result");
  Tesseract.recognize(
    'https://web.archive.org/web/20210822193147/https://pbs.twimg.com/media/E9T96Q9XIAcs8xJ?format=jpg&name=large',
    'eng',
    { logger: m => result.innerHTML = "Working..." }
  ).then(({ data: { text } }) => {
    result.innerHTML = text.replace(/(609)/g, "...")
  })
})();
- [Wayback/Archive.is] Riccardo Pietri on Twitter: “Or people with pytesseract ;-)… “
–jeroen