s3-ocr: Extract text from PDF files stored in an S3 bucket

Posted by jpluimers on 2024/07/16

For my link archive: [Wayback/Archive] s3-ocr: Extract text from PDF files stored in an S3 bucket

One reason to use it is the archival of books. Even (or maybe especially) in IT, books already have historic value, especially in narrower fields, where they are often neither available in the Internet Archive nor scanned by Google Books.

Via/related:

[Wayback/Archive] Simon Willison on Twitter: “Here’s an earlier TIL I wrote when I was still figuring out how to use Textract from Python code: …”
- [Wayback/Archive] Running OCR against a PDF file with AWS Textract | Simon Willison’s TILs
  
  [Wayback/Archive] Textract is the AWS OCR API. It’s very good – I’ve fed it hand-written notes from the 1890s and it read them better than I could.
  
  It can be run directly against JPEG or PNG images up to 5MB, but if you want to run OCR against a PDF file you have to first upload it to an S3 bucket.
  
  Update 30th June 2022: I used what I learned in this TIL to build s3-ocr, a command line utility for running OCR against PDFs in an S3 bucket.
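
The split described in the TIL (small JPEG/PNG images go straight to Textract, PDFs go through S3 first) can be sketched roughly as below. This is a minimal sketch, not the code from the TIL or from s3-ocr: `choose_route`, `ocr_image_bytes` and `start_pdf_job` are hypothetical helper names, the 5MB limit is taken from the quote above and assumed to mean 5 * 1024 * 1024 bytes, and the two boto3 helpers assume that boto3 is installed and AWS credentials and a region are configured.

```python
from pathlib import Path

# Limit quoted in the TIL; interpreting "5MB" as binary megabytes is an assumption.
MAX_DIRECT_IMAGE_BYTES = 5 * 1024 * 1024
DIRECT_IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def choose_route(filename, size_bytes):
    """Decide how a file would reach Textract (hypothetical helper).

    Returns "direct" for JPEG/PNG images within the size limit,
    "s3" for PDF files, and "not covered" for anything else.
    """
    suffix = Path(filename).suffix.lower()
    if suffix in DIRECT_IMAGE_SUFFIXES and size_bytes <= MAX_DIRECT_IMAGE_BYTES:
        return "direct"
    if suffix == ".pdf":
        return "s3"
    return "not covered"


def ocr_image_bytes(data):
    """Run synchronous OCR on image bytes and return the detected lines."""
    import boto3  # imported lazily so choose_route works without boto3

    client = boto3.client("textract")
    response = client.detect_document_text(Document={"Bytes": data})
    return [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]


def start_pdf_job(bucket, key):
    """Start an asynchronous OCR job for a PDF already uploaded to S3."""
    import boto3

    client = boto3.client("textract")
    response = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return response["JobId"]  # poll get_document_text_detection with this ID


if __name__ == "__main__":
    for name, size in [("scan.png", 200_000), ("book.pdf", 30_000_000)]:
        print(name, "->", choose_route(name, size))
```

The PDF route is asynchronous: the job ID has to be polled until the job finishes, which is the kind of bookkeeping s3-ocr is meant to take off your hands.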

[Wayback/Archive] simonw/s3-ocr: Tools for running OCR against files stored in S3
[Wayback/Archive] s3-ocr – a tool for Datasette

Tools for running OCR against files stored in S3

Background on this project: s3-ocr: Extract text from PDF files stored in an S3 bucket
[Wayback/Archive] Simon Willison on Twitter: “I built a new tool: s3-ocr, a utility for running OCR (with Amazon Textract) against every PDF file in a S3 bucket and getting the results back as a searchable SQLite database” (full thread at [Wayback/Archive] Thread by @simonw on Thread Reader App)
- [Wayback/Archive] s3-ocr: Extract text from PDF files stored in an S3 bucket
  
  I’ve released [Wayback/Archive] s3-ocr, a new tool that runs Amazon’s [Wayback/Archive] Textract OCR text extraction against PDF files in an S3 bucket, then writes the resulting text out to a SQLite database with full-text search configured so you can run searches against the extracted data.
  
  You can search through a demo of 697 pages of OCRd text at [Wayback/Archive] s3-ocr-demo.datasette.io/pages/pages.
  
  Textract works extremely well: it handles dodgy scanned PDFs full of typewritten code and reads handwritten text better than I can! It [Wayback/Archive] charges $1.50 per thousand pages processed.
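
To get a feel for what "a SQLite database with full-text search configured" gives you, here is a small self-contained sketch using Python's built-in `sqlite3` module. It does not use the actual s3-ocr schema: the `pages_fts` table, its columns and the sample rows are made up for illustration, and it assumes your SQLite build includes the FTS5 extension (most current Python builds do).

```python
import sqlite3

# Illustrative rows only: (path, page number, OCR text). Not real s3-ocr output.
SAMPLE_PAGES = [
    ("books/delphi-handbook.pdf", 1, "Object Pascal units and the Delphi compiler"),
    ("books/delphi-handbook.pdf", 2, "Database access with datasets"),
    ("notes/letter-1890.pdf", 1, "Handwritten letter about the harvest"),
]


def build_demo_db(pages):
    """Create an in-memory database with an FTS5 index over the page text."""
    db = sqlite3.connect(":memory:")
    # path and page are stored but not indexed, so only the text is searchable.
    db.execute(
        "CREATE VIRTUAL TABLE pages_fts "
        "USING fts5(path UNINDEXED, page UNINDEXED, text)"
    )
    db.executemany(
        "INSERT INTO pages_fts (path, page, text) VALUES (?, ?, ?)", pages
    )
    return db


def search(db, query):
    """Return (path, page) pairs for pages matching the query, best match first."""
    rows = db.execute(
        "SELECT path, page FROM pages_fts WHERE pages_fts MATCH ? ORDER BY rank",
        (query,),
    )
    return [(path, page) for path, page in rows]


if __name__ == "__main__":
    db = build_demo_db(SAMPLE_PAGES)
    print(search(db, "delphi"))
```

With a real s3-ocr database you would open the file it produced instead of `:memory:` and query whatever tables it created; Datasette offers the same kind of search through a web interface, as in the demo linked above.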

Delphi books you should have or have read
[Wayback/A] Jeroen Wiert Pluimers on Twitter: “@JenMsft Some books too.” (note each row is three books deep)
- Each row is three books deep (the top row is two binders deep)
- This bookshelf is just one level deep:

–jeroen

This entry was posted on 2024/07/16 at 12:00 and is filed under Amazon S3, AWS Amazon Web Services, Cloud, Cloud Apps, Development, Infrastructure, Internet, Power User, Python, Scripting, Software Development.

The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests
