s3-ocr: Extract text from PDF files stored in an S3 bucket
Posted by jpluimers on 2024/07/16
For my link archive: [Wayback/Archive] s3-ocr: Extract text from PDF files stored in an S3 bucket
One reason this interests me is the archival of books. Even (or maybe especially) in IT, books already have historic value, especially in narrower fields, where they are often not available in the Internet Archive and have not been scanned by Google Books.
Via/related:
- [Wayback/Archive] Simon Willison on Twitter: “Here’s an earlier TIL I wrote when I was still figuring out how to use Textract from Python code: …”
- [Wayback/Archive] Running OCR against a PDF file with AWS Textract | Simon Willison’s TILs
[Wayback/Archive] Textract is the AWS OCR API. It’s very good – I’ve fed it hand-written notes from the 1890s and it read them better than I could.
It can be run directly against JPEG or PNG images up to 5MB, but if you want to run OCR against a PDF file you have to first upload it to an S3 bucket.
Update 30th June 2022: I used what I learned in this TIL to build s3-ocr, a command line utility for running OCR against PDFs in an S3 bucket.
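The flow described in the TIL (upload the PDF to S3, then ask Textract to process it) can be sketched roughly as below. This is a minimal sketch, not code from the TIL or from s3-ocr: the function name `ocr_pdf_in_s3` and its parameters are made up for illustration, and it assumes `textract` is a boto3 Textract client (for instance `boto3.client("textract")`) with credentials that can read the bucket. PDF processing is asynchronous, so the sketch starts a job, polls for completion, and then follows `NextToken` to collect every page of results.

```python
import time


def ocr_pdf_in_s3(textract, bucket, key, poll_seconds=5.0, sleep=time.sleep):
    """Return the text lines Textract detects in a PDF stored in S3.

    `textract` is assumed to be a boto3 Textract client; it is passed in
    so the function can be exercised without AWS access.
    """
    job = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = job["JobId"]

    # PDF jobs run asynchronously: poll until the job leaves IN_PROGRESS.
    while True:
        response = textract.get_document_text_detection(JobId=job_id)
        if response["JobStatus"] != "IN_PROGRESS":
            break
        sleep(poll_seconds)

    if response["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(
            f"Textract job {job_id} ended as {response['JobStatus']}"
        )

    # Results are paginated: keep following NextToken until it is absent.
    lines = []
    while True:
        lines.extend(
            block["Text"]
            for block in response.get("Blocks", [])
            if block["BlockType"] == "LINE"
        )
        next_token = response.get("NextToken")
        if not next_token:
            return lines
        response = textract.get_document_text_detection(
            JobId=job_id, NextToken=next_token
        )
```

With real credentials the call would look something like `ocr_pdf_in_s3(boto3.client("textract"), "my-bucket", "scans/book.pdf")`, where the bucket and key are placeholders. Fixed-interval polling keeps the sketch short; a production setup might prefer Textract's completion notifications instead.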
- [Wayback/Archive] simonw/s3-ocr: Tools for running OCR against files stored in S3
- [Wayback/Archive] s3-ocr – a tool for Datasette
Tools for running OCR against files stored in S3
Background on this project: s3-ocr: Extract text from PDF files stored in an S3 bucket
- [Wayback/Archive] Simon Willison on Twitter: “I built a new tool: s3-ocr, a utility for running OCR (with Amazon Textract) against every PDF file in a S3 bucket and getting the results back as a searchable SQLite database” (full thread at [Wayback/Archive] Thread by @simonw on Thread Reader App)
- [Wayback/Archive] s3-ocr: Extract text from PDF files stored in an S3 bucket
I’ve released [Wayback/Archive] s3-ocr, a new tool that runs Amazon’s [Wayback/Archive] Textract OCR text extraction against PDF files in an S3 bucket, then writes the resulting text out to a SQLite database with full-text search configured so you can run searches against the extracted data.
You can search through a demo of 697 pages of OCRd text at [Wayback/Archive] s3-ocr-demo.datasette.io/pages/pages.
Textract works extremely well: it handles dodgy scanned PDFs full of typewritten code and reads handwritten text better than I can! It [Wayback/Archive] charges $1.50 per thousand pages processed.
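To give an idea of what "a SQLite database with full-text search configured" can look like, here is a small sketch using Python's built-in `sqlite3` module and an FTS5 virtual table. It is an illustration only: the table name `pages`, its columns, and the helper names `build_search_db` and `search` are assumptions, not the schema s3-ocr actually writes, and it assumes the SQLite library behind Python was compiled with FTS5 (common, but not guaranteed).

```python
import sqlite3


def build_search_db(pages, path=":memory:"):
    """Store (pdf_path, page_number, text) rows in an FTS5 table.

    The table and column names are hypothetical, chosen for this sketch.
    """
    db = sqlite3.connect(path)
    # Only `text` is tokenized for searching; the other columns are
    # stored alongside it but kept out of the full-text index.
    db.execute(
        "CREATE VIRTUAL TABLE pages "
        "USING fts5(path UNINDEXED, page UNINDEXED, text)"
    )
    db.executemany(
        "INSERT INTO pages (path, page, text) VALUES (?, ?, ?)", pages
    )
    db.commit()
    return db


def search(db, query):
    """Return (path, page) for matching pages, best match first."""
    return db.execute(
        "SELECT path, page FROM pages WHERE pages MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()


if __name__ == "__main__":
    db = build_search_db([
        ("books/delphi.pdf", 1, "Typewritten code listing for a unit"),
        ("books/delphi.pdf", 2, "Handwritten notes in the margin"),
    ])
    print(search(db, "handwritten"))
```

Passing a file name instead of `":memory:"` would produce a database file that a tool such as Datasette could then open.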
- Delphi books you should have or have read
- [Wayback/Archive] Jeroen Wiert Pluimers on Twitter: “@JenMsft Some books too.” (note each row is three books deep)
- Each row is three books deep (the top row is two binders deep)
- This bookshelf is just one level deep:
–jeroen









