Show HN: Adventures in OCR

https://blog.medusis.com/38_Adventures+in+OCR.html
Oh wow! I've worked on turning PAIP (Paradigms of Artificial Intelligence Programming) from a book into a bunch of Markdown files, but that's "only" about a thousand pages, compared to the roughly 27,000 pages of all those volumes. I have some advice, possibly helpful, possibly not.

Getting higher-quality scans could save you some headaches. Check the Internet Archive, or get library copies and the right camera setup.

ScanTailor might help; it lets you semi-automate a chunk of the cleanup, with interactive adjustments. I don't know how its deskewing compares to ImageMagick's. The signature marks might get filtered out at this stage.
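In case it helps, a minimal deskew pass with ImageMagick driven from Python could look like this (a sketch: it assumes ImageMagick 7's `magick` binary is installed, and the 40% threshold is just a common starting point):

    import subprocess
    from pathlib import Path

    def deskew(src: Path, dst: Path, threshold: str = "40%") -> None:
        # ImageMagick's -deskew straightens mildly rotated pages;
        # +repage drops the canvas offset left over from the rotation.
        # (With ImageMagick 6, replace "magick" with "convert".)
        subprocess.run(
            ["magick", str(src), "-deskew", threshold, "+repage", str(dst)],
            check=True,
        )

    Path("deskewed").mkdir(exist_ok=True)
    for page in sorted(Path("scans").glob("*.png")):
        deskew(page, Path("deskewed") / page.name)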

I wrote out some of my process for handling scans here: https://github.com/norvig/paip-lisp/releases/tag/v1.2 . Maybe I should blog about it.

If you get to the point of collaborative proofreading, I highly recommend Semantic Linefeeds - each sentence gets its own line. https://rhodesmill.org/brandon/2012/one-sentence-per-line/ I got there by:

* giving each paragraph its own line

* then, adding a linefeed at sentence-ending punctuation, possibly followed by quotation marks or parentheses (it's been a while; a rough sketch of that pass is below)
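Something like this, assuming each paragraph is already on one line (real sentence splitting needs more care around abbreviations, but this is the idea):

    import re

    # Characters that may trail sentence-ending punctuation:
    # closing quotes, parentheses, guillemets.
    CLOSERS = re.escape('\'")»”')

    def semantic_linefeeds(paragraph: str) -> str:
        # One sentence per line: break after . ! or ?, keeping any closers attached.
        return re.sub(rf'([.!?][{CLOSERS}]*)\s+', r'\1\n', paragraph.strip())

    print(semantic_linefeeds(
        'First sentence. Second one (with an aside). "A quoted third?" Done.'
    ))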

You could try Aryn DocParse, which segments your documents first before running OCR: https://www.aryn.ai/ (full disclosure: I work there).
Out of curiosity, I submitted the first 200 pages of the PDF he used to my new tool, which I also submitted to Show HN today [0] (fixmydocuments.com), and it generated the following with no further interaction beyond uploading the PDF file:

https://fixmydocuments.com/api/hosted/m-moires-de-saint-simo...

I think it's not a bad result, and any minor imperfections could easily be revised in the markdown. My feature for turning the document into presentation slides got a bit confused by the French text, so some slides ended up translated into English. But again, it wouldn't be hard to revise the slide contents with ChatGPT or Claude to make them all either French or English:

https://fixmydocuments.com/api/hosted/m-moires-de-saint-simo...

[0] https://news.ycombinator.com/item?id=42453651

OCR to original structure is a really fun problem! I did something similar during an internship for newspapers, before LLM vision models, and it ended up being a bunch of interval problems for re-aligning and formatting the extracted text. I found that Azure's OCR model was the most accurate by bounding box, which helped a lot.
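To illustrate the kind of interval problem involved, here is a minimal sketch that groups OCR bounding boxes into columns and sorts them into reading order (the (x0, y0, x1, y1, text) tuple format and the pixel gap are assumptions, not Azure's actual output format):

    from typing import List, Tuple

    Box = Tuple[float, float, float, float, str]  # x0, y0, x1, y1, text

    def reading_order(boxes: List[Box], gap: float = 20.0) -> List[str]:
        # Group boxes into columns whose x-intervals overlap (with a little
        # slack), then read each column top-to-bottom, columns left-to-right.
        columns: List[List[Box]] = []
        for box in sorted(boxes, key=lambda b: b[0]):
            for col in columns:
                if box[0] <= max(b[2] for b in col) + gap:
                    col.append(box)
                    break
            else:
                columns.append([box])
        columns.sort(key=lambda col: min(b[0] for b in col))
        return [b[4] for col in columns
                for b in sorted(col, key=lambda b: b[1])]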

Funny how vision models can now almost one-shot it, modulo some hallucination issues. Some of the research back then (~2020) was starting to use vision models for layout generation.

I’ve used Surya (https://github.com/VikParuchuri/surya) before. It is very good (on par with Google Vision, potentially better layout analysis), but yours is a challenging use case. I wonder if it would be useful.
You could upload the books to the Internet Archive and let their OCR pipeline take a try. It is (or at least was) written around Abbyy. Results weren't great but they were a start.

I wonder what eventually happened with OCRopus, which was supposed to help with page segmentation. I was a bit disappointed to see that this article used Google Vision as its OCR engine; I was hoping for something self-hosted.
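For a self-hosted baseline there is always Tesseract via pytesseract (a sketch; quality on old French typography will likely trail Google Vision, and the `fra` language data has to be installed):

    from PIL import Image
    import pytesseract

    page = Image.open("page_0001.png")

    # Plain text, French language model.
    text = pytesseract.image_to_string(page, lang="fra")

    # hOCR keeps bounding boxes and paragraph structure, which helps later
    # when rebuilding page zones.
    hocr = pytesseract.image_to_pdf_or_hocr(page, lang="fra", extension="hocr")
    with open("page_0001.hocr", "wb") as f:
        f.write(hocr)

    print(text[:500])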

> The "best" models just made stuff up to meet the requirements. They lied in three ways:

> The main difficulty of this project lies in correctly identifying page zones; wouldn't it be possible to properly find the zones during the OCR phase itself, instead of rebuilding them afterwards?

For anyone curious, try LLMWhisperer [1] for OCR. It doesn't use LLMs, so there are no hallucination side effects. It also preserves the layout of the input document for more context and clarity.

[1] https://unstract.com/llmwhisperer/

Examples of extracting complex layout:

https://imgur.com/a/YQMkLpA

https://imgur.com/a/NlZOrtX

https://imgur.com/a/htIm6cf

If it's to be really 100% automated, I don't think there's much of a solution besides recreating the exact layout, using the very same fonts, and then superimposing the OCR'd-and-re-rendered text on the original scan to see if they're close enough. This means identifying the various fonts, sizes, and styles (italic, bold, etc.).

But we'll get there eventually with AIs. We'll be able to say: "Find me the exact fonts, styles, etc., and re-render it using InDesign (or LaTeX or whatever you fancy), then compare with the source and see what you got wrong. Rinse and repeat."

We'll eventually have the ability to do just that.
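A crude version of that superimpose-and-compare loop can already be sketched with Pillow (the font file, font size, and text position here are pure guesses; a real pipeline would re-render zone by zone):

    from PIL import Image, ImageChops, ImageDraw, ImageFont

    def render_candidate(text, size, font_path, font_size):
        # Re-render the OCR'd text with a guessed font so it can be
        # compared against the original scan.
        img = Image.new("L", size, color=255)
        font = ImageFont.truetype(font_path, font_size)
        ImageDraw.Draw(img).multiline_text((50, 50), text, fill=0, font=font)
        return img

    def mismatch(scan, rendered):
        # Mean absolute pixel difference: 0.0 = identical, 1.0 = inverted.
        diff = ImageChops.difference(scan.convert("L"), rendered)
        total = sum(i * count for i, count in enumerate(diff.histogram()))
        return total / (255 * scan.width * scan.height)

    scan = Image.open("page_0001.png")
    candidate = render_candidate(open("page_0001.txt").read(), scan.size,
                                 "Garamond.ttf", 28)
    print(f"mismatch score: {mismatch(scan, candidate):.3f}")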

Getting the footnotes right is going to be really tricky. Sometimes I couldn't even read the superscript numbering on the original scans. And that was after zooming in to the max.

Reliably identifying the superscript locations should be enough since they are in the same order as the footnotes.
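Since the markers and the footnotes share the same order, pairing them can be as simple as zipping the two lists, with a count check to catch trouble (a sketch over hypothetical per-page lists):

    def attach_footnotes(markers, footnotes):
        # markers: superscript markers found in the body, in reading order
        # footnotes: footnote texts found at the bottom of the page, in order
        if len(markers) != len(footnotes):
            # A mismatch usually means a missed superscript or a footnote
            # spilling onto the next page; flag it for review instead of guessing.
            raise ValueError(f"{len(markers)} markers vs {len(footnotes)} footnotes")
        return list(zip(markers, footnotes))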

It's a little early for feature requests... but I would love to see an EPUB edition! It shouldn't be too hard once the hard work of getting the data structured is done.

"A very crude method would be to remove the last line every 16 pages but that would not be very robust if there were missing scans or inserts, etc. I prefer to check every last line of every page for the content of the signature mark, and measuring a Levenshtein distance to account for OCR errors."

I'm curious: did you also check whether the signature mark was indeed found every 16 pages? Were there any scans missing?

Great project btw!
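For readers following along, the check described in the quote could look roughly like this (assuming one OCR text file per page and the python-Levenshtein package; the signature text and threshold are made-up placeholders):

    import glob
    import Levenshtein  # pip install python-Levenshtein

    SIGNATURE = "MÉMOIRES DE SAINT-SIMON."  # placeholder signature-mark text
    MAX_DISTANCE = 5                        # tolerance for OCR errors

    def is_signature(last_line: str) -> bool:
        return Levenshtein.distance(last_line.upper(), SIGNATURE) <= MAX_DISTANCE

    pages = [open(p, encoding="utf-8").read()
             for p in sorted(glob.glob("pages/*.txt"))]
    hits = [i for i, page in enumerate(pages)
            if page.strip() and is_signature(page.rstrip().splitlines()[-1])]

    # Consistency check: marks should fall 16 pages apart; any other gap
    # hints at a missing scan or an insert.
    gaps = [b - a for a, b in zip(hits, hits[1:])]
    print([(hits[i], g) for i, g in enumerate(gaps) if g != 16])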

For the human review part: maybe crowdsource it? Make the book available for reading online, with a UI to submit corrections (Wikipedia-style).
Use a near-SoTA VLM like Gemini 2.0 Flash on the images. It'll zero-shot the pages into de-hyphenated text in semantic HTML with linked footnotes.
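A sketch of that call with the google-generativeai Python SDK (the model name is taken from the comment above and may need updating; the prompt is just an example):

    import google.generativeai as genai
    from PIL import Image

    genai.configure(api_key="...")  # your API key
    model = genai.GenerativeModel("gemini-2.0-flash")

    prompt = (
        "Transcribe this scanned page into semantic HTML. "
        "Join hyphenated line breaks, keep italics, and link footnote "
        "markers to a list of footnotes at the end."
    )
    page = Image.open("page_0001.png")
    response = model.generate_content([prompt, page])
    print(response.text)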
> After these experiments, it's clear some human review is needed for the text, including spelling fixes and footnote placement.

I just use ChatGPT for spelling fixes (e.g. when rewriting articles). You just have to instruct it NOT to auto-rephrase the article.
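In API form that instruction is just a strict system prompt (a sketch with the openai Python package; the model name is a placeholder):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    page_text = open("page_0001.txt", encoding="utf-8").read()

    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any recent chat model works
        messages=[
            {"role": "system", "content":
                "Fix spelling and obvious OCR errors only. Do NOT rephrase, "
                "reorder, translate, or modernize the text. Keep punctuation "
                "and line breaks exactly as given."},
            {"role": "user", "content": page_text},
        ],
    )
    print(resp.choices[0].message.content)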

I'm just trying to become literate in AI. Does anyone have tips or links on how I could use embedding vectors to build a RAG?
> correclty parsing the words

In context: heh.

(I know, typo not OCR-o, but still...)

{"deleted":true,"id":42443101,"parent":42443022,"time":1734455476,"type":"comment"}