Show HN: Adventures in OCR
https://blog.medusis.com/38_Adventures+in+OCR.html
Getting higher quality scans could save you some headaches. Check the Internet Archive. Or, get library copies, and the right camera setup.
Scantailor might help; it lets you semi-automate a chunk of things, with interactive adjustments. I don't know how its deskewing would compare to ImageMagick. The signature marks might be filtered out here.
I wrote out some of my process for handling scans here - https://github.com/norvig/paip-lisp/releases/tag/v1.2 . I maybe should blog about it.
If you get to the point of collaborative proofreading, I highly recommend Semantic Linefeeds - each sentence gets its own line. https://rhodesmill.org/brandon/2012/one-sentence-per-line/ I got there by:
* giving each paragraph its own line
* then, linefeed at punctuation, maybe with quotation marks and parentheses? It's been a while
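For anyone who wants to try the two steps above, here is a minimal Python sketch. The function name `semantic_linefeeds` and the exact punctuation rule are my own assumptions, not the commenter's actual process; it breaks after `.`, `!` or `?` (plus any closing quote or bracket) and will wrongly split on abbreviations such as "e.g.", so expect to review the result by hand.

```python
import re

# Assumed rule: a sentence ends at . ! or ?, optionally followed by closing
# quotes or brackets, and then whitespace. Abbreviations are not handled.
SENTENCE_END = re.compile(r"""([.!?]["')\]]*)\s+""")


def semantic_linefeeds(text: str) -> str:
    """Reflow text so that each sentence sits on its own line."""
    # Step 1: give each paragraph its own line
    # (paragraphs are assumed to be separated by blank lines).
    paragraphs = [
        " ".join(p.split())
        for p in re.split(r"\n\s*\n", text)
        if p.strip()
    ]
    # Step 2: put a linefeed after sentence-ending punctuation.
    broken = [SENTENCE_END.sub(r"\1\n", p) for p in paragraphs]
    # Keep a blank line between paragraphs.
    return "\n\n".join(broken)
```

Calling `semantic_linefeeds(open("chapter.txt").read())` (the file name is hypothetical) gives text where a proofreading diff touches only the sentence that changed.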
https://fixmydocuments.com/api/hosted/m-moires-de-saint-simo...
I think it's not a bad result, and any minor imperfections could be revised easily in the markdown. My feature to turn the document into presentation slides got a bit confused because of the French language, so some slides ended up getting translated into English. But again, it wouldn't be hard to revise the slide contents using ChatGPT or Claude to make them all either French or English:
https://fixmydocuments.com/api/hosted/m-moires-de-saint-simo...
Funny how vision models would almost be able to one-shot it, modulo some hallucination issues. Some of the research back then (~2020) was starting to use vision models for layout generation.
I wonder what eventually happened with Ocropus, which was supposed to help with page segmentation. I was a bit disappointed to see that this article used Google Vision as its OCR engine. I was hoping for something self-hosted.
> The main difficulty of the is project lies in correctly identifying page zones; wouldn't it be possible to properly find the zones during the OCR phase itself instead of rebuilding them afterwards?
Anyone curious, try LLMWhisperer[1] for OCR. It doesn't use LLMs, so no hallucination side effects. It also preserves the layout of the input document for more context and clarity.
[1] https://unstract.com/llmwhisperer/
Examples of extracting complex layout:
But we'll get there eventually with AIs. We'll be able to tell it: "Find me the exact font, styles, etc. And re-render it using InDesign (or LaTeX or whatever you fancy), then compare with the source and see what you got wrong. Rinse and repeat".
We'll eventually have the ability to do just that.
Reliably identifying the superscript locations should be enough since they are in the same order as the footnotes.
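A rough sketch of that idea in Python, assuming the superscript markers on a page have already been located as `(line, column)` positions and the footnotes have been extracted in the order they are printed. The function name `link_footnotes` and the data shapes are hypothetical, not something taken from the article.

```python
def link_footnotes(marker_positions, footnotes):
    """Pair the i-th superscript marker (in reading order) with the i-th footnote.

    marker_positions: (line, column) tuples for one page, in any order.
    footnotes: footnote texts in the order they appear at the bottom of the page.
    """
    if len(marker_positions) != len(footnotes):
        # A mismatch means a marker or a footnote was missed by the OCR step.
        raise ValueError(
            f"{len(marker_positions)} markers but {len(footnotes)} footnotes"
        )
    # Sorting (line, column) tuples gives top-to-bottom, left-to-right order.
    ordered = sorted(marker_positions)
    return list(zip(ordered, footnotes))
```

The count check is the useful part: if the numbers disagree on a page, that page needs a manual look rather than a guess.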
It's a little early for feature requests... but I would love to see an EPUB edition! It shouldn't be too hard once done with the hard work of getting the data structured.
I'm curious: did you also check whether the signature mark was indeed found every 16 pages? Were there any scans missing?
Great project btw!
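The check asked about above could look something like this in Python. It assumes you already have the list of scan page numbers on which a signature mark was detected; the function name `check_signatures` and the sample numbers are made up for illustration, and only the 16-page interval comes from the question.

```python
def check_signatures(signature_pages, step=16):
    """Return (previous_page, page, gap) for every gap that is not `step` pages."""
    pages = sorted(signature_pages)
    return [
        (prev, cur, cur - prev)
        for prev, cur in zip(pages, pages[1:])
        if cur - prev != step
    ]


# Hypothetical detections: the gap between pages 33 and 47 is only 14.
print(check_signatures([1, 17, 33, 47, 63]))
```

A gap that is a multiple of 16 most likely means a mark was not detected, while any other gap hints at missing or duplicated scans in that range.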
I just use ChatGPT for spelling fixes (i.e. when rewriting articles). You just have to instruct it to NOT auto-rephrase the article.
In context: heh.
(I know, typo not OCR-o, but still...)