Hacker News

Microsoft open-sources "the earliest DOS source code discovered to date"

https://arstechnica.com/gadgets/2026/04/microsoft-open-sources-the-earliest-dos-source-code-discovered-to-date/
wow, they had to OCR it back in from paper printouts

> This source code is old enough that it hadn’t been stored digitally. “A dedicated team of historians and preservationists led by Yufeng Gao and Rich Cini,” calling itself the “DOS Disassembly Group,” painstakingly transcribed and scanned in code from paper printouts provided by Paterson. This process was made even more difficult because modern OCR software struggled with the quality of the decades-old printout.

I'd like to hear more about what works in OCR of dot-matrix fonts.

I've been able to OCR letter-quality printer output to 97% accuracy (the remaining errors are mostly Os and Xs).

But it seems that machine-learning text-recognition is also now biased to reject computer code because it doesn't look like human language.

There's a writeup here from one of the people on the team about the work it took to go from the listings to source code. http://cini.classiccmp.org/recoveryblog.htm

> With less-than-satisfactory OCR output, I resorted to a process I used many years ago when converting scans made of old Commodore ROM dumps printed on a Commodore 1515 dot-matrix printer. The process relies on the ASCII OCR output having the same repetitive errors. "B" and "8", "S" and "5" are good examples, as are "l" and "1", and "O" and "0". There are many other similar single-character errors and, when working with x86 code, there are similar errors with instructions like "MOV". This process naturally works better if the output file is monolithic rather than single-page OCR conversions because you can do substitutions across the entire converted printout and not 75 separate files.

> The next formatting hassle was the spacing. This required repetitive substitutions of a descending number of spaces to tabs (i.e., replace 8 spaces with a tab, 7, 6, etc.). Then if you want to return it to fixed spaces (which is likely how the original printer printed it -- spaces and not vertical tabs), you can. For pure re-creation work, spaces produce absolute column formatting while tabs can move around depending on the program displaying the file.

> Once you run through the 15 or so common global substitutions and tab conversion, it's a lot easier to work with the file to fix formatting and perform other cleanup. This is then followed by a line-by-line comparison against the original printouts. Overall I'd say the conversion output quality with this method is very good.
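
For anyone wanting to try the approach quoted above, here is a rough Python sketch of the two steps: global substitutions over one monolithic file, then the descending spaces-to-tabs pass. This is not the team's actual tooling. The `MNEMONIC_FIXES` table and the function names are made up for illustration, and the real list of "15 or so" substitutions is not given in the writeup.

```python
import re

# Hypothetical table of repeated OCR misreadings -> intended mnemonic.
# The actual substitution list from the writeup is not published here.
MNEMONIC_FIXES = {
    "M0V": "MOV",
    "PU5H": "PUSH",
    "P0P": "POP",
}

# One whole-word pattern, so "M0V" is fixed but a longer token is left alone.
_FIX_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(bad) for bad in MNEMONIC_FIXES) + r")\b"
)


def fix_mnemonics(text):
    """Apply every substitution across the whole listing in one pass."""
    return _FIX_PATTERN.sub(lambda m: MNEMONIC_FIXES[m.group(1)], text)


def spaces_to_tabs(line, max_run=8):
    """Replace runs of spaces with a tab, longest run first (8, 7, ... 2).

    Single spaces are left alone so operands stay readable.
    """
    for n in range(max_run, 1, -1):
        line = line.replace(" " * n, "\t")
    return line


def tabs_to_spaces(line, tab_width=8):
    """Go back to fixed columns, assuming tab stops every tab_width columns."""
    return line.expandtabs(tab_width)


if __name__ == "__main__":
    raw = "M0V     AX,BX\nPU5H    AX\n"
    for row in fix_mnemonics(raw).splitlines():
        print(repr(spaces_to_tabs(row)))
```

Note that the tab pass only marks where a gap was, it does not record how wide the gap was, so `tabs_to_spaces` restores the original columns only when the listing was aligned to regular tab stops in the first place.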

Pretty interesting. I wonder if a whitelist against certain columns in the output could help, e.g. this column can only contain valid x86 instructions (e.g. MOV is allowed, M0V is not), this column can only contain hexadecimal (1 is allowed but never "l"), etc. Probably more work than it's worth given the final line-by-line comparison that happens anyway.
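
A minimal sketch of that whitelist idea in Python, under some loud assumptions: the listing layout here (a hex address field followed by a mnemonic field, whitespace-separated) is invented for the example and is not the layout of the actual printouts, and `VALID_MNEMONICS` is a tiny illustrative subset, not the full x86 instruction set.

```python
import re

# Illustrative subset only; a real check would need the full mnemonic list.
VALID_MNEMONICS = {"MOV", "PUSH", "POP", "ADD", "SUB", "JMP", "CALL", "RET", "INT"}

# Upper-case hex digits only, so a lower-case "l" or a letter "O" is rejected.
HEX_FIELD = re.compile(r"[0-9A-F]+")


def check_line(line):
    """Return a list of problems found in one listing line.

    Assumed (hypothetical) layout: <hex address> <mnemonic> [operands...]
    """
    fields = line.split()
    if len(fields) < 2:
        return ["too few fields"]

    address, mnemonic = fields[0], fields[1]
    problems = []
    if not HEX_FIELD.fullmatch(address):
        problems.append(f"non-hex address: {address!r}")
    if mnemonic not in VALID_MNEMONICS:
        problems.append(f"unknown mnemonic: {mnemonic!r}")
    return problems


if __name__ == "__main__":
    for row in ["01A0 MOV AX,BX", "01A0 M0V AX,BX", "0lA0 MOV AX,BX"]:
        print(row, "->", check_line(row))
```

This only flags suspect lines for a human to look at; it does not try to guess the correction, which fits with the line-by-line comparison against the printouts happening anyway.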
Yet another case where text printed on paper outlived any digital storage.
Seems like it was never digitally stored in the first place, and the printed text was barely readable due to age. Not really a big win for paper.
Well it had to have been on disk or tape at some point. It wasn't all typed in by hand every time they needed to build a new version.
unless they used punch cards
Punch cards are still a form of digital storage, mind.
Also a form of storing things on paper
Reminds me of an old fortune cookie message or meme, something like "digital data is made from analog parts".