Microsoft open-sources "the earliest DOS source code discovered to date"

https://arstechnica.com/gadgets/2026/04/microsoft-open-sources-the-earliest-dos-source-code-discovered-to-date/

407 | DamnInteresting | 22 hours ago | 142 | HN

wow, they had to OCR it back in from paper printouts

> This source code is old enough that it hadn’t been stored digitally. “A dedicated team of historians and preservationists led by Yufeng Gao and Rich Cini,” calling itself the “DOS Disassembly Group,” painstakingly transcribed and scanned in code from paper printouts provided by Paterson. This process was made even more difficult because modern OCR software struggled with the quality of the decades-old printout.

FarmerPotato 19 hours ago | parent | next

I'd like to hear more about what works in OCR of dot-matrix fonts.

I've been able to OCR letter-quality printer output to 97% accuracy (the errors are mostly Os and Xs).

But it seems that machine-learning text-recognition is also now biased to reject computer code because it doesn't look like human language.

ndiddy 9 hours ago | root | parent | next

There's a writeup here from one of the people on the team about the work it took to go from the listings to source code. http://cini.classiccmp.org/recoveryblog.htm

> With less-than-satisfactory OCR output, I resorted to a process I used many years ago when converting scans made of old Commodore ROM dumps printed on a Commodore 1515 dot-matrix printer. The process relies on the ASCII OCR output having the same repetitive errors. "B" and "8", "S" and "5" are good examples, as are "l" and "1", and "O" and "0". There are many other similar single-character errors and, when working with x86 code, there are similar errors with instructions like "MOV". This process naturally works better if the output file is monolithic rather than single-page OCR conversions because you can do substitutions across the entire converted printout and not 75 separate files.

> The next formatting hassle was the spacing. This required repetitive substitutions of a descending number of spaces to tabs (i.e., replace 8 spaces with a tab, then 7, 6, etc.). Then if you want to return it to fixed spaces (which is likely how the original printer printed it -- spaces and not vertical tabs), you can. For pure re-creation work, spaces produce absolute column formatting while tabs can move around depending on the program displaying the file.

> Once you run through the 15 or so common global substitutions and tab conversion, it's a lot easier to work with the file to fix formatting and perform other cleanup. This is then followed by a line-by-line comparison against the original printouts. Overall I'd say the conversion output quality with this method is very good.
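
A minimal sketch of the two mechanical passes the quoted writeup describes, in Python. The substitution table is hypothetical: the writeup names the confusable pairs (B/8, S/5, l/1, O/0) and mentions mangled mnemonics like "MOV", but it does not list its 15 or so actual substitutions. Stopping the space-to-tab pass at 2 spaces is also an assumption.

```python
import re

# Hypothetical examples of "same repetitive error" fixes, applied across the
# whole monolithic OCR file. The real substitution list is not in the writeup.
GLOBAL_FIXES = [
    (r"\bM0V\b", "MOV"),    # zero misread for letter O
    (r"\bP0P\b", "POP"),
    (r"\bPU5H\b", "PUSH"),  # five misread for letter S
]

def apply_global_fixes(text):
    """Run every substitution over the entire text in one go."""
    for pattern, replacement in GLOBAL_FIXES:
        text = re.sub(pattern, replacement, text)
    return text

def spaces_to_tabs(text, widest=8):
    """Replace runs of 8 spaces with a tab, then 7, 6, ... down to 2.

    Single spaces are left alone (an assumption; the writeup only says
    "8 spaces with a tab, 7, 6, etc.").
    """
    for n in range(widest, 1, -1):
        text = text.replace(" " * n, "\t")
    return text

def tabs_to_spaces(text, tabsize=8):
    """Return to fixed spaces so columns no longer depend on the viewer."""
    return "\n".join(line.expandtabs(tabsize) for line in text.split("\n"))

if __name__ == "__main__":
    raw = "LABEL:    M0V          AX"
    fixed = spaces_to_tabs(apply_global_fixes(raw))
    print(repr(fixed))
    print(tabs_to_spaces(fixed))
```

This naive pass does not guarantee the original column positions, which fits with the line-by-line comparison against the printouts still being needed afterwards.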

accrual 6 hours ago | root | parent

Pretty interesting. I wonder if a whitelist against certain columns in the output could help, e.g. this column can only contain valid x86 instructions (e.g. MOV is allowed, M0V is not), this column can only contain hexadecimal (1 is allowed but never "l"), etc. Probably more work than it's worth given the final line-by-line comparison that happens anyway.

SoftTalker 20 hours ago | parent

Yet another case where text printed on paper outlived any digital storage.

jshier 20 hours ago | root | parent | next

Seems like it was never digitally stored in the first place, and the printed text was barely readable due to age. Not really a big win for paper.

SoftTalker 20 hours ago | root | parent | next

Well it had to have been on disk or tape at some point. It wasn't all typed in by hand every time they needed to build a new version.

debesyla 16 hours ago | root | parent

unless they used punch cards

Sharlin 13 hours ago | root | parent | next

Punch cards are still a form of digital storage, mind.

wongarsu 12 hours ago | root | parent

Also a form of storing things on paper

accrual 6 hours ago | root | parent

Reminds me of an old fortune cookie message or meme, something like "digital data is made from analog parts".
