> At this point, optimizing compilers are excellent

the only people that say this are people who don't work on compilers. ask anyone who actually does and they'll tell you most compilers are pretty mediocre (they miss a lot of optimization opportunities), some compilers are horrendous, and a few are good in one small domain (matmul).

It's more that the God of Moore's Law has given us so many transistors that we are essentially always I/O blocked, so it effectively doesn't matter how good our assembly is for all but the most specialized of applications. Good assembly, bad assembly, whatever: your thread is almost always going to be blocked waiting for I/O (disk, network, human input) rather than on something a fancy loop optimization that enables better branch prediction could fix.
> It's more that the God of Moore's Law has given us so many transistors that we are essentially always I/O blocked

this is again just more brash confidence without experience. you're wrong. this is a post about GPUs, so i'll tell you that as a GPU compiler engineer i spend my entire (work) day staring at and thinking about asm in order to affect register pressure, ilp, load/store efficiency, etc.

> rather than something that a fancy optimization of the loop

a fancy loop optimization (pipelining) can fix some problems (load/store efficiency) but create other problems (register pressure). the fundamental fact is that the NFL (no free lunch) theorem applies here fully: you cannot optimize for all programs uniformly.

https://en.wikipedia.org/wiki/No_free_lunch_theorem
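
To make that trade-off concrete, here's a minimal sketch (a made-up grid-stride kernel, not from the article) of manual software pipelining in CUDA: the next element is fetched while the current one is computed on, which hides load latency but keeps two values live per thread instead of one, i.e. more register pressure.

    __global__ void scale(const float* __restrict__ in,
                          float* __restrict__ out, int n, float a) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      int stride = gridDim.x * blockDim.x;
      if (i >= n) return;
      float cur = in[i];                             // prologue: first load
      for (int j = i; j < n; j += stride) {
        int nexti = j + stride;
        float nxt = (nexti < n) ? in[nexti] : 0.0f;  // issue next load early
        out[j] = a * cur;                            // compute on the current value
        cur = nxt;                                   // rotate the pipeline
      }
    }

Whether that's a net win depends on how many registers the kernel already burns, which is exactly why there's no free lunch.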

I just want to second this. Some of my close friends are PL people working on compilers. I was in HPC before coming to ML, having written a fair number of CUDA kernels, done a lot of parallelism work, and dealt with plenty of I/O.

While yes, I/O is often the bottleneck, I'd be shy to really say that in a consumer space where we aren't installing flash buffers, doing in situ processing, or even prefetching. Hell, in many programs I barely even see any caching! TBH, most stuff can greatly benefit from asynchronous and/or parallel operations. Yeah, I/O is an issue, but I really would not call anything I/O bound until you've actually gotten into parallelism and optimized the code, and not even then until you've applied the same treatment to your I/O operations! There is just so much optimization a compiler can never do, and so much it won't do unless you give it tons of hints (all the "inline", "const", and such that you see in C, not to mention the hell that is template metaprogramming). Things you could never get out of a dynamically typed language like Python, no matter how much of the backend is written in C.
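
For what it's worth, a tiny sketch (hypothetical saxpy kernel, purely illustrative) of the kind of hints I mean: const and __restrict__ promise the compiler there's no aliasing, and the unroll pragma is an explicit nudge it won't necessarily take on its own.

    __global__ void saxpy(int n, float a,
                          const float* __restrict__ x,
                          float* __restrict__ y) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      #pragma unroll 4
      for (int j = i; j < n; j += gridDim.x * blockDim.x)
        y[j] = a * x[j] + y[j];
    }

Whether a given compiler actually does anything different with those hints varies, which is kind of the point.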

That said, GPU programming is fucking hard. Godspeed, you madman, and thank you for your service.