Story Detail of id 43054802 | Liveview Hacker News

daveguy 1 week ago | on: We were wrong about GPUs

This makes sense to me. When I optimize, the most significant gains I find are algorithmic: removing an extra call, tweaking a data structure, or just using a library that operates closer to silicon. I rarely need to go to assembly or even a lower level language to get acceptable performance. The only exception is occasionally getting into architecture specifics of a GPU. At this point, optimizing compilers are excellent and probably have more architecture details baked into them than I will ever know. Thank you, compiler programmers!

almostgotcaught 1 week ago | parent | next

> At this point, optimizing compilers are excellent

the only people that say this are people who don't work on compilers. ask anyone that actually does and they'll tell you most compilers are pretty mediocre (they tend to miss a lot of optimization opportunities), some compilers are horrendous, and a few are good in a small domain (matmul).

mandevil 1 week ago | root | parent

It's more that the gods of Moore's Law have given us so many transistors that we are essentially always I/O blocked, so it effectively doesn't matter how good our assembly is for all but the most specialized of applications. Good assembly, bad assembly, whatever, the point is that your thread is almost always going to be blocked waiting for I/O (disk, network, human input) rather than something that a fancy optimization of the loop that enables better branch prediction can fix.

almostgotcaught 1 week ago | root | parent

> It's more that the gods of Moore's Law have given us so many transistors that we are essentially always I/O blocked

this is again just more brash confidence without experience. you're wrong. this is a post about GPUs and so i'll tell you that as a GPU compiler engineer i spend my entire day (work day) staring at and thinking about asm in order to affect register pressure and ILP and load/store efficiency etc.

> rather than something that a fancy optimization of the loop

a fancy loop optimization (pipelining) can fix some problems (load/store efficiency) but create other problems (register pressure). the fundamental fact is that the NFL theorem applies here fully: you cannot optimize for all programs uniformly.

https://en.wikipedia.org/wiki/No_free_lunch_theorem
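
To make the pipelining trade-off above concrete, here is a minimal sketch in plain C rather than GPU code. The function names are made up for illustration and do not come from any real compiler or codebase. The second version issues the load for the next iteration before doing the arithmetic for the current one, which can hide load latency but keeps one more value live at a time. An optimizing compiler may well turn both versions into the same machine code, so treat this as a picture of the idea, not a benchmark.

```c
#include <assert.h>
#include <stddef.h>

/* Plain loop: each iteration loads a value and uses it right away. */
float sum_scaled_plain(const float *data, size_t n, float scale) {
    assert(data != NULL || n == 0);
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc += data[i] * scale;
    }
    return acc;
}

/* Hand-pipelined loop (hypothetical illustration): the load for element
 * i + 1 is issued before the arithmetic on element i, so load latency can
 * overlap with compute. The price is an extra live value (`next`), which is
 * the kind of thing that adds up to register pressure in deeper pipelines. */
float sum_scaled_pipelined(const float *data, size_t n, float scale) {
    assert(data != NULL || n == 0);
    if (n == 0) {
        return 0.0f;
    }
    float acc = 0.0f;
    float current = data[0];          /* prologue: first load */
    for (size_t i = 0; i + 1 < n; ++i) {
        float next = data[i + 1];     /* load for the next iteration */
        acc += current * scale;       /* compute on the current value */
        current = next;
    }
    acc += current * scale;           /* epilogue: last element */
    return acc;
}
```

Both functions accumulate in the same order, so they return the same result; only the schedule of loads relative to arithmetic differs.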

godelski 1 week ago | root | parent

I just want to second this. Some of my close friends are PL people working on compilers. I was in HPC before coming to ML, having written a fair amount of CUDA kernels, a lot of parallelism, and dealing with I/O.

While yes, I/O is often the bottleneck, I'd be hesitant to say that in the consumer space, where we aren't installing flash buffers, performing in situ processing, or even pre-fetching. Hell, in many programs I barely even see any caching! TBH, most stuff can greatly benefit from asynchronous and/or parallel operations. Yeah, I/O is an issue, but I really would not call anything I/O bound until you've actually gotten into parallelism and optimizing code. And not even then, until you've applied the same to your I/O operations! There is just so much optimization that a compiler can never do, and so much optimization that a compiler won't do unless you're giving it tons of hints (all that "inline", "const", and other stuff you see in C, not to mention the hell that is template metaprogramming). These are things you could never get out of a dynamically typed language like Python, no matter how much of the backend is written in C.

That said, GPU programming is fucking hard. Godspeed you madman, and thank you for your service.

davemp 1 week ago | parent | next

> At this point, optimizing compilers are excellent and probably have more architecture details baked into them than I will ever know.

While modern compilers are great, you’d be surprised by the seemingly obvious optimizations compilers can’t do, either because of language semantics or because the required code transformations would be infeasible to detect.

I type versions of functions into godbolt all the time, and it’s very interesting to see what code is or isn’t equivalent after the -O3 passes.
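
One well-known case of language semantics getting in the way is pointer aliasing in C. The sketch below uses made-up function names and is only an illustration of the general point, not something taken from the comment above. In the first function the compiler has to assume that writing `dst[i]` might change `*scale`, so it generally cannot just load `*scale` once before the loop. In the second, `restrict` is the programmer's promise that the two pointers do not overlap, which makes hoisting the load and vectorizing the loop legal.

```c
#include <assert.h>
#include <stddef.h>

/* No aliasing promise: dst and scale may point into the same array, so the
 * value of *scale can change while the loop runs. */
void scale_may_alias(float *dst, const float *scale, size_t n) {
    assert(scale != NULL);
    for (size_t i = 0; i < n; ++i) {
        dst[i] *= *scale;
    }
}

/* `restrict` (C99) promises that dst and scale do not overlap, so the
 * compiler is allowed to read *scale once and keep it in a register.
 * Calling this with overlapping pointers is undefined behavior. */
void scale_no_alias(float *restrict dst, const float *restrict scale, size_t n) {
    assert(scale != NULL);
    for (size_t i = 0; i < n; ++i) {
        dst[i] *= *scale;
    }
}
```

The aliased case is not hypothetical: calling `scale_may_alias(buf, &buf[0], 3)` on `{2, 3, 4}` yields `{4, 12, 16}`, because the first store changes the scale factor used by the later iterations. A compiler that hoisted the load would produce `{4, 6, 8}` instead, which is why it is not allowed to without the `restrict` hint.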

fpoling 1 week ago | parent

The need to expose SSE instructions in systems languages shows that compilers are not good at translating straightforward code into optimal machine code. And using SSE properly often speeds up the code several times over.
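
For readers who have not seen what "exposing SSE to the language" looks like, here is a minimal sketch in C using the intrinsics from `<xmmintrin.h>`. The function name `add_arrays` is made up for illustration. This particular loop is simple enough that compilers usually vectorize it on their own, so it only shows the shape of intrinsic code; the cases the comment has in mind are the ones where auto-vectorization gives up. The `__SSE__` guard is an assumption about GCC/Clang-style predefined macros, with a scalar fallback for other targets.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#endif

/* Element-wise out[i] = a[i] + b[i]. */
void add_arrays(const float *a, const float *b, float *out, size_t n) {
    assert(a != NULL && b != NULL && out != NULL);
    size_t i = 0;
#if defined(__SSE__)
    /* Four floats per iteration, using unaligned loads and stores so the
     * caller does not need to guarantee 16-byte alignment. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar tail, and the whole loop on targets without SSE. */
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```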
