the only people that say this are people who don't work on compilers. ask anyone that actually does and they'll tell you most compilers are pretty mediocre (they tend to miss a lot of optimization opportunities), some compilers are horrendous, and a few are good in a small domain (matmul).
this is again just more brash confidence without experience. you're wrong. this is a post about GPUs, so i'll tell you that as a GPU compiler engineer i spend my entire day (work day) staring at and thinking about asm in order to affect register pressure, ilp, load/store efficiency, etc.
> rather than something that a fancy optimization of the loop
a fancy loop optimization (pipelining) can fix some problems (load/store efficiency) but create other problems (register pressure). the fundamental fact is that the no-free-lunch (NFL) theorem applies here fully: you cannot optimize for all programs uniformly.
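to make the tradeoff concrete, here's a minimal CPU-side sketch in C (not GPU code, and not software pipelining proper, just the same flavor of tradeoff). the function names `sum_simple` and `sum_unrolled4` are made up for illustration. unrolling with independent accumulators exposes more ilp, but the loop now keeps four live accumulators instead of one.

```c
#include <assert.h>
#include <stddef.h>

/* One accumulator: every add depends on the previous one,
 * so the loop is a single serial dependency chain. */
float sum_simple(const float *x, size_t n) {
    assert(x != NULL || n == 0);
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* Unrolled by 4 with independent accumulators: four add chains
 * can be in flight at once (more ILP), at the cost of four live
 * accumulators instead of one (more register pressure). */
float sum_unrolled4(const float *x, size_t n) {
    assert(x != NULL || n == 0);
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    float acc = (a0 + a1) + (a2 + a3);
    for (; i < n; i++) /* leftover elements when n is not a multiple of 4 */
        acc += x[i];
    return acc;
}
```

note that the unrolled version adds the floats in a different order, so in general it can round differently. that's why gcc and clang won't do this rewrite for you at plain -O3 unless you relax the float rules (e.g. -ffast-math).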
While yes, I/O is often the bottleneck, I'd be shy to really say that in the consumer space when we aren't installing flash buffers, performing in situ processing, or even pre-fetching. Hell, in many programs I barely even see any caching! TBH, most stuff can greatly benefit from asynchronous and/or parallel operations. Yeah, I/O is an issue, but I really would not call anything I/O bound until you've actually gotten into parallelism and optimizing code. And even then, not until you've applied this to your I/O operations! There is just so much optimization that a compiler can never do, and so much optimization that a compiler won't do unless you're giving it tons of hints (all that "inline", "const", and stuff you see in C. Not to mention the hell that is template metaprogramming). Things you could never get out of a dynamically typed language like Python, no matter how much of the backend is written in C.
That said, GPU programming is fucking hard. Godspeed you madman, and thank you for your service.
While modern compilers are great, you’d be surprised by the seemingly obvious optimizations compilers can’t do, either because language semantics forbid them or because the required code transformations would be infeasible to detect.
I type versions of functions into godbolt all the time and it’s very interesting to see which versions are and aren’t equivalent after the -O3 passes.
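Here’s a small C sketch of the kind of pair I mean (the names `scale_reload` and `scale_hoisted` are just made up for this example). The two functions look equivalent, but they aren’t: `scale` is allowed to point into `out`, so in the first version the compiler has to assume every store might change `*scale` and can’t simply hoist the load out of the loop.

```c
#include <assert.h>
#include <stddef.h>

/* scale may alias an element of out, so each store to out[i]
 * can legally change *scale; the load stays inside the loop. */
void scale_reload(float *out, const float *in, const float *scale, size_t n) {
    assert(out != NULL && in != NULL && scale != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * *scale;
}

/* Hand-hoisted version: *scale is read exactly once.
 * Same result only when scale does not point into out. */
void scale_hoisted(float *out, const float *in, const float *scale, size_t n) {
    assert(out != NULL && in != NULL && scale != NULL);
    const float s = *scale;
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * s;
}
```

If you call both with `scale` pointing at `out[0]`, you get different answers, which is exactly why the compiler isn’t free to turn one into the other. Marking the pointers `restrict` is the usual way to promise there is no aliasing.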