Hacker News new | past | comments | ask | show | jobs | submit
Single core vs multi core accounts for much of this
Not really. GPU many cores, at least for fp32, gives you 2 to 4 order of magnitudes compared to high speed CPU.

The rest will be from "python float" (e.g. not from numpy) to C, which gives you already 2 to 3 order of magnitude difference, and then another 2 to 3 from plan C to optimized SIMD.

See e.g. https://github.com/Avafly/optimize-gemm for how you can get 2 to 3 order of magnitude just from C.

loading story #48248114