Hacker News new | past | comments | ask | show | jobs | submit
re comments:

yes of course this is apples to oranges but that's kind of the point

it shows the vast span between specialized hardware throughput IFF you can use an A100 at its limit vs overhead of one of the most popular programming languages in use today that eventually does the "same thing" on a CPU

the interesting thing is why that is so

CPU vs GPU (latency vs throughput), boxing vs dense representation, interpreter overhead, scalar execution, layers upon layers, …

A100 FP32 throughput “at its limit”: 19.5 TFLOP/s.

AMD EPYC 9965 FP32 throughput “at its limit”: 41.2 TFLOP/s (192 cores x 64 FP32 FLOP/cycle/core x 3.35GHz).

loading story #48250750
loading story #48250412
loading story #48249895
A100: 312 TFLOP/s for FP16

but it is very impressive how far modern CPUs get as well (also in smart phones!)

loading story #48248366