Story Detail of id 48247895 | Liveview Hacker News

tosh 7 hours ago | on: Making Deep Learning Go Brrrr from First Principles (2022)

re comments:

yes of course this is apples to oranges but that's kind of the point

it shows the vast span between the throughput of specialized hardware (if and only if you can use an A100 at its limit) and the overhead of one of the most popular programming languages in use today, which eventually does the "same thing" on a CPU

the interesting thing is why that is so

CPU vs GPU (latency vs throughput), boxing vs dense representation, interpreter overhead, scalar execution, layers upon layers, …
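To make the "boxing vs dense representation" item concrete, here is a minimal sketch using only the Python standard library. It assumes CPython, where each float in a list is a separate heap object; the variable names and the choice of `n` are illustrative, and the exact byte counts depend on the interpreter build.

```python
import sys
from array import array

n = 1000

# Boxed: a list holds pointers to individually allocated float objects.
boxed = [float(i) for i in range(n)]

# Dense: a contiguous buffer of C floats ("f" is single precision).
dense = array("f", boxed)

# Size of the list object itself plus every float object it points to.
boxed_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(x) for x in boxed)

# Payload size of the dense buffer: element size times element count.
dense_bytes = dense.itemsize * len(dense)

print(f"boxed: {boxed_bytes} bytes, dense: {dense_bytes} bytes")
print(f"ratio: {boxed_bytes / dense_bytes:.1f}x")
```

The memory gap is only part of the story: the boxed layout also forces a pointer dereference and a type check per element, which is where much of the interpreter overhead comes from.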

p1esk 7 hours ago | parent

A100 FP32 throughput “at its limit”: 19.5 TFLOP/s.

AMD EPYC 9965 FP32 throughput “at its limit”: 41.2 TFLOP/s (192 cores x 64 FP32 FLOP/cycle/core x 3.35 GHz).
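The CPU peak figure above is just cores times FLOP per cycle per core times clock. A short sketch that reproduces the arithmetic, taking the numbers in the comment at face value (the function name `peak_tflops` is made up for illustration, and the inputs are the commenter's, not independently verified):

```python
def peak_tflops(cores: int, flop_per_cycle_per_core: int, clock_ghz: float) -> float:
    """Theoretical peak in TFLOP/s: cores x FLOP/cycle/core x GHz, divided by 1000."""
    return cores * flop_per_cycle_per_core * clock_ghz / 1000.0

# Figures as quoted in the comment (assumed, not measured).
epyc_9965_fp32 = peak_tflops(cores=192, flop_per_cycle_per_core=64, clock_ghz=3.35)
a100_fp32 = 19.5  # TFLOP/s, as quoted in the comment

print(f"EPYC 9965 FP32 peak: {epyc_9965_fp32:.1f} TFLOP/s")   # 41.2
print(f"ratio vs A100 FP32:  {epyc_9965_fp32 / a100_fp32:.2f}x")  # 2.11
```

This is a theoretical ceiling only; it says nothing about what either chip sustains on a real workload.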

tosh 6 hours ago | root | parent

A100: 312 TFLOP/s for FP16

but it is very impressive how far modern CPUs get as well (also in smartphones!)
