Story Detail of id 48248075 | Liveview Hacker News

p1esk 8 hours ago | on: Making Deep Learning Go Brrrr from First Principles (2022)

A100 FP32 throughput “at its limit”: 19.5 TFLOP/s.

AMD EPYC 9965 FP32 throughput “at its limit”: 41.2 TFLOP/s (192 cores x 64 FP32 FLOP/cycle/core x 3.35GHz).
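For anyone checking the arithmetic: the figure above is just cores x FLOP/cycle/core x clock. A minimal sketch of that calculation follows. The helper name `peak_tflops` is made up for illustration, and the inputs are the numbers quoted in the comment, not independently measured values.

```python
def peak_tflops(cores: int, flop_per_cycle_per_core: int, clock_ghz: float) -> float:
    """Theoretical peak in TFLOP/s: cores x FLOP/cycle/core x clock (GHz).

    Hypothetical helper; it ignores real-world limits such as clock
    throttling under wide-vector load and memory bandwidth.
    """
    gflops = cores * flop_per_cycle_per_core * clock_ghz  # GHz -> GFLOP/s
    return gflops / 1000.0  # GFLOP/s -> TFLOP/s


# EPYC 9965 figures as quoted in the comment above.
epyc_fp32 = peak_tflops(cores=192, flop_per_cycle_per_core=64, clock_ghz=3.35)
print(f"EPYC 9965 FP32 peak: {epyc_fp32:.1f} TFLOP/s")  # prints 41.2
```

This is an upper bound only; sustained throughput on real kernels will be lower.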

That's also a CPU that came out four years later than the A100. The contemporaneous B200 is not optimized for FP32 and does 74.45 TFLOP/s. For FP16 it's at ~2 PFLOP/s.

p1esk 1 hour ago | root | parent

The point is that modern CPUs are not as slow as most DL people think. Roughly 10x slower but with a lot more memory.

zzzoom 3 hours ago | parent | next

EPYC 9965: 614GBps of 12-channel DDR5-6400

A100: 1935GBps of HBM2e

Most of those FLOPS are constrained by memory bandwidth.
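To put a number on the bandwidth point, here is a rough roofline-style sketch using only the peak and bandwidth figures quoted in this thread. The function names and the `CHIPS` table are illustrative assumptions, and the simple model (attainable = min(peak, bandwidth x arithmetic intensity)) ignores caches entirely.

```python
def ridge_point(peak_tflops: float, bandwidth_gbps: float) -> float:
    """FLOP per byte of memory traffic needed before compute becomes the limit."""
    return (peak_tflops * 1e12) / (bandwidth_gbps * 1e9)


def attainable_tflops(peak_tflops: float, bandwidth_gbps: float,
                      flop_per_byte: float) -> float:
    """Simple roofline: the lower of the compute roof and the bandwidth roof."""
    memory_bound = bandwidth_gbps * 1e9 * flop_per_byte / 1e12
    return min(peak_tflops, memory_bound)


# (FP32 peak in TFLOP/s, memory bandwidth in GB/s), as quoted in the thread.
CHIPS = {
    "EPYC 9965": (41.2, 614.0),
    "A100": (19.5, 1935.0),
}

for name, (peak, bw) in CHIPS.items():
    print(f"{name}: needs ~{ridge_point(peak, bw):.1f} FLOP/byte to hit peak; "
          f"at 1 FLOP/byte it tops out near "
          f"{attainable_tflops(peak, bw, 1.0):.2f} TFLOP/s")
```

Under these assumptions the CPU needs roughly 67 FLOP per byte fetched from DRAM to reach its FP32 peak, against roughly 10 for the A100, so low-intensity kernels would be bandwidth-bound on both, and more severely on the CPU.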

tosh 7 hours ago | parent

A100: 312 TFLOP/s for FP16

but it is very impressive how far modern CPUs get as well (also in smart phones!)

p1esk 7 hours ago | root | parent

Intel Xeon 6980P: 128 cores x 1024 FP16 FLOP/cycle/core x 3.2 GHz: 419 TFLOP/s

tosh 6 hours ago | root | parent

I'm not saying "GPU more brrt than CPU"

I found the comparison interesting

on Intel Xeon 6980P with 419 TFLOP/s it is still (maybe even more?) interesting to ask:

how much throughput can you reach with Python, Python with lib x, y, z, with C++ like this, with C++ like that etc etc and why?

no?
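One way to start answering that question is to measure the pure-Python end of the spectrum. The following is a minimal sketch using only the standard library; `matmul` and `measure_gflops` are illustrative names, the matrix size is arbitrary, and the result will vary by machine and interpreter. It counts 2 x n^3 floating-point operations for an n x n multiply.

```python
import random
import time


def matmul(a, b):
    """Naive triple-loop matrix multiply on lists of lists."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        row_out = out[i]
        for p in range(k):
            a_ip = a[i][p]
            row_b = b[p]
            for j in range(m):
                row_out[j] += a_ip * row_b[j]
    return out


def measure_gflops(n=96):
    """Time one n x n multiply and return achieved GFLOP/s (2 * n^3 FLOPs)."""
    a = [[random.random() for _ in range(n)] for _ in range(n)]
    b = [[random.random() for _ in range(n)] for _ in range(n)]
    start = time.perf_counter()
    matmul(a, b)
    elapsed = time.perf_counter() - start
    return 2 * n ** 3 / elapsed / 1e9


print(f"pure Python matmul: {measure_gflops():.4f} GFLOP/s")
```

Comparing the printed figure with the theoretical peaks quoted above gives a feel for how large the gap is before any library or compiled code enters the picture; swapping in an optimized BLAS-backed library would be the natural next data point.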

p1esk 6 hours ago | root | parent

No one in their right mind would use pure Python to do matrix multiplication. It’s like using a screwdriver to hammer nails into wood.

But this discussion is even more bizarre than comparing a screwdriver to a hammer, it’s like comparing a screwdriver to a nail.
