Story Detail of id 47476802 | Liveview Hacker News

JSR_FDED8 hours ago | on: Flash-MoE: Running a 397B Parameter Model on a Laptop

This is a very impressive result. If I understand correctly the bottleneck is the SSD in this architecture - the author seems to get almost 15GB/s - but I seem to remember the max b/w was about 8GB/s. What am I missing?

Roxxik8 hours ago | parent | next

IO is very bursty in these setups. When the router results are in you can start loading experts from SSD. In this brief moment the SSD is saturated.

Outside of that the SSD is idling.

Table 3 shows for K=4 experts an IO of 943 MB/Tok at 3.15 Tok/s giving an average IO of 2970 MB/s far below what the SSD could do.

I'm not sure, but not all expert weights are used immediately. Maybe they could do async reads for the down tensors parallelizing compute with IO.

Not sure if this works on Mac, I only tested my larger than RAM setup on Linux with io_uring O_DIRECT reads and I saw that about 20% of total reads do finish while my fused upgate matmul is already running.

Edit: Typos

zozbot2348 hours ago | root | parent

The github page mentions that you can't overlap SSD traffic and GPU compute on Apple Silicon, you get heavy contention for the shared hardware resources.

devnotes777 hours ago | root | parent

[dead]

Aurornis6 hours ago | parent | next

PCIe 5 doubles the maximum throughout. That’s why the numbers for newer SSDs are about double what you recall for the old maximum.

rado8 hours ago | parent

MacBook Pro M5 Pro and M5 Max have such SSD speed

selimthegrim8 hours ago | root | parent

I have an MBP M4 Pro and a WD Black SN850x in an external TB5 enclosure and I easily get 6-7 GB/s

#visit	13,228,869
#session	74,665
#live-session	0