Story Detail of id 47477644 | Liveview Hacker News

Aurornis 7 hours ago | on: Flash-MoE: Running a 397B Parameter Model on a Laptop

Using system memory and CPU compute for some of the layers that don’t fit into GPU memory is already supported by common tools.

It’s workable for mixture of experts models but the performance falls off a cliff as soon as the model overflows out of the GPU and into system RAM. There is another performance cliff when the model has to be fetched from disk on every pass.

zozbot234 6 hours ago | parent

It's less of a "performance falls off a cliff" problem and more of a "once you offload to RAM/storage, your bottleneck is the RAM/storage and basically everything else no longer matters". This means if you know you're going to be relying on heavy offload, you stop optimizing for e.g. lots of VRAM and GPU compute since that doesn't matter. That saves resources that you can use for scaling out.
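To put rough numbers on the bottleneck argument, here is a back-of-the-envelope sketch in Python. It assumes decode time per token is just the weight bytes read from each tier divided by that tier's bandwidth, with tiers read one after another and no caching or overlap. Every figure below (bandwidths, bytes per tier) and every name (`seconds_per_token`, `bandwidth`, `active_bytes`) is a made-up illustration, not a measurement of Flash-MoE or of any particular hardware.

```python
GB = 1e9

def seconds_per_token(bytes_by_tier, bandwidth_by_tier):
    """Per-tier read time for one token, assuming weight reads are the only cost."""
    return {
        tier: bytes_by_tier[tier] / bandwidth_by_tier[tier]
        for tier in bytes_by_tier
    }

# Hypothetical bandwidths in bytes/second for each memory tier.
bandwidth = {"vram": 1000 * GB, "ram": 100 * GB, "ssd": 5 * GB}

# Hypothetical bytes of active weights touched per token in each tier.
active_bytes = {"vram": 4 * GB, "ram": 4 * GB, "ssd": 2 * GB}

times = seconds_per_token(active_bytes, bandwidth)
total = sum(times.values())
ssd_share = times["ssd"] / total

# Same workload with a GPU whose memory is twice as fast.
faster_gpu = dict(bandwidth, vram=2 * bandwidth["vram"])
total_faster_gpu = sum(seconds_per_token(active_bytes, faster_gpu).values())

# Same workload with storage that is twice as fast.
faster_ssd = dict(bandwidth, ssd=2 * bandwidth["ssd"])
total_faster_ssd = sum(seconds_per_token(active_bytes, faster_ssd).values())

print(f"tokens/s baseline:   {1 / total:.2f}")
print(f"tokens/s faster GPU: {1 / total_faster_gpu:.2f}")
print(f"tokens/s faster SSD: {1 / total_faster_ssd:.2f}")
print(f"share of time spent on SSD reads: {ssd_share:.0%}")
```

Under these assumed numbers, the slowest tier accounts for most of the per-token time even though it holds the smallest slice of the weights, so doubling GPU memory bandwidth barely changes throughput while doubling storage bandwidth nearly doubles it.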
