Story Detail of id 47476848 | Liveview Hacker News

bertili 8 hours ago | on: Flash-MoE: Running a 397B Parameter Model on a Laptop

Very impressive! I wonder if there is a similar path for Linux using system memory instead of SSD? Hell, maybe even a case for the return of some kind of ROMs of weights?

daemonologist 7 hours ago | parent | next

Most definitely - the popular engines have extensive support for doing this and controlling exactly which weights end up where (llama.cpp: https://github.com/ggml-org/llama.cpp/blob/master/tools/cli/... , vllm: https://docs.vllm.ai/en/stable/configuration/engine_args/#of... , sglang (haven't tried this): https://docs.sglang.io/advanced_features/server_arguments.ht...).

Even with a MoE model, which has to move a relatively small portion of the weights around, you do end up quite bandwidth constrained though.
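
To make the bandwidth point concrete, here is a rough back-of-envelope sketch. It assumes decode is purely bandwidth-bound and that each active weight is read once per token. The active parameter count, the quantization, and the bandwidth figures are illustrative assumptions, not measurements of any model or machine from this thread.

```python
def decode_tokens_per_second(active_params: float,
                             bytes_per_param: float,
                             bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed when every active weight is read once per token."""
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token


# Hypothetical figures, chosen only to show the order-of-magnitude gaps.
ACTIVE_PARAMS = 17e9      # assumed active parameters per token for a large MoE
BYTES_PER_PARAM = 0.5     # assumed 4-bit quantization
TIERS = {
    "GPU VRAM": 900e9,    # assumed bytes/s
    "system RAM": 80e9,   # assumed bytes/s
    "NVMe SSD": 6e9,      # assumed bytes/s
}

for name, bandwidth in TIERS.items():
    rate = decode_tokens_per_second(ACTIVE_PARAMS, BYTES_PER_PARAM, bandwidth)
    print(f"{name}: at most {rate:.1f} tokens/s")
```

The bound ignores compute, caching of hot experts, and transfer overhead, so real numbers will differ; the point is only that the slowest tier holding the active weights sets the ceiling.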

zozbot234 8 hours ago | parent | next

Loading experts to system memory is supported by most local-AI frameworks. But you do not gain much by running that part of the decode on GPU, since decode is not compute-limited and the CPU-GPU transfer involves overhead. It's best to use the GPU for speeding up the shared part of the model.

Aurornis 7 hours ago | parent | next

Using system memory and CPU compute for some of the layers that don’t fit into GPU memory is already supported by common tools.

It’s workable for mixture of experts models but the performance falls off a cliff as soon as the model overflows out of the GPU and into system RAM. There is another performance cliff when the model has to be fetched from disk on every pass.

zozbot234 7 hours ago | root | parent

It's less of a "performance falls off a cliff" problem and more of a "once you offload to RAM/storage, your bottleneck is the RAM/storage and basically everything else no longer matters". This means if you know you're going to be relying on heavy offload, you stop optimizing for e.g. lots of VRAM and GPU compute since that doesn't matter. That saves resources that you can use for scaling out.

Aurornis 3 hours ago | root | parent

It depends on the model and the mix. For some MoE models lately it’s been reasonably fast to offload part of the processing to CPU. The speed of the GPU still contributes a lot as long as it’s not too small of a relative portion of compute.
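
A minimal sketch of how the GPU/RAM mix plays out, assuming per-token time is just the bytes read from each tier divided by that tier's bandwidth, with the two tiers processed one after the other. The function name, the byte count, and the bandwidths are hypothetical; compute time and CPU-GPU transfer overhead are ignored.

```python
def seconds_per_token(bytes_per_token: float,
                      gpu_fraction: float,
                      gpu_bandwidth: float,
                      ram_bandwidth: float) -> float:
    """Estimate per-token time when a fraction of the weights is served from VRAM
    and the rest from system RAM, read sequentially."""
    if not 0.0 <= gpu_fraction <= 1.0:
        raise ValueError("gpu_fraction must be between 0 and 1")
    gpu_bytes = bytes_per_token * gpu_fraction
    ram_bytes = bytes_per_token - gpu_bytes
    return gpu_bytes / gpu_bandwidth + ram_bytes / ram_bandwidth


# Hypothetical figures for illustration only.
BYTES_PER_TOKEN = 8.5e9   # assumed bytes of active weights read per token
GPU_BW = 900e9            # assumed VRAM bandwidth, bytes/s
RAM_BW = 80e9             # assumed system RAM bandwidth, bytes/s

for fraction in (1.0, 0.75, 0.5, 0.25, 0.0):
    t = seconds_per_token(BYTES_PER_TOKEN, fraction, GPU_BW, RAM_BW)
    print(f"{fraction:.0%} of active weights on GPU: about {1.0 / t:.1f} tokens/s")
```

Under these assumptions both comments hold at once: the RAM term dominates as soon as a sizable share is offloaded, yet every extra slice kept on the GPU still shortens the total.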

K0balt 8 hours ago | parent

My thoughts exactly. Something like this could mean that modest GPU capacity, like a pair of 3090s, plus lots of RAM makes big inference more practical for personal labs.
