You can use this approach with Intel Optane, which, unlike NAND, is resistant to wear-out and can thus substitute for RAM. Last I checked, it was available quite cheap on the secondary market, ~$1/GB as opposed to ~$15/GB or more for DRAM. (Of course that's nowhere near as cheap as NAND, which is around $0.10/GB but quite prone to wear-out under heavy writes.)
Meanwhile PCIe switches exist. So why not build:
1 CPU + memory + ...
N PCIe switches, each with 1 low-memory GPU + 6 NVMe drives (in theory 5 can saturate the GPU)
Each of those should only need to bother the CPU when it has produced some tokens, and each has plenty of PCIe lanes to get at its own data.
Such a setup should be able to get a 6 to 8 times speedup over the solution detailed here, and an increase in model compute should make relatively little difference in performance.
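
Back-of-envelope for the "5 drives can saturate the GPU" figure, as a rough sketch only. It assumes the GPU sits on a PCIe 4.0 x16 link and that each NVMe drive sustains about 7 GB/s of sequential reads; both numbers are assumptions for illustration, not measurements, and the function names are made up here.

```python
import math

def pcie_gbps(lanes, gt_per_s=16.0):
    """Usable one-direction bandwidth in GB/s for a PCIe link.

    Default is PCIe 4.0 (16 GT/s per lane) with 128b/130b encoding,
    ignoring protocol overhead above the encoding layer.
    """
    return lanes * gt_per_s * (128 / 130) / 8

def drives_to_saturate(gpu_lanes=16, drive_gbps=7.0):
    """Smallest number of drives whose combined reads fill the GPU link.

    drive_gbps is an assumed sustained sequential read rate per drive.
    """
    return math.ceil(pcie_gbps(gpu_lanes) / drive_gbps)

gpu_link = pcie_gbps(16)          # about 31.5 GB/s for a 4.0 x16 link
needed = drives_to_saturate()     # drives needed at the assumed 7 GB/s each
print(f"GPU link: {gpu_link:.1f} GB/s, drives needed: {needed}")
```

Under those assumptions four drives fall a little short of the x16 link and the fifth tips it over, so the sixth drive per switch is headroom. Real random-read throughput will be lower than the sequential figure, which would push the count up.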