I’m pretty curious to see any inference benchmarks on VRAM vs. unified memory (UM).
Raptor Lake + 5080: 380.63 GB/s
Raptor Lake (CPU for reference): 20.41 GB/s
GB10 (DGX Spark): 116.14 GB/s
GH200: 1697.39 GB/s
This is an "eh, it works" benchmark, but it should give you a feel for the relative performance of the different systems. In practice, this means I can get something like 55 tokens a sec running a larger model like gpt-oss-120b-Q8_0 on the DGX Spark.
55 t/s is much better than I could expect.
So LLM inference is relatively slow because of that bandwidth, but you can load much bigger, smarter models than you could on any consumer GPU.
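To make the bandwidth point concrete, here is a rough back-of-the-envelope sketch. It assumes token generation is purely memory-bandwidth-bound, so the ceiling is bandwidth divided by the bytes of weights read per token. The bandwidth figures are the ones quoted upthread; the 10 GB-per-token workload and the function name are made up for illustration. Mixture-of-experts models only read their active experts per token, so their per-token figure can be far smaller than the file size.

```python
# Bandwidth figures quoted earlier in the thread, in GB/s.
BANDWIDTH_GB_S = {
    "Raptor Lake + 5080": 380.63,
    "Raptor Lake (CPU)": 20.41,
    "GB10 (DGX Spark)": 116.14,
    "GH200": 1697.39,
}


def decode_ceiling_tokens_per_s(bandwidth_gb_s, gb_read_per_token):
    """Rough ceiling on decode speed if every token streams its weights once.

    Ignores compute, KV-cache traffic, and caching effects (assumption).
    """
    if gb_read_per_token <= 0:
        raise ValueError("gb_read_per_token must be positive")
    return bandwidth_gb_s / gb_read_per_token


# Hypothetical workload: 10 GB of weights touched per generated token.
GB_PER_TOKEN = 10.0

for system, bandwidth in BANDWIDTH_GB_S.items():
    ceiling = decode_ceiling_tokens_per_s(bandwidth, GB_PER_TOKEN)
    print(f"{system}: ~{ceiling:.1f} tokens/s ceiling")
```

The absolute numbers depend entirely on the assumed bytes per token; the useful part is the ratio between systems, which stays the same for any fixed workload.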
Nowadays I find most things work fine on Arm. Sometimes something needs to be built from source which is genuinely annoying. But moving from CUDA to ROCm is often more like a rewrite than a recompile.
Isn't everyone* in this segment just using PyTorch for training, or wrappers like Ollama/vLLM/llama.cpp for inference? None have a strict dependency on CUDA. PyTorch's AMD backend is solid (for supported platforms, and Strix Halo is supported).
* enthusiasts whose budget is in the $5k range. If you're vendor-locked to CUDA, Mac Mini and Strix Halo are immediately ruled out.
For $5k one can get a desktop PC with an RTX 5090, which has 3x more compute but 4x less VRAM, so depending on the workload it may be a better option.
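A quick way to sanity-check the "depending on the workload" part is to see whether the weights fit at all. This is a minimal sketch: the 32 GB and 128 GB capacities are my assumption for the RTX 5090 and the DGX Spark (consistent with the 4x figure above), the model sizes are hypothetical, and the helper names are invented. It ignores KV cache and runtime overhead unless you pass some headroom.

```python
def weight_size_gb(params_billion, bits_per_weight):
    """Approximate weight footprint in GB for a given quantization."""
    return params_billion * bits_per_weight / 8.0


def fits(params_billion, bits_per_weight, memory_gb, headroom_gb=0.0):
    """True if weights plus optional headroom fit in the given memory."""
    needed = weight_size_gb(params_billion, bits_per_weight) + headroom_gb
    return needed <= memory_gb


# Assumed capacities in GB (not stated in the thread).
SYSTEMS = {"RTX 5090": 32, "DGX Spark": 128}

# Hypothetical models: (parameters in billions, bits per weight).
MODELS = {"20B @ 8-bit": (20, 8), "120B @ 4-bit": (120, 4)}

for model, (params, bits) in MODELS.items():
    for system, memory in SYSTEMS.items():
        verdict = "fits" if fits(params, bits, memory, headroom_gb=4) else "does not fit"
        print(f"{model} on {system}: {verdict}")
```

If the model fits in 32 GB, the 5090's extra compute and bandwidth win; if it does not, the larger unified memory is the only option of the two that avoids offloading.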