* RAM - $1500 - Crucial Pro 128GB Kit (2x64GB) DDR5 5600MHz RAM, CP2K64G56C46U5; the board takes up to 4 sticks, so one kit for 128GB or two for 256GB, Amazon
* GPU - $4700 - RTX Pro 5000 48GB, Microcenter
* CPU/Mobo bundle - $1100 - AMD Ryzen 7 9800X3D, MSI X870E-P Pro, ditch the 32GB RAM, Microcenter
* Case - $220, Hyte Y70, Microcenter
* Cooler - $155, Arctic Cooling Liquid Freezer III Pro, top-mount it, Microcenter
* PSU - $180, RM1000x, Microcenter
* SSD - $400 - Samsung 990 Pro 2TB Gen 4 NVMe M.2
* Fans - $100 - 6x 120mm fans, 1x 140mm fan, of your choice
Look into models like Qwen 3.5
This is certainly not the most effective use of $7k for running local LLMs.
The answer is a 16" M5 Max with 128GB for $5k. You can run much bigger models than on your setup, and it's an awesome portable machine for everything else.
For models that fit in the RTX Pro 5000's ~48GB of VRAM, the RTX card I described above has over 2x the memory bandwidth of an M5 Max.
If you spill into system RAM (the 128GB-256GB outside the GPU) to run larger models, that memory is ~6x slower than the M5 Max's unified memory.
So for models that fit in the ~48GB of VRAM, like dense Qwen3.5 27B models, the RTX will be 2-4x faster than an M5 Max. For models that don't fit, the M5 Max will be 5-20x faster.
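The bandwidth arguments above can be sanity-checked with a back-of-envelope roofline: at decode time every active weight is read from memory once per token, so tokens/sec is roughly bandwidth divided by bytes per token. A minimal sketch; the bandwidth figures below are illustrative placeholders, not measured specs for any of these machines:

```python
def decode_tps_upper_bound(mem_bw_gbs: float, active_params_b: float,
                           bytes_per_weight: float) -> float:
    """Roofline upper bound on decode speed: every active weight is
    read from memory once per generated token (ignores KV cache,
    activations, and compute limits)."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_weight
    return mem_bw_gbs * 1e9 / bytes_per_token

# Illustrative (not measured) bandwidths, dense 27B model at 4-bit:
for name, bw in [("workstation GPU VRAM", 1300.0),
                 ("unified memory", 550.0),
                 ("dual-channel DDR5", 90.0)]:
    print(f"{name}: ~{decode_tps_upper_bound(bw, 27, 0.5):.0f} tok/s max")
```

Real throughput lands well below this bound, but the ratios between memory tiers track the 2-4x vs 5-20x gaps described above.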
Also worth considering future upgrades: Do you plan to throw away the machine in a few years, or pick up multiple used RTX 6000 Pro cards when people start ditching them?
https://marketplace.nvidia.com/en-us/enterprise/personal-ai-...
A small joke at this week's GTC was that the "BOGOD" discount was there to sell them at $4K each...
Machines with the 4xx chips are coming next month so maybe wait a week or two.
It's soldered LPDDR5X with AMD Strix Halo ... sglang and llama.cpp can handle that pretty well these days. And it's, you know, half the price, and you're not locked into the Nvidia ecosystem.
You can check what each model does on AMD Strix Halo here:
Mac Studio or Mac Mini, depending on which gives you the highest amount of unified memory for ~$5k.
I’m pretty curious to see any benchmarks on inference on VRAM vs UM.
Raptor Lake + 5080: 380.63 GB/s
Raptor Lake (CPU for reference): 20.41 GB/s
GB10 (DGX Spark): 116.14 GB/s
GH200: 1697.39 GB/s
This is an "eh, it works" benchmark, but it should give you a feel for the relative performance of the different systems. In practice, this means I can get something like 55 tokens/sec running a larger model like gpt-oss-120b-Q8_0 on the DGX Spark.
55 t/s is much better than I would have expected.
So LLM inference is relatively slow because of that bandwidth, but you can load much bigger, smarter models than you could on any consumer GPU.
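If you want a rough local number to compare against figures like the ones above, a single-threaded bulk copy gives a ballpark. This is only a sketch: real tools like STREAM use multiple threads and tuned kernels, so this will undershoot multi-channel peak bandwidth considerably.

```python
import time

def copy_bandwidth_gbs(size_mb: int = 256, reps: int = 5) -> float:
    """Crude memory-bandwidth estimate: time a bulk byte copy and
    count n bytes read + n bytes written. Single-threaded, so it
    understates what a multi-threaded benchmark would report."""
    n = size_mb * 1024 * 1024
    src = bytearray(n)
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        dst = bytes(src)  # copies all n bytes
        best = min(best, time.perf_counter() - t0)
    assert len(dst) == n
    return 2 * n / best / 1e9  # read + write, in GB/s

print(f"~{copy_bandwidth_gbs():.1f} GB/s")
```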
Nowadays I find most things work fine on Arm. Sometimes something needs to be built from source which is genuinely annoying. But moving from CUDA to ROCm is often more like a rewrite than a recompile.
Isn't everyone* in this segment just using PyTorch for training, or wrappers like Ollama/vllm/llama.cpp for inference? None have a strict dependency on CUDA. PyTorch's AMD backend is solid (for supported platforms, and Strix Halo is supported).
* enthusiasts whose budget is in the $5k range. If you're vendor-locked to CUDA, Mac Mini and Strix Halo are immediately ruled out.
For $5K one can get a desktop PC with an RTX 5090, which has 3x more compute but a quarter of the VRAM - so depending on the workload it may be a better option.
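The compute-vs-VRAM tradeoff really comes down to whether the model fits in the 5090's 32GB at all. A quick rule-of-thumb check; the 20% overhead factor is an assumption covering KV cache and activations, not a measured figure:

```python
def fits_in_vram(params_b: float, bytes_per_weight: float,
                 vram_gb: float, overhead: float = 1.2) -> bool:
    """Rough fit check: weight bytes times an overhead factor
    (KV cache, activations) compared against available VRAM."""
    return params_b * bytes_per_weight * overhead <= vram_gb

# RTX 5090: 32 GB VRAM. A 27B dense model at 4-bit (~0.5 B/weight)
# fits with room to spare; a 70B model at 4-bit does not.
print(fits_in_vram(27, 0.5, 32))  # 16.2 GB needed -> True
print(fits_in_vram(70, 0.5, 32))  # 42.0 GB needed -> False
```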