I'm running a 70b model now that's okay, but it's still fairly tight. And I've got 16gb more vram than the red v2.
I'm also confused why this is 12U. My whole rig is 4u.
The green v2 has better GPUs. But for $65k, I'd expect a much better CPU and 256gb of RAM. It's not like a threadripper 7000 is going to break the bank.
I'm glad this exists but it's... honestly pretty perplexing
I imagine that's because they are buying a single SKU for the shell/case. I imagine their answer to your question would be: In order to keep prices low and quality high, we don't offer any customization to the server dimensions
I used to own a Dell Poweredge for my home-office, but those fans even on minimal setting kept me up at night
The thing that’s less useful is the 64G VRAM/128G System RAM config. Even the large MoE models only need 20B for the router, so the rest of the VRAM is essentially wasted (mixing experts between VRAM and system RAM has basically no performance benefit).
But yeah, 4x Blackwell 6000s are ~32-36k, not sure where the other $30k is going.
edit: Found your comment about /r/localllama, but if you have anything more to add I'm still very interested.
A 120B model cannot fit on 4 x 24GB GPUs at full quantization.
Either you're confusing this with the 20B model, or you have 48GB modded 3090s.
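For anyone who wants to check the fit claim themselves, here's a rough back-of-envelope sketch. It only counts the weights (no KV cache, activations, or framework overhead), treats "120B" as exactly 120 billion parameters, and treats a 24GB card as 24 GiB, all of which are simplifying assumptions on my part.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


VRAM_GIB = 4 * 24  # four 24GB cards (assumed to be 24 GiB each)

for bits in (16, 8, 4):
    size = weight_gib(120, bits)
    verdict = "fits" if size < VRAM_GIB else "does not fit"
    print(f"{bits:>2}-bit: {size:6.1f} GiB -> {verdict} in {VRAM_GIB} GiB")
```

By this estimate the weights only squeeze in at around 4 bits per weight, and that's before you reserve any room for context.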
seg@seg-epyc:~/models$ du -sh * /llmzoo/models/* | sort -n
4.0K metrics.txt
4.0K opus
4.0K start_llama
8.2G nvidia_Orchestrator-8B-Q8_0.gguf
12K config.ini
34G Qwen3.5-27B
47G Qwen3.5-35B
51G Qwen3.5-27B-BF16
61G gpt-oss-120b-F16.gguf
65G Qwen3.5-35B-BF16
106G Qwen3.5-122B-Q6
117G GLM4.6V
175G MiniMax-M2.5
232G /llmzoo/models/small_models
240G Ernie4.5-300B
377G DeepSeekv3.2-nolight
380G /llmzoo/models/DeepSeek-V3.2-UD
400G /llmzoo/models/Qwen3.5-397B-Q8
424G /llmzoo/models/KimiK2Thinking
443G DeepSeek-Math-v2
443G DeepSeek-V3-0324-Q5
500G /llmzoo/models/GLM5-Q5
546G /llmzoo/models/KimiK2.5
EDIT: Either they edited that to say "quad 3090s", or I just missed it the first time.
check out what other people are getting. you're welcome.
https://www.reddit.com/r/LocalLLaMA/comments/1nunq7s/gptoss1... https://www.reddit.com/r/LocalLLaMA/comments/1p4evyr/most_ec...
I was considering picking up a couple of the 48 gig 4090/3090s on an upcoming trip to China, but I just ended up getting one of the Max-Q's. But maybe the token throughput would still be higher with the 4090 route? Impressive numbers with those 3090s!
What's the rig look like that's hosting all that?
I don't see the 120B claim on the page itself. Unless the page has been edited, I think it's something the submitter added.
I agree, though. The only way you're running 120B models on that device is either extreme quantization or by offloading layers to the CPU. Neither will be a good experience.
These aren't a good value buy unless you compare them to fully supported offerings from the big players.
It's going to be hard to target a market where most people know they can put together the exact same system for thousands of dollars less and have it assembled in an afternoon. RTX 6000 96GB cards are in stock at Newegg for $9000 right now which leaves almost $30,000 for the rest of the system. Even with today's RAM prices it's not hard to do better than that CPU and 256GB of RAM when you have a $30,000 budget.
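The arithmetic behind that, as a quick sketch. The $65k system price and $9000 card price are the figures quoted in this thread; the card count of four is an assumption taken from the other comments here.

```python
SYSTEM_PRICE = 65_000  # system price quoted in this thread
GPU_PRICE = 9_000      # per-card price quoted above
GPU_COUNT = 4          # assumption: four cards, as mentioned elsewhere in the thread

gpu_total = GPU_PRICE * GPU_COUNT
remainder = SYSTEM_PRICE - gpu_total

print(f"GPUs: ${gpu_total:,}")
print(f"Left for CPU, RAM, board, PSU, chassis: ${remainder:,}")
```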
Can't you offload KV to system RAM, or even storage? It would make it possible to run with longer contexts, even with some overhead. AIUI, local AI frameworks include support for caching some of the KV in VRAM, using an LRU policy, so the overhead would be tolerable.
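To put a number on how much memory long contexts need, here's a minimal sketch of the usual KV cache size estimate for a transformer with grouped-query attention. The model shape (80 layers, 8 KV heads, head dimension 128, fp16 cache) is an illustrative assumption, not a figure from this thread.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold keys and values for n_tokens of context."""
    # Factor of 2: one tensor for keys, one for values, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

for ctx in (8_192, 32_768, 131_072):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx) / 2**30
    print(f"{ctx:>7} tokens -> {gib:5.1f} GiB of KV cache")
```

The cache grows linearly with context length, which is why it's tempting to push it out of VRAM.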
With that said, people are trying to extend VRAM into system RAM or even NVMe storage, but as soon as you hit the PCI bus with the high bandwidth layers like KV cache, you eliminate a lot of the performance benefit that you get from having fast memory near the GPU die.
Only useful for prefill (given the usual discrete-GPU setup; iGPU/APU/unified memory is different and can basically be treated as VRAM-only, though a bit slower) since the PCIe bus becomes a severe bottleneck otherwise as soon as you offload more than a tiny fraction of the memory workload to system memory/NVMe. For decode, you're better off running entire layers (including expert layers) on the CPU, which local AI frameworks support out of the box. (CPU-run layers can in turn offload to storage for model parameters/KV cache as a last resort. But if you offload too much to storage (insufficient RAM cache) that then dominates the overhead and basically everything else becomes irrelevant.)
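A rough way to see the decode bottleneck is to treat each generated token as a fixed number of bytes that must be read, with some fraction crossing PCIe instead of coming from VRAM. This is a simplified sketch: the 40 GB read per token, the 1000 GB/s VRAM bandwidth, and the 32 GB/s PCIe figure (roughly PCIe 4.0 x16, theoretical) are all illustrative assumptions, and it assumes the two transfers don't overlap.

```python
def decode_tokens_per_s(offload_fraction: float,
                        gb_read_per_token: float = 40.0,  # assumed
                        vram_gbps: float = 1000.0,        # assumed
                        pcie_gbps: float = 32.0) -> float:  # assumed
    """Upper bound on decode speed when memory bandwidth is the limit."""
    in_vram = (1.0 - offload_fraction) * gb_read_per_token
    over_pcie = offload_fraction * gb_read_per_token
    seconds_per_token = in_vram / vram_gbps + over_pcie / pcie_gbps
    return 1.0 / seconds_per_token


for frac in (0.0, 0.05, 0.10, 0.25, 0.50):
    print(f"{frac:4.0%} over PCIe -> {decode_tokens_per_s(frac):5.1f} tok/s")
```

Under these assumptions, even a small offloaded fraction dominates the time per token, which matches the point above about PCIe becoming the bottleneck.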