Folks have more money than sense; gpt-oss-120b full quant runs on my quad 3090s at 100 tk/sec, and that's with llama.cpp. With vllm it will probably run at 150 tk/sec, and that's without batching.
Thanks for chiming in. I'm looking for a reasonably cheap local LLM machine, and multiple 3090s is exactly what I planned to buy. Do you have any recommendations or recommend any reading material before I decide to spend money on that?

edit: Found your comment about /r/localllama, but if you have anything more to add I'm still very interested.

> gpt-oss-120b full quant runs on my quad 3090

A 120B model cannot fit on 4 x 24GB GPUs at full quantization.

Either you're confusing this with the 20B model, or you have 48GB modded 3090s.

Some of you folks on here love to argue. gpt-oss-120b was trained in 4 bits, so it pretty much takes up 60 GB.
Good point, but you still need KV cache and more. Fitting the model alone to RAM doesn’t get the job done.
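A back-of-the-envelope check of both claims (my arithmetic, not from the thread; it ignores activations and runtime overhead):

    # rough VRAM math: 4-bit weights vs. quad-3090 capacity
    weights_gb=$(( 120 * 4 / 8 ))   # 120B params at ~4 bits/param -> 60 GB of weights
    vram_gb=$(( 4 * 24 ))           # four 24 GB 3090s -> 96 GB total
    echo "$(( vram_gb - weights_gb )) GB left over for KV cache and overhead"
    # -> 36 GB left over for KV cache and overhead

So the weights alone fit with room to spare on four cards, but nowhere near on one.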
You're almost certainly (definitely, in fact) confusing the 120b and 20b models.
I'm most certainly not doing so.

   seg@seg-epyc:~/models$ du -sh * /llmzoo/models/* | sort -n
   4.0K metrics.txt
   4.0K opus
   4.0K start_llama
   8.2G nvidia_Orchestrator-8B-Q8_0.gguf
   12K  config.ini
   34G  Qwen3.5-27B
   47G  Qwen3.5-35B
   51G  Qwen3.5-27B-BF16
   61G  gpt-oss-120b-F16.gguf
   65G  Qwen3.5-35B-BF16
   106G Qwen3.5-122B-Q6
   117G GLM4.6V
   175G MiniMax-M2.5
   232G /llmzoo/models/small_models
   240G Ernie4.5-300B
   377G DeepSeekv3.2-nolight
   380G /llmzoo/models/DeepSeek-V3.2-UD
   400G /llmzoo/models/Qwen3.5-397B-Q8
   424G /llmzoo/models/KimiK2Thinking
   443G DeepSeek-Math-v2
   443G DeepSeek-V3-0324-Q5
   500G /llmzoo/models/GLM5-Q5
   546G /llmzoo/models/KimiK2.5
Oh I missed the "quad" before 3090.
How're you fitting a model made for 80 gig cards onto a GPU with 24 gigs at full quant?
Offloading the MoE layers to CPU inference is the easiest way, though it's a bit of a drag on performance.
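For concreteness, a sketch of what that offload looks like with llama.cpp (flag names as found in recent builds; verify against `llama-server --help` on your version, and the model path here is just the filename from the listing above):

    # -ngl 99 pushes all layers to the GPU; --n-cpu-moe then keeps the expert
    # (MoE) tensors of the first N layers in system RAM, trading throughput
    # for VRAM. Tune N until the model plus KV cache fits.
    llama-server -m ~/models/gpt-oss-120b-F16.gguf -ngl 99 --n-cpu-moe 10 -c 16384

Because only a few experts are active per token, the CPU-resident tensors are touched sparsely, which is why this hurts less than offloading dense layers.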
Yeah, I'd just be pretty surprised if they were getting 100 tokens/sec that way.

EDIT: Either they edited that to say "quad 3090s", or I just missed it the first time.

you are correct, I did forget to add quad. you should join us in r/localllama

check out what other people are getting. you're welcome.

https://www.reddit.com/r/LocalLLaMA/comments/1nunq7s/gptoss1... https://www.reddit.com/r/LocalLLaMA/comments/1p4evyr/most_ec...

Thanks for the confirmation, wasn't sure if I was just going a bit senile heh. Yeah, I love /r/localllama, some of the best actual practitioners of this stuff on the internet. Also, crazy awesome frankenrigs to try and get that many huge cards working together.

I was considering picking up a couple of the 48 gig 4090/3090s on an upcoming trip to China, but I just ended up getting one of the Max-Q's. But maybe the token throughput would still be higher with the 4090 route? Impressive numbers with those 3090s!

What's the rig look like that's hosting all that?

He said quad 3090 not single
Yeah, pretty sure that was edited in after I commented because 150 toks/sec was also new, but could’ve just missed it.