Folks have more money than sense; gpt-oss-120b full quant runs on my quad 3090s at 100 tk/sec, and that's with llama.cpp. With vllm it will probably run at 150 tk/sec, and that's without batching.
Thanks for chiming in. I'm looking for a reasonably cheap local LLM machine, and multiple 3090s is exactly what I planned to buy. Do you have any recommendations or recommend any reading material before I decide to spend money on that?

edit: Found your comment about /r/localllama, but if you have anything more to add I'm still very interested.

> gpt-oss-120b full quant runs on my quad 3090

A 120B model cannot fit on 4 x 24GB GPUs at full quantization.

Either you're confusing this with the 20B model, or you have 48GB modded 3090s.

Some of you folks on here love to argue. gpt-oss-120b was trained in 4 bits, so it pretty much takes up 60 GB.
Good point, but you still need KV cache and more. Fitting the model alone to RAM doesn’t get the job done.
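A back-of-the-envelope check of both claims (my arithmetic, not from the thread; it ignores activations and runtime overhead):

    # rough VRAM math: 4-bit weights vs. quad-3090 capacity
    weights_gb=$(( 120 * 4 / 8 ))   # 120B params at ~4 bits/param -> 60 GB of weights
    vram_gb=$(( 4 * 24 ))           # four 24 GB 3090s -> 96 GB total
    echo "$(( vram_gb - weights_gb )) GB left over for KV cache and overhead"
    # -> 36 GB left over for KV cache and overhead

So the weights alone fit with room to spare on four cards, but nowhere near on one.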
You're almost certainly (definitely, in fact) confusing the 120b and 20b models.
I'm most certainly not doing so.

   seg@seg-epyc:~/models$ du -sh * /llmzoo/models/* | sort -n
   4.0K metrics.txt
   4.0K opus
   4.0K start_llama
   8.2G nvidia_Orchestrator-8B-Q8_0.gguf
   12K  config.ini
   34G  Qwen3.5-27B
   47G  Qwen3.5-35B
   51G  Qwen3.5-27B-BF16
   61G  gpt-oss-120b-F16.gguf
   65G  Qwen3.5-35B-BF16
   106G Qwen3.5-122B-Q6
   117G GLM4.6V
   175G MiniMax-M2.5
   232G /llmzoo/models/small_models
   240G Ernie4.5-300B
   377G DeepSeekv3.2-nolight
   380G /llmzoo/models/DeepSeek-V3.2-UD
   400G /llmzoo/models/Qwen3.5-397B-Q8
   424G /llmzoo/models/KimiK2Thinking
   443G DeepSeek-Math-v2
   443G DeepSeek-V3-0324-Q5
   500G /llmzoo/models/GLM5-Q5
   546G /llmzoo/models/KimiK2.5
Oh I missed the "quad" before 3090.
How're you fitting a model made for 80 gig cards onto a GPU with 24 gigs at full quant?
Offloading the MoE layers to CPU inference is the easiest way, though it's a bit of a drag on performance.
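For concreteness, a sketch of what that offload looks like with llama.cpp (flag names as found in recent builds; verify against `llama-server --help` on your version, and the model path here is just the filename from the listing above):

    # -ngl 99 pushes all layers to the GPU; --n-cpu-moe then keeps the expert
    # (MoE) tensors of the first N layers in system RAM, trading throughput
    # for VRAM. Tune N until the model plus KV cache fits.
    llama-server -m ~/models/gpt-oss-120b-F16.gguf -ngl 99 --n-cpu-moe 10 -c 16384

Because only a few experts are active per token, the CPU-resident tensors are touched sparsely, which is why this hurts less than offloading dense layers.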
Yeah, I'd just be pretty surprised if they were getting 100 tokens/sec that way.

EDIT: Either they edited that to say "quad 3090s", or I just missed it the first time.

you are correct, I did forget to add quad. you should join us in r/localllama

check out what other people are getting. you're welcome.

https://www.reddit.com/r/LocalLLaMA/comments/1nunq7s/gptoss1... https://www.reddit.com/r/LocalLLaMA/comments/1p4evyr/most_ec...

Thanks for the confirmation, wasn't sure if I was just going a bit senile heh. Yeah, I love /r/localllama, some of the best actual practitioners of this stuff on the internet. Also, crazy awesome frankenrigs to try and get that many huge cards working together.

I was considering picking up a couple of the 48 gig 4090/3090s on an upcoming trip to China, but I just ended up getting one of the Max-Q's. But maybe the token throughput would still be higher with the 4090 route? Impressive numbers with those 3090s!

What's the rig look like that's hosting all that?

He said quad 3090 not single
Yeah, pretty sure that was edited in after I commented because 150 toks/sec was also new, but could’ve just missed it.