Hacker News

I Put a Datacenter GPU in My Gaming PC for £200

https://blog.tymscar.com/posts/v100localllm/
Tesla V100 SXM2 16GB is NOT DGX class as the author writes. It's HGX class. The SXM form factor comes in several generations, including SXM2 (used by the V100) and SXM4 (used by the A100), the latter with a max of 80 GB of on-board memory. Typically these are installed as 8× A100 80GB SXM4 on an HGX riser, which gives you NVSwitch fabric and 640 GB of pooled HBM2e (on-package stacked memory w/ ~2 TB/s of memory bandwidth). 2U standard rack footprint too.
Impressive work. But the problem is not the 30 tok/s, which is fine for agentic coding and chat.

It's prefill; slow prefill kills agentic workloads dead.

If you have 100,000 tokens at ~150tok/s per the OP, you're looking at:

    You have: 100000 / (150/s)

    You want: hms

     11 min + 6.6666667 sec
Which is quite a wait indeed.
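
For anyone who wants to try other context sizes, the same arithmetic can be sketched in Python. The 150 tok/s prefill rate is the figure quoted above from the OP; the function name and the smaller prompt sizes are made up for illustration.

```python
def prefill_wait(prompt_tokens: int, prefill_tok_per_s: float) -> tuple[int, float]:
    """Return (minutes, seconds) spent on prefill before the first output token."""
    total_seconds = prompt_tokens / prefill_tok_per_s
    minutes, seconds = divmod(total_seconds, 60)
    return int(minutes), seconds


# 150 tok/s is the prefill rate quoted from the OP; the prompt sizes are illustrative.
for prompt_tokens in (8_000, 32_000, 100_000):
    minutes, seconds = prefill_wait(prompt_tokens, 150)
    print(f"{prompt_tokens:>7} tokens -> {minutes} min {seconds:.1f} sec")
```

The 100,000-token case reproduces the 11 min 6.7 sec above; this assumes prefill speed stays flat as the context grows, which may be optimistic.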
Most people won’t be dumping 100K tokens into it at once, but I agree that all of the prefill time that adds up during a session becomes a lot to account for.

This is also a problem for all of the Mac local LLMs. Macs are a great way to get a lot of high bandwidth memory, but their compute is very far behind current gen dedicated GPUs. Some of the expensive Mac Studio setups allow you to run very large models with usable tokens/s, but you can be waiting a long time for it to get to the point of generating those tokens.

> And yes, if you want the absolute best, Opus 4.8 exists. It also costs more per 20 minutes of heavy use than I paid for this entire GPU and adapter setup combined. But the gap is shockingly small.

I don't think this is a fair characterization of the situation. I use frontier models via API pre-paid tokens every single day, and I can barely rack up $100 per month. The fact that we figured out how to burn double this in 20 minutes is impressive, but I don't think it reflects the reality that many are experiencing right now. There are some exceptionally gluttonous approaches to harnessing LLMs that I think are serving as convenient straw men in these discussions.

Paying for the API will almost always be more economical than self-hosting equivalent infrastructure. I am not against self-hosting, but the article suggests a primarily economic motivation for this effort. If you are consuming fewer than 10^9 tokens per month, I really don't think it's worth your time to try and compete with the hyperscalers. Most of the money is to be found in the integration of this technology with existing businesses.

I use hosted providers myself, but I can easily churn through $100 worth of tokens in half a day, even with cheap models like DeepSeek. If someone's use is as light as yours, then sure - grab a subscription and you'll save far more. For higher use, whether it is worth offloading at least some of it will come down to how cheap your electricity is (for me it's not, FWIW).
Claude is something like $35 per million tokens. If I was using API pricing I could trivially spend $100 in a single hour-long coding session, or in about 10 minutes with /fast turned on. Not sure how you guys are using it.
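
A rough sketch of what that implies, taking the ~$35 per million tokens figure at face value (it is the commenter's estimate, not a quoted price list; the helper name is hypothetical):

```python
def tokens_per_second_to_spend(budget_usd: float,
                               usd_per_million_tokens: float,
                               minutes: float) -> float:
    """Sustained token rate needed to burn budget_usd within the given minutes."""
    tokens = budget_usd / usd_per_million_tokens * 1_000_000
    return tokens / (minutes * 60)


# $100 and $35/M tokens come from the comment above; 60 and 10 minutes are its two cases.
for minutes in (60, 10):
    rate = tokens_per_second_to_spend(100, 35, minutes)
    print(f"$100 in {minutes} min needs ~{rate:,.0f} tokens/s")
```

At that price $100 buys roughly 2.9 million tokens, input and output combined.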
Great write-up, I've often considered these DC cards for a project and now you've convinced me to pick one up; you describe the price of the unit against what one spends on tokens and that does it for me.
Based on the title I was really hoping to see how this was used for gaming, but they just ran an LLM on it.
Same. With no new NVIDIA gaming GPUs this year, seems like an interesting problem to solve.
I don't think that is even possible; every piece of silicon on that chip that is required for gaming has been ripped out in favor of more compute cores.
The AMD MI250X GPUs are also interesting - 128GB of HBM2E at 3TB/s, sometimes you see them second-hand for under $1k, the catch obviously is that it needs an OAM socket. Never seen an easy way to hook them up to a regular mainboard.
An additional complication is that MI250Xes are two GPUs in one package, so you need to connect the first and last x16 SERDES groups to the host, otherwise you'll only see one GPU (or it won't work at all, idk).

Also, the cheap HPE pulls on eBay need some proprietary HPE magic to work, and I have yet to see anyone figure that out.

These are interesting, and offer beefy throughput. No point in adapting them to a PCIe slot though; they'd be stuck behind the slot-bus bottleneck.
Ahh luckily this OAM socket will prevent me from spending money.
Could probably avoid the crazy fan with a waterblock - I've seen a whole kit, V100 + PCIe adapter + block, for £235. Yes, you'll have to pay for a pump, radiators and radiator fans, but that should really quieten it down.
Someone's already made such a kit as you describe to fit in a consumer PC case and work properly?
Congrats! Most people won’t want to debug drivers, kernels, ACPI, adapters, and fan headers. But for those who do, the capability-per-pound is absurd.
despite gaming being used in the title, it is not mentioned in the article, but i'm curious how this performs.

i've run some multi-vendor frankenstein setups before and sometimes it even works, so i'm curious to hear your experience with it.

The real question: did your local LLM write this post?
AI written posts will kill HN.
Some context:

- In 2017, the V100 was a ~$10,000 GPU. I believe there was a PCIe version, but this is probably so cheap because SXM2 is going to be harder to use;

- A 5090 has 1800GB/s of internal memory bandwidth (compared to 900GB/s in the 9-year-old GPU). Of course a 5090 is substantially more expensive;

- A 5090 has ~21k CUDA cores vs ~5k;

- The current $10k NVidia GPU is the RTX 6000 Pro w/ 96GB of VRAM. It has slightly more CUDA cores but is otherwise pretty much just a 5090. This is unsurprising. NVidia uses VRAM for market segmentation.

Consider this: in 5-10 years, the trillions spent on AI data centers will likewise be sold for scrap most likely. That's how short the runway is for OpenAI and Anthropic to recover that investment.

Anyway, I'm kind of impressed the author managed to get this all to work. I don't think it even would've occurred to me that someone had made an SXM2 adapter, particularly because it's not even used anymore. Like props to whoever did that.

> Consider this: in 5-10 years, the trillions spent on AI data centers will likewise be sold for scrap most likely. That's how short the runway is for OpenAI and Anthropic to recover that investment.

Even more interesting: it'll devalue all of SaaS and the entire US tech sector.

We might have just shot our most valuable non-AI tech products in the foot.

I bet 3 years, but otherwise agree.
Volta (and Pascal, which I'm using) should still be supported with driver 580 as long as you don't use the open modules, and you can use up to CUDA 12.9 and cuDNN 9.10.2. No need to limit yourself to an old kernel.
But could you game with the GPU? Or is that purely a drivers issue?
Wait a few years, everyone will be able to put one at half the price.
Wow. V100. That brings back memories. Way to go.
> The compute is still real. The VRAM is still real. And the memory bandwidth is where it gets genuinely surprising.

Had to stop there. Annoying. I can't stand AI use for writing. It makes any otherwise great article feel so disingenuous.

What a difficult world you must live in these days
That line was the exact moment I also realized the post was AI written. I kept reading though, but I am left constantly guessing at which key details might be pure hallucinations.
> The compute is still real. The VRAM is still real. And the memory bandwidth is where it gets genuinely surprising.

sigh

Some resell group is going to have to make this easier. The sheer amount of these cards otherwise heading towards the landfill is staggering. That is, if Big Tech doesn't destroy them to prevent model weights from leaking.
Things like this have started to show up on eBay: https://www.ebay.com/itm/198383386991

  2X NVIDIA Tesla V100 32GB NVLink Water Cooled X99 E5-2686v4 AI Workstation PC

  Item                              Quantity
  Intel Xeon E5-2686 v4 CPU           1
  2U CPU Cooler                       1
  Jingyue X99 Motherboard             1
  DDR3 Memory                         32GB
  SSD                                 480GB
  AMD Radeon R5 240 4K Display Card   1
  NVIDIA Tesla V100 32GB SXM2 GPU     2
  NVLink SXM2 Dual-GPU Baseboard      1
  Corsair Water Cooling System        2
  850W Bronze Power Supply            1
  Dual-GPU 300G NVLink SXM2 Baseboard 1
  8654 Data Cable                     2
  8654 to PCIe Adapter Card           1
How would destroying the GPUs prevent the model weights from leaking? By the time you get your hands on them, the memory has been powered off long enough that a cold-boot style attack is impossible.
> The sheer amount of these cards otherwise heading towards the landfill is staggering.

The thought of throwing away working cards sounds so bizarre to me. I can't believe companies would dispose of them in a landfill like that; they are at least worth giving away for reuse.

> The compute is still real. The VRAM is still real. And the memory bandwidth is where it gets genuinely surprising.

Because humans write exactly like this /s

Where do you think llms learned to write that way?
You can also look at past posts by the same author (before LLM usage proliferated) if you’re curious.

The project is still very cool, but it’s a little less enjoyable to read when everything sounds the same. It would be just as annoying for people to manually write in a corporate/marketing style, because humanity is what makes the small web interesting.

https://blog.tymscar.com/posts/privategithubcicd/

This. Setting aside the LLM issue, it is dealing with hardware in ways that, one would think, would be celebrated on HN of all places. But we focus on presentation.
Because their custom training data contains an emphasis on such verbiage. It doesn't come from the God-knows-how-many TB of web content the model is pre-trained on. There, such phrasing is only a drop in the sea. But the "yes, you're right" phrases, the em dash, etc., come from the later stage, for which content is created according to some (probably overprecise) guidelines.
Right. The overuse of "genuinely" most of all. Seems like they put Claude through a few good rounds of training to always answer questions about its consciousness, thoughts, etc., with something about how it's "genuinely unsure," and as a result, the model learned to use "genuinely" as an intensifier in all sorts of inappropriate contexts.
Oi, I personally use adverbs everywhere. Genuinely, kids these days.
Marketing content.
> Where do you think llms learned to write that way?

Not from individual human content, that's for sure - maybe MLM marketing copy? Sleazy 4AM ads?

I mean, every time this response comes up, I keep asking the person to point at something written prior to 2022 that gets 80%+ on the LLM detectors, and yet no one can find anything.

Maybe you, postalrat, can find something written in this style that was published prior to 2022.

It's a function of the LLM "thought process"! It's not really modeled after human speech. It is in short segments but not long form, same reason you see the same rather odd nuances in LLM generated code.

If the way you thought was to run a bunch of if statements, generate content, then feed that content back to get a "score" of what seems the most plausible, run the if statements again, and adjust / merge responses, then you would write similarly. The recognizable cadence of LLM generated content is pretty clearly the result of a lot of if statements being fused together.

There's interesting stuff in this writeup but it sure seems like most of it was written by an LLM.
You know what the sad bit is? Humans do write exactly like that. That's not even particularly egregious StalkedIn marketroid speak.
X is Y. Z is Y. And Alpha is genuinely Beta.

Classic LLM writing style.

A little bit of local copium but neat read.

Isn't a Raspberry Pi with 16GB of RAM $300 now?

The latest Raspberry Pi 5 has one 32-bit channel (2x 16-bit subchannels) of LPDDR4X-4267 SDRAM giving 17.1GB/s of bandwidth, 52x less than this GPU. Never mind lacking the CUDA and Tensor cores, so the FP16 performance is 102x less (307 GFLOPS vs 31.4 TFLOPS). So for £200, there's absolutely no comparison for this specific use-case.
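
Those ratios are easy to re-derive. A small Python sketch using only figures quoted in this thread (900GB/s and 31.4 TFLOPS FP16 for the V100, LPDDR4X-4267 on a 32-bit channel and 307 GFLOPS for the Pi 5); the variable names are just for illustration:

```python
# Raspberry Pi 5 memory: LPDDR4X-4267 on one 32-bit channel.
transfers_per_s = 4267e6          # 4267 MT/s
bus_bytes = 32 / 8                # 32-bit channel = 4 bytes per transfer
pi_bandwidth_gbs = transfers_per_s * bus_bytes / 1e9

v100_bandwidth_gbs = 900          # figure quoted earlier in the thread
pi_fp16_gflops = 307              # FP16 figures from the comment above
v100_fp16_gflops = 31_400

bandwidth_ratio = v100_bandwidth_gbs / pi_bandwidth_gbs
compute_ratio = v100_fp16_gflops / pi_fp16_gflops

print(f"Pi 5 bandwidth: {pi_bandwidth_gbs:.1f} GB/s")
print(f"V100 / Pi 5 bandwidth: {bandwidth_ratio:.1f}x")
print(f"V100 / Pi 5 FP16: {compute_ratio:.1f}x")
```

These are peak numbers on both sides, so treat the ratios as a ceiling on the gap rather than a benchmark result.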
Yeah, that's what I'm saying. How is it so cheap????
I don't understand what point you're trying to make here? Are you talking about the price of RAM?