Story Detail of id 47471203 | Liveview Hacker News

oofbey 1 day ago | on: Tinybox – A powerful computer for deep learning

DGX Spark is a fantastic option at this price point. You get 128 GB of VRAM, which is extremely difficult to get for the money. It’s also a fairly fast GPU. And stupidly fast networking: 200 Gbps or 400 Gbps Mellanox if you find the coin for another one.

ekropotin 1 day ago | parent | next

I’m not very well versed in this domain, but I think it’s not going to be “VRAM” (GDDR) memory, but rather “unified memory”, which is essentially RAM (some flavour of DDR5, I assume). These two types of memory have vastly different bandwidth.

I’m pretty curious to see any benchmarks of inference on VRAM vs UM.

banana_giraffe 19 hours ago | root | parent | next

A quick benchmark of float32 cuda->cuda copies in torch, comparing some random machines:

    Raptor Lake + 5080: 380.63 GB/s
    Raptor Lake (CPU for reference): 20.41 GB/s
    GB10 (DGX Spark): 116.14 GB/s
    GH200: 1697.39 GB/s

This is an "eh, it works" benchmark, but it should give you a feel for the relative performance of the different systems.
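For anyone wanting to try something similar, here is a minimal sketch of how such a copy benchmark might look. It is not the script that produced the numbers above: the buffer size, the repeat count, and the choice to count each copied byte once (rather than once for the read and once for the write) are all assumptions, and the function names are made up for illustration.

```python
import time


def copy_bandwidth_gbs(num_bytes, seconds, count_read_and_write=False):
    """Convert a timed buffer copy into GB/s (1 GB = 1e9 bytes).

    Assumption: by default each copied byte is counted once. Pass
    count_read_and_write=True to count the read and the write separately.
    """
    moved = num_bytes * (2 if count_read_and_write else 1)
    return moved / seconds / 1e9


def measure_device_copy(num_floats=64 * 1024 * 1024, repeats=10, device="cuda"):
    """Time float32 device-to-device copies with PyTorch.

    torch is imported lazily so the arithmetic helper above works without it.
    """
    import torch

    src = torch.empty(num_floats, dtype=torch.float32, device=device)
    dst = torch.empty_like(src)

    dst.copy_(src)  # warm-up copy so one-time setup cost is not timed
    if device.startswith("cuda"):
        torch.cuda.synchronize()  # wait for queued GPU work before timing

    start = time.perf_counter()
    for _ in range(repeats):
        dst.copy_(src)
    if device.startswith("cuda"):
        torch.cuda.synchronize()  # copies are asynchronous; wait for them
    elapsed = time.perf_counter() - start

    total_bytes = src.numel() * src.element_size() * repeats
    return copy_bandwidth_gbs(total_bytes, elapsed)
```

Calling `measure_device_copy()` on a machine with a GPU returns a GB/s figure; `measure_device_copy(device="cpu")` gives a CPU reference point. Results will vary with buffer size and with which counting convention is used.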

In practice, this means I can get something like 55 tokens a sec running a larger model like gpt-oss-120b-Q8_0 on the DGX Spark.

ekropotin 19 hours ago | root | parent

Nice! Thanks for that.

55 t/s is much better than I would have expected.

oofbey 23 hours ago | root | parent

I’m using VRAM as shorthand for “memory which the AI chip can use”, which I think is fairly common shorthand these days. For the Spark it is unified, and has lower bandwidth than most any modern GPU. (About 300 GB/s, which is comparable to an RTX 3060.)

So LLM inference is relatively slow because of that bandwidth, but you can load much bigger, smarter models than you could on any consumer GPU.
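The bandwidth argument can be put into rough numbers. The sketch below assumes single-stream decoding is purely memory-bound, so every generated token has to read each active weight once, and it ignores KV-cache traffic and compute. The "about 300 GB/s" figure is the one quoted above; the model sizes are hypothetical examples, and the function name is made up for illustration.

```python
def max_tokens_per_second(bandwidth_gb_s, active_params_billion, bytes_per_param):
    """Upper bound on decode speed if every token reads all active weights once.

    Assumptions: memory-bound decoding, batch size 1, no KV-cache or
    activation traffic. Real throughput will be lower than this bound.
    """
    bytes_per_token = active_params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Hypothetical dense 70B model at 8 bits per weight on ~300 GB/s memory.
dense = max_tokens_per_second(300, 70, 1)

# Hypothetical mixture-of-experts model with 5B active parameters per token.
sparse = max_tokens_per_second(300, 5, 1)

print(f"dense: {dense:.1f} tok/s, sparse: {sparse:.1f} tok/s")
```

The function takes active parameters rather than total parameters because a mixture-of-experts model only reads the experts it routes to for each token. That is one way a very large model can still decode at a usable speed on modest bandwidth, while the total parameter count determines whether it fits in memory at all.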

BobbyJo 1 day ago | parent | next

The internet seems to think the SW support for those is bad, and that Strix Halo boxes are better ROI.

oofbey 1 day ago | root | parent

Meh. DGX is Arm and CUDA. Strix is x86 and ROCm. CUDA has better support than ROCm. And x86 has better support than Arm.

Nowadays I find most things work fine on Arm. Sometimes something needs to be built from source which is genuinely annoying. But moving from CUDA to ROCm is often more like a rewrite than a recompile.

overfeed 21 hours ago | root | parent | next

> But moving from CUDA to ROCm is often more like a rewrite than a recompile.

Isn't everyone* in this segment just using PyTorch for training, or wrappers like Ollama/vLLM/llama.cpp for inference? None have a strict dependency on CUDA. PyTorch's AMD backend is solid (for supported platforms, and Strix Halo is supported).

* enthusiasts whose budget is in the $5k range. If you're vendor-locked to CUDA, Mac Mini and Strix Halo are immediately ruled out.
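As a small illustration of the "no strict dependency" point, here is a sketch of backend-agnostic device selection in PyTorch. It relies on the fact that ROCm builds of PyTorch expose AMD GPUs through the same `torch.cuda` namespace, so the `"cuda"` branch covers both vendors. The helper names are hypothetical, and this says nothing about how well any particular kernel performs on each backend.

```python
def pick_device(cuda_available, mps_available):
    """Choose a device string from availability flags.

    "cuda" covers NVIDIA GPUs and, on ROCm builds of PyTorch, AMD GPUs.
    "mps" is the Apple Silicon backend. Falls back to "cpu".
    """
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device():
    """Query PyTorch for available backends (torch imported lazily)."""
    import torch

    # Older builds may lack the mps backend module, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    mps_available = mps is not None and mps.is_available()
    return pick_device(torch.cuda.is_available(), mps_available)
```

Model code can then call `model.to(detect_device())` and run unchanged on whichever backend is present.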

oofbey 5 hours ago | root | parent

Most everything starts as PyTorch. (Or maybe JAX.) But the inference engines all use hand-tuned CUDA kernels, at least the good ones do. You have to do that to optimize things.

overfeed 1 hour ago | root | parent

I'm certain inference engines don't use hand-tuned CUDA on Radeon or Mac Mini chips. My statement holds: those engines have no strict dependency on CUDA, or they'd be Nvidia-only.

BobbyJo 23 hours ago | root | parent

CUDA != driver support. Driver support seems to be what's spotty with DGX, and iirc Nvidia has only committed to updates for 2 years or something.

borissk 1 day ago | parent

You can even network 4 of these together using a pretty cheap InfiniBand switch. There is a YouTube video of a guy building and benchmarking such a setup.

For $5K one can get a desktop PC with an RTX 5090, which has 3x more compute but 4x less VRAM, so depending on the workload it may be a better option.

ekropotin 1 day ago | root | parent

VRAM vs UM is not exactly an apples-to-apples comparison.
