Story Detail of id 48390245 | Liveview Hacker News

superkuh 1 day ago | on: Gemma 4 12B: A unified, encoder-free multimodal model

>consumer-grade card with 12G of VRAM and got 5t/s

That token output speed suggests to me that it is running in hybrid mode and involving the CPU and system RAM. ~5 tk/s is about what DDR4 memory bandwidth allows for a model of that size at 4-bit. Any consumer GPU with 12 GB, like an NVIDIA RTX 2080 or RTX 3060, should be doing 20+ tk/s with llama.cpp and the CUDA backend.
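As a rough sketch of that bandwidth argument: for a dense model, each generated token has to read roughly every weight once, so memory bandwidth divided by model size gives an upper bound on tokens per second. The numbers below are assumptions, not measurements from this thread: about 4.5 bits per weight for a 4-bit quant once scales are included, 51.2 GB/s theoretical peak for dual-channel DDR4-3200, and 360 GB/s for the RTX 3060 12 GB. Real throughput will be lower than the bound.

```python
# Back-of-the-envelope bound: tokens/s <= memory bandwidth / model size.
# All hardware figures here are assumed theoretical peaks, not benchmarks.

def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB for a dense model."""
    return params_billion * bits_per_weight / 8.0


def max_tokens_per_second(bandwidth_gb_s: float, size_gb: float) -> float:
    """Upper bound on decode speed when every weight is read once per token."""
    return bandwidth_gb_s / size_gb


# Assumption: 12B parameters at ~4.5 bits per weight (4-bit quant plus scales).
size = model_size_gb(12, 4.5)

for name, bandwidth in [
    ("dual-channel DDR4-3200 (assumed)", 51.2),
    ("RTX 3060 12 GB (assumed)", 360.0),
]:
    bound = max_tokens_per_second(bandwidth, size)
    print(f"{name}: at most ~{bound:.1f} tok/s")
```

Under these assumptions the bound lands in the single digits for system RAM and in the tens of tok/s for the GPU, which is why ~5 tk/s looks like the weights are being read from system RAM.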

senko 1 day ago | parent

Good catch. I haven't looked deeply into it. This is with the Vulkan backend on Linux, which I understand should be roughly comparable to CUDA? The GPU is an RTX 3060 (Ti?).

I should play a bit more with the llama.cpp options and see what happened there. Thanks!

superkuh 21 hours ago | root | parent

I've had it happen in the past with llama.cpp on Linux that the CPU presents itself as a Vulkan device (GPU1, with "PHYSICAL_DEVICE_TYPE_CPU") and causes a mix-up. You might want to try llama-server --list-devices and then append --device Vulkan0 or whatever.
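A minimal sketch of that check as a script, assuming llama-server is on PATH. The helper names, the model path, and the "Vulkan0" device name are hypothetical placeholders; the real device name should be read from the --list-devices output on your machine. The -ngl 99 flag is an added assumption to request offloading all layers to the chosen device.

```python
import shutil
import subprocess


def list_devices(binary="llama-server"):
    """Return the raw --list-devices output, or None if the binary is not on PATH."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--list-devices"], capture_output=True, text=True, check=False
    )
    # Return both streams unparsed; read the device names by eye.
    return result.stdout + result.stderr


def server_command(model_path, device, binary="llama-server"):
    """Build a command line that pins llama-server to a single device."""
    return [binary, "-m", model_path, "--device", device, "-ngl", "99"]


if __name__ == "__main__":
    output = list_devices()
    print(output if output is not None else "llama-server not found on PATH")
    # Hypothetical model path and device name; substitute your own.
    print(" ".join(server_command("model.gguf", "Vulkan0")))
```

If the listing shows a CPU-type Vulkan device alongside the real GPU, passing only the GPU's name to --device should keep the CPU device out of the mix.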
