>consumer-grade card with 12G of VRAM and got 5t/s
That speed for token output indicates to me that it somehow is using hybrid mode and involving cpu+system ram somehow. That ~5tk/s is about the ram bandwidth of DDR4 RAM versus that size model at 4bit. Any consumer GPU with 12 GB like a nvidia rtx 2080 or rtx 3060 should be doing 20+ tk/s with llama.cpp and CUDA backend.
loading story #48404206
Good catch. I haven't looked deeply into it. This is with Vulkan backend on Linux which I understand should be roughly comparable to CUDA? Gfx is rtx 3060(ti?).
I should play a bit more with llama.cpp options and see what bappened there. Thanks!
I've had it happen in the past with llama.cpp on linux that the CPU will present itself as a vulkan device GPU1 with "PHYSICAL_DEVICE_TYPE_CPU" and had a mix-up. Might want to try llama-server --list-devices and then append --device Vulkan0 or whatever.