What's the tok/s you get these days? Does it actually work well when you use more of that context?
By the way, it's been a long time since I last saw your username. You're the guy who launched Neovim! Boy, what a success. It definitely had the best outcome of any Kickstarter/Bountysource I've been a tiny part of. I use it every day.
> What's the tok/s you get these days?
I ran llama-bench a couple of weeks ago when there was a big speed improvement on llama.cpp (https://github.com/ggml-org/llama.cpp/pull/20361#issuecommen...):
% llama-bench -m ~/ml-models/huggingface/ubergarm/Qwen3.5-397B-A17B-GGUF/smol-IQ2_XS/Qwen3.5-397B-A17B-smol-IQ2_XS-00001-of-00004.gguf -fa 1 -t 1 -ngl 99 -b 2048 -ub 2048 -d 0,10000,20000,30000,40000,50000,60000,70000,80000,90000,100000,150000,200000,250000
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.008 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name: MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: has tensor = false
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 134217.73 MB
| model                          |       size |     params | backend    | threads | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -------: | -: | --------------: | -------------------: |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 | 189.67 ± 1.98 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 | 19.98 ± 0.01 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d10000 | 168.92 ± 0.55 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d10000 | 18.93 ± 0.02 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d20000 | 152.42 ± 0.22 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d20000 | 17.87 ± 0.01 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d30000 | 139.37 ± 0.28 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d30000 | 17.12 ± 0.01 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d40000 | 128.38 ± 0.33 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d40000 | 16.38 ± 0.00 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d50000 | 118.07 ± 0.55 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d50000 | 15.66 ± 0.00 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d60000 | 108.44 ± 0.38 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d60000 | 14.98 ± 0.01 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d70000 | 98.85 ± 0.18 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d70000 | 14.36 ± 0.00 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d80000 | 91.39 ± 0.49 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d80000 | 13.84 ± 0.00 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d90000 | 85.76 ± 0.24 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d90000 | 13.30 ± 0.00 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d100000 | 80.19 ± 0.83 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d100000 | 12.82 ± 0.00 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d150000 | 54.46 ± 0.33 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d150000 | 10.17 ± 0.09 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d200000 | 47.05 ± 0.15 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d200000 | 9.04 ± 0.02 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d250000 | 40.71 ± 0.26 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d250000 | 8.01 ± 0.02 |
build: d28961d81 (8299)
So it starts at 20 tps tg and 190 tps pp with empty context and ends at 8 tps tg and 40 tps pp with 250k prefill.
I suspect that there are still a lot of optimizations to be implemented for Qwen 3.5 on llama.cpp, and I wouldn't be surprised to reach 25 tps in a few months.
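To put the slowdown in relative terms, here is a rough Python sketch that takes a handful of rows from the table above and computes how much of the empty-context speed is left at each depth. The names `RESULTS`, `retention` and `seconds_to_generate` are made up for illustration, and the numbers are just copied from the llama-bench output, so treat it as back-of-the-envelope arithmetic rather than a benchmark tool.

```python
# (depth, pp512 t/s, tg128 t/s), copied by hand from the llama-bench table.
# Only a subset of the rows is included to keep the sketch short.
RESULTS = [
    (0, 189.67, 19.98),
    (50_000, 118.07, 15.66),
    (100_000, 80.19, 12.82),
    (150_000, 54.46, 10.17),
    (200_000, 47.05, 9.04),
    (250_000, 40.71, 8.01),
]


def retention(results):
    """Return (depth, pp fraction, tg fraction) relative to the first row."""
    _, pp0, tg0 = results[0]
    return [(depth, pp / pp0, tg / tg0) for depth, pp, tg in results]


def seconds_to_generate(n_tokens, tg_tps):
    """Time to generate n_tokens at a constant generation speed (t/s)."""
    return n_tokens / tg_tps


if __name__ == "__main__":
    for depth, pp_frac, tg_frac in retention(RESULTS):
        print(f"d={depth:>7}: pp {pp_frac:6.1%}  tg {tg_frac:6.1%}")
```

By this arithmetic, token generation at 250k depth keeps roughly 40% of its empty-context speed, while prompt processing keeps a bit over 20%. `seconds_to_generate` assumes a constant speed, which is only an approximation because generation slows down as the context grows.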
> You're the guy who launched Neovim!
That's me ;D
> I use it every day.
So do I for the past 12 years! Though I admit in the past year I greatly reduced the amount of code I write by hand :/