Maxious•1h ago
With the Qwen3.5 35B A3B at Q4 I've got 200k context running at 62.98 tokens per second on a local RTX 5080 16GB.
jychang•1h ago
Not really breakthroughs, more like bugfixes for their broken first batch.
Kayou•1h ago
Wait, the Q4 quantization, which is more than 20GB, fits on your 16GB GPU? I didn't know that was possible. I've always restricted myself to models smaller than the VRAM I had.
segmondy•1h ago
llama.cpp is designed for partial offloading: the most important parts of the model are loaded onto the GPU and the rest stays in system RAM. I run 500B+ models such as DeepSeek/KimiK2.5/GLM-5 without having anywhere near that much GPU VRAM.
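To make the split concrete, here is a rough back-of-the-envelope sketch in Python. It is not how llama.cpp decides anything internally. It assumes every layer is the same size, and the layer count (40) and the 4GB reserved for KV cache and buffers are made-up numbers for illustration; only the 20GB model and 16GB card come from the comments above.

```python
def split_layers(model_gb, n_layers, vram_gb, reserve_gb):
    """Estimate how many layers fit on the GPU, assuming uniform layer size.

    reserve_gb is VRAM held back for KV cache and scratch buffers
    (a hypothetical figure, tune it for your context length).
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    gpu_layers = min(n_layers, int(usable_gb // per_layer_gb))
    return gpu_layers, n_layers - gpu_layers


# Hypothetical numbers: 20GB quant, 40 layers, 16GB card, 4GB reserved.
gpu_layers, cpu_layers = split_layers(20.0, 40, 16.0, 4.0)
print(f"GPU layers: {gpu_layers}, CPU layers: {cpu_layers}")
```

The GPU layer count is the kind of value you would pass to llama.cpp's `--n-gpu-layers` option. For MoE models the real tradeoff is less uniform than this, since which tensors stay on the GPU matters more than how many layers do.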
Havoc•1h ago
Advances in this space are always welcome.
I see the change in KLD values is pretty modest vs. the prior version. Does anyone know how that translates to real-world quality? Is it more of a linear relationship, or exponential, etc.?
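For anyone unfamiliar with the metric, KLD here is the KL divergence between the next-token distribution of the reference model and that of the quantized one. A minimal sketch of the calculation for a single token position, in plain Python with invented logits over a three-token vocabulary (not taken from any real benchmark):

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities, shifted by the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; p is the reference, q the quantized model."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Invented logits for one token position.
ref_logits = [2.0, 1.0, 0.1]
quant_logits = [1.9, 1.1, 0.1]

kld = kl_divergence(softmax(ref_logits), softmax(quant_logits))
print(f"KLD: {kld:.5f} nats")
```

Reported figures are typically this value averaged over many token positions in a test corpus. That tells you how far the output distribution moved, but it doesn't by itself say how task quality scales with it.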
jychang•1h ago
What's up with this post? It's a link to something which has existed for a long time, and there's a bunch of dead comments below. Some weird SEO campaign thing?
tosh•1h ago
Unsloth have just released benchmarks on how their dynamic quants perform for Qwen3.5.