I'm not really sure how to make this hardware usable. I can only really afford DeepSeek levels of pricing right now, but DeepSeek is slow and I'm itching for something faster. Up until now I've had a $200 per month Claude subscription, and Claude has been great, but the recent revocation of Fable 5 suddenly has me worried about losing access to whatever hosted model I choose to rely on. I also can't afford another month of Max 20x, and the lower Claude plans aren't usable for me, so DeepSeek will be pretty much my only option once this subscription period lapses.
I want to figure out how to run something locally, but not if that means even slower speeds. I've tried a few models already:
- Custom Qwen3-Coder-Next inference reaches about 120 t/s, ahead of llama.cpp Q4_0 (70.9 t/s) and MLX 4-bit (80.6 t/s), but that's still not really worth it
- Custom RWKV7-G1 inference reaches about 20,000 t/s prefill and 1000 t/s generation with the 0.1B model, then pretty much falls over with the larger models: 1.5B already drops all the way down to 140 t/s generation, so I'm not even going to bother getting 13.3B numbers
- Custom Qwen3.6-35B inference reaches around 250 t/s prefill and 85 t/s generation at 4-bit quantization
Each of these was aggressively optimized with many detailed profiling passes to maximize GPU usage, minimize latency, and eliminate dispatch overhead. (I started with Rust Burn, but eventually hit CubeCL's high latencies and moved to Swift + Metal.)
It feels like everything I try degrades to about the same level, 80 to 120 t/s, once the model has any usable number of active parameters. It feels like some sort of wall and it's really frustrating. I don't have another $7000 to drop on a brand new M5 Max to get the performance I need, and that assumes matrix multiplications are even the bottleneck (it's starting to seem like memory bandwidth is).
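One rough way to sanity-check the memory-bandwidth suspicion is a back-of-envelope ceiling: if each generated token has to stream every active weight from memory once, generation speed cannot exceed bandwidth divided by the bytes of active weights. The sketch below assumes exactly that and nothing more. The 400 GB/s bandwidth and 3B active parameters are hypothetical placeholders, not figures for my hardware or for any of the models above, and the estimate ignores KV-cache reads, activations, and quantization scale overhead, so real throughput should land below it.

```python
def weight_bytes_per_token(active_params: float, bits_per_weight: float) -> float:
    """Bytes of weights streamed per generated token (each active weight read once)."""
    return active_params * bits_per_weight / 8.0


def max_decode_tps(bandwidth_bytes_per_s: float,
                   active_params: float,
                   bits_per_weight: float) -> float:
    """Upper bound on generation speed if decoding is purely bandwidth-bound."""
    return bandwidth_bytes_per_s / weight_bytes_per_token(active_params, bits_per_weight)


def implied_bandwidth(measured_tps: float,
                      active_params: float,
                      bits_per_weight: float) -> float:
    """Weight traffic in bytes/s that a measured generation speed corresponds to."""
    return measured_tps * weight_bytes_per_token(active_params, bits_per_weight)


if __name__ == "__main__":
    # HYPOTHETICAL inputs: substitute your machine's bandwidth and the
    # model's real active-parameter count and quantization width.
    bandwidth = 400e9      # bytes/s, assumed
    active_params = 3e9    # active parameters per token, assumed
    bits = 4               # nominal bits per weight, ignoring scale overhead

    ceiling = max_decode_tps(bandwidth, active_params, bits)
    used = implied_bandwidth(120, active_params, bits)
    print(f"bandwidth-bound ceiling: {ceiling:.0f} t/s")
    print(f"weight traffic at 120 t/s: {used / 1e9:.0f} GB/s")
```

If the traffic implied by a measured speed is already close to the machine's real bandwidth, faster matrix multiplication will not help much; if it is far below, something other than weight streaming is probably eating the time.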
Are there any competent models that could run at a usable speed on my hardware? I'm looking for at least 200 t/s while being able to reason and call tools. Cerebras offers gpt-oss-120b at over 1000 t/s, but it's expensive and fails to call tools properly most of the time.
fsuts•1h ago
Since you have 128 GB of RAM, just keep trying the biggest models that fit.