Ask HN: What are some good/fast coding models for Apple Silicon?

2•LoganDark•1h ago

I have an M4 Max with 128 GB of unified memory, and I thought it would be easy to reach decent inference speeds with it. After a few failed attempts to exceed about 150 t/s with completely custom Metal inference engines tailor-built by Claude, I'm stumped.

I'm not really sure how to make this hardware usable. I can only really afford DeepSeek levels of pricing right now, but DeepSeek is slow and I'm itching for something faster. Up until now I've had a $200 per month Claude subscription, and Claude has been great, but the recent revocation of Fable 5 has me worried about losing access to whatever hosted model I choose to rely on. I also can't afford another month of Max 20x, and the lower Claude plans aren't usable for me, so DeepSeek will be pretty much my only option once this subscription period lapses.

I want to figure out how to run something locally, but I don't want the speed to have to be even slower as a result. I've tried a few models already, and:

- Custom Qwen3-Coder-Next inference reaches about 120 t/s, outperforming llama.cpp Q4_0 (70.9 t/s) and MLX 4-bit (80.6 t/s), but that's still not really worth it

- Custom RWKV7-G1 inference reaches around 20,000 t/s prefill and 1000 t/s generation with the 0.1B model, then pretty much falls over with the larger models: 1.5B already drops all the way down to 140 t/s generation, so I'm not even going to bother getting 13.3B numbers

- Custom Qwen3.6-35B inference reaches around 250 t/s prefill and 85 t/s generation at 4-bit quantization

Each one of these was aggressively optimized with many detailed profiling passes to maximize GPU usage, minimize latency and eliminate dispatch overhead. (I started with Rust Burn, but eventually hit CubeCL's high latencies and moved to Swift + Metal)

It feels like everything I try degrades to about the same level, 80 to 120 t/s, once the model has any usable number of active parameters. It feels like some sort of wall and it's really frustrating. I don't have another $7000 to drop on a brand new M5 Max to get the performance I need, and that's even assuming matrix multiplications are the bottleneck (it's starting to seem like memory bandwidth is).
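
A rough way to sanity-check the memory-bandwidth theory: during generation, each token has to read every active weight at least once, so bandwidth divided by the bytes of active weights gives a ceiling on t/s. The sketch below is only back-of-envelope arithmetic. The function names are made up for illustration, the 546 GB/s figure is an assumption (Apple's quoted peak for the top M4 Max configuration, worth checking for your own chip), and 3B active parameters is a hypothetical placeholder, not a measured property of any model above. It ignores KV cache reads, quantization scales, and dispatch overhead, so real numbers will land below it.

```python
def decode_ceiling_tps(bandwidth_gb_s: float,
                       active_params_b: float,
                       bits_per_weight: float) -> float:
    """Upper bound on generation speed (tokens/s), assuming every active
    weight is read from memory exactly once per generated token."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


def max_active_params_b(bandwidth_gb_s: float,
                        target_tps: float,
                        bits_per_weight: float) -> float:
    """Largest active-parameter count (in billions) that could still hit
    target_tps under the same bandwidth-bound assumption."""
    bytes_per_token = bandwidth_gb_s * 1e9 / target_tps
    return bytes_per_token / (bits_per_weight / 8) / 1e9


if __name__ == "__main__":
    BANDWIDTH_GB_S = 546.0  # assumption: quoted peak, not a measured value
    ACTIVE_PARAMS_B = 3.0   # hypothetical active-parameter count
    BITS = 4.0              # plain 4-bit weights, scales ignored

    print(decode_ceiling_tps(BANDWIDTH_GB_S, ACTIVE_PARAMS_B, BITS))
    print(max_active_params_b(BANDWIDTH_GB_S, 200.0, BITS))
```

Under those assumptions the ceiling for 3B active parameters at 4 bits is 364 t/s, and a 200 t/s target leaves room for at most about 5.5B active parameters' worth of weight reads per token. Comparing a measured rate against the ceiling gives a rough idea of how much of the gap is bandwidth and how much is everything else.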

Are there any competent models that could run at a usable speed on my hardware? I'm looking for at least 200 t/s while being able to reason and call tools. Cerebras offers gpt-oss-120b at over 1000 t/s, but it's so expensive and also isn't able to properly call tools most of the time.

Comments

fsuts•1h ago

80-120 t/s is high; 40 is more average.

As you have 128 GB of RAM, just keep trying the biggest models that fit.
