Ask HN: What are some good/fast coding models for Apple Silicon?

2•LoganDark•1h ago
I have an M4 Max with 128 GB of unified memory, and I thought it would be easy to reach decent inference speeds with it. After a few failed attempts to exceed about 150 t/s with completely custom Metal inference engines tailor-built by Claude, I'm stumped.

I'm not really sure how to make this hardware usable. I can only really afford DeepSeek-level pricing right now, but DeepSeek is slow and I'm itching for something faster. Up until now I've had a $200-per-month Claude subscription, and Claude has been great, but the recent revocation of Fable 5 has me worried about losing access to whatever hosted model I choose to rely on. I also can't afford another month of Max 20x, and the lower Claude plans aren't usable for me, so DeepSeek will be pretty much my only option once this subscription period lapses.

I want to figure out how to run something locally, but I don't want it to end up even slower as a result. I've tried a few models already:

- Custom Qwen3-Coder-Next inference reaches about 120 t/s, outperforming llama.cpp Q4_0 (70.9 t/s) and MLX 4-bit (80.6 t/s), but that's still not really worth it

- Custom RWKV7-G1 inference reaches roughly 20,000 t/s prefill and 1000 t/s generation with the 0.1B model, then pretty much falls over with the larger models. The 1.5B model already drops all the way down to 140 t/s generation, so I'm not even going to bother getting 13.3B numbers

- Custom Qwen3.6-35B inference reaches around 250 t/s prefill and 85 t/s generation at 4-bit quantization

Each one of these was aggressively optimized with many detailed profiling passes to maximize GPU usage, minimize latency and eliminate dispatch overhead. (I started with Rust Burn, but eventually hit CubeCL's high latencies and moved to Swift + Metal)

It feels like everything I try degrades to about the same level, 80 to 120 t/s, once the model has any usable number of active parameters. It feels like some sort of wall, and it's really frustrating. I don't have another $7000 to drop on a brand new M5 Max to get the performance I need, even assuming matrix multiplications are the bottleneck (it's starting to seem like memory bandwidth is).
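
One way to sanity-check the memory-bandwidth suspicion is a back-of-envelope ceiling: if every active weight has to be read from memory once per generated token, generation speed cannot exceed bandwidth divided by bytes read per token. The sketch below is only that estimate. The 546 GB/s figure is an assumption (Apple's published peak for the higher-binned M4 Max, which real kernels will not fully reach), the active-parameter counts are illustrative inputs rather than specs of any model named above, and KV-cache reads, activations, and quantization scales are ignored.

```python
def decode_ceiling_tps(bandwidth_gb_s: float,
                       active_params_b: float,
                       bits_per_weight: float) -> float:
    """Upper bound on generation tokens/s when each active weight is
    read from memory exactly once per token (weights only)."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


def max_active_params_b(bandwidth_gb_s: float,
                        target_tps: float,
                        bits_per_weight: float) -> float:
    """Largest active-parameter count (in billions) that could still
    hit target_tps under the same weights-only assumption."""
    bytes_per_token = bandwidth_gb_s * 1e9 / target_tps
    return bytes_per_token * 8 / bits_per_weight / 1e9


# Assumption: published peak bandwidth, not a measured figure.
BANDWIDTH_GB_S = 546.0

# Illustrative active-parameter counts (billions), not model specs.
for active_b in (1.5, 3.0, 13.3, 35.0):
    ceiling = decode_ceiling_tps(BANDWIDTH_GB_S, active_b, 4.0)
    print(f"{active_b:>5} B active @ 4-bit: at most {ceiling:.0f} t/s")

budget = max_active_params_b(BANDWIDTH_GB_S, 200.0, 4.0)
print(f"200 t/s @ 4-bit needs at most {budget:.2f} B active parameters")
```

Under these assumptions, 200 t/s leaves a budget of about 2.73 GB of weight reads per token, so measured speeds that sit well below the ceiling for a given model point at overhead other than raw bandwidth, while speeds near the ceiling point at the hardware limit.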

Are there any competent models that could run at a usable speed on my hardware? I'm looking for at least 200 t/s while being able to reason and call tools. Cerebras offers gpt-oss-120b at over 1000 t/s, but it's so expensive and also isn't able to properly call tools most of the time.

Comments

fsuts•1h ago
80-120 t/s is high; 40 is more typical.

Since you have 128 GB of RAM, just keep trying the biggest models that fit.
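
A rough way to decide what "fits" is to estimate the weight footprint from total parameter count and quantization width, then leave headroom for the OS, the KV cache, and the runtime. This is a minimal sketch under stated assumptions: the 24 GB headroom is a hypothetical placeholder to tune, and the parameter counts in the loop are illustrative sizes, not recommendations of specific models.

```python
def weight_footprint_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params_b * bits_per_weight / 8


def fits(total_params_b: float,
         bits_per_weight: float,
         memory_gb: float = 128.0,
         headroom_gb: float = 24.0) -> bool:
    """True if the weights fit in memory_gb after reserving headroom_gb.
    headroom_gb is an assumed allowance, not a measured requirement."""
    return weight_footprint_gb(total_params_b, bits_per_weight) <= memory_gb - headroom_gb


# Illustrative total-parameter counts in billions.
for total_b in (35, 80, 120, 235):
    size = weight_footprint_gb(total_b, 4)
    verdict = "fits" if fits(total_b, 4) else "too big"
    print(f"{total_b:>4} B total @ 4-bit: about {size:.1f} GB, {verdict}")
```

For mixture-of-experts models the footprint follows the total parameter count while generation speed follows the active count, so a model can fit comfortably and still be slow, or the reverse.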
