Show HN: Orion – Training LLMs Natively on the Apple Neural Engine Without CoreML

2•mechramc•2h ago

Comments

mechramc•2h ago

Hi HN, it is hard to communicate how frustratingly opaque Apple's hardware stack can be. Everyone targets the Mac's GPU for local models, but there is a dedicated accelerator, the Apple Neural Engine (ANE), sitting completely dark for LLM workloads. CoreML treats it as a black-box scheduler, stripping away any direct control or ability to train.

There are a few real caveats here, but imo the fundamental constraint on using the ANE hasn't been compute; it's been the complete lack of a native orchestration layer. Building on incredible foundational reverse-engineering by maderix (who mapped the private ANEClient/ANECompiler APIs and discovered the ~19 TFLOPS fp16 ceiling), I wanted to see if we could bridge the gap from a raw hardware exploit to a stable runtime. I just open-sourced Orion: an end-to-end system that bypasses CoreML entirely to run and train LLMs directly on the ANE.

Just to be concrete about what this took to build: my day-to-day is in enterprise systems orchestration, not writing low-level Objective-C kernels. I approached this entire build as an exercise in architectural delegation, using Claude to rapidly generate the code while I managed the system state, debugged the hardware limits, and held the structural vision.

What we ran into was a wall of undocumented silicon behavior, what I'll call the hardware impedance mismatch. We cataloged 17 programming constraints in total, 11 of which were completely undocumented.

For example:

• The concat operation causes an immediate, silent compiler failure.

• BLOBFILE weights require a bizarre 64-byte offset from the chunk header, or you get silent numerical corruption.

• The ANE maintains internal state that hard-caps you at ~119 compilations per process.

Previous attempts at ANE training (like ANEgpt) hit a wall of NaN divergence after a single step. We solved this by wiring up a deferred compilation pipeline and implementing strict activation clamping to stop the fp16 overflow cascade (clamping activations to [-65504, 65504]). To bypass the 119-compilation limit, I used an exec() process restart loop after every training step.
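The restart loop described above can be sketched as follows. This is a minimal illustration in Python, not Orion's code (the repo is an Objective-C runtime): the checkpoint file, `train_one_step`, and every function name here are hypothetical. The idea is that each process runs one step, persists its progress, and then replaces itself with a fresh process image so that any per-process compilation counter starts from zero again.

```python
import json
import os
import sys


def load_step(path):
    """Return the next step index, or 0 when no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)["step"]
    except FileNotFoundError:
        return 0


def save_step(path, step):
    """Persist progress so the next process image can resume from it."""
    with open(path, "w") as f:
        json.dump({"step": step}, f)


def train_one_step(step):
    # Hypothetical placeholder for compile + forward/backward + weight update.
    print(f"pid {os.getpid()}: training step {step}")


def reexec_self():
    # Replace the current process image with a fresh interpreter running the
    # same script. On success this call never returns.
    os.execv(sys.executable, [sys.executable] + sys.argv)


def run_one_step_then_restart(path, total_steps, restart=reexec_self):
    """Run a single step, checkpoint it, then hand over to a fresh process."""
    step = load_step(path)
    if step >= total_steps:
        return False  # training finished, nothing left to restart for
    train_one_step(step)
    save_step(path, step + 1)
    restart()  # with the default restart, execution stops here
    return True
```

Calling `run_one_step_then_restart("state.json", 1000)` at the bottom of a script would walk through all steps one process image at a time. The `restart` parameter exists only so the loop can be exercised without actually re-executing; everything that must survive the restart (weights, optimizer state, step counter) has to live on disk.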

The leverage here is real. Orion currently hits 170+ tokens/s for GPT-2 124M decode, and more importantly, achieves mechanically stable multi-step training on a 110M-parameter transformer (loss dropping from 12.3 to 6.2 over 1,000 steps with zero NaNs).
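The zero-NaN result rests on the activation clamping mentioned earlier. As a rough sketch of what clamping to the fp16 range means (plain Python, not Orion's kernels; the function names are made up for illustration), 65504 is the largest finite IEEE 754 half-precision value, so anything beyond it would otherwise become infinity and can turn into NaN in later operations:

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 binary16 value


def clamp_fp16(x):
    """Clamp a value into the finite fp16 range [-65504, 65504]."""
    return max(-FP16_MAX, min(FP16_MAX, x))


def round_to_fp16(x):
    """Round-trip through half precision ('e' is struct's binary16 format)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def clamp_activations(values):
    """Clamp first, then store as fp16, so no element can overflow to inf."""
    return [round_to_fp16(clamp_fp16(v)) for v in values]
```

For example, `clamp_activations([1e6, -1e6, 0.5])` yields values that all stay finite in fp16. This sketch does not treat NaN inputs specially; it only prevents finite values from overflowing.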

It’s not entirely clean yet. The ANE bakes weights at compile time, meaning every training update incurs a ~4.2s recompilation penalty. But imo, extracting raw, zero-idle-power throughput directly from Apple's silicon isn't just a benchmark iteration; it is a layer change for local, always-on AI.

The repo (Objective-C runtime, 5-pass graph compiler, no Python orchestration) is up. I’d love to know what the systems engineers here think about the constraint catalog or potential weight-patching workarounds.
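For a sense of scale on the recompilation penalty, here is a back-of-the-envelope calculation. It assumes exactly one ~4.2s recompilation per training step and ignores everything else in the step, so it is a floor on wall-clock time rather than a measurement:

```python
RECOMPILE_SECONDS = 4.2  # per-update recompilation cost quoted in the post
STEPS = 1000             # length of the training run quoted in the post


def recompile_overhead_seconds(steps, seconds_per_recompile=RECOMPILE_SECONDS):
    """Total time spent recompiling, assuming one recompile per step."""
    return steps * seconds_per_recompile


total = recompile_overhead_seconds(STEPS)
print(f"recompilation alone: {total:.0f} s, about {total / 60:.0f} min")
```

Under that assumption the 1,000-step run spends about 4,200 seconds, roughly 70 minutes, on recompilation alone, which is why weight-patching workarounds matter.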
