Tiny-LLM – a course of serving LLM on Apple Silicon for systems engineers

152•sarkory•8h ago

Comments

pj_mukh•4h ago

Super cool, and will definitely check it out.

But as a measure of what you can achieve with a course like this: does anyone know what the max tok/s vs iPhone model plot looks like, and how does MLX change that plot?

simonw•3h ago

MLX is worth paying attention to. It's still pretty young (just over a year old) but the amount of activity in that ecosystem is really impressive, and it's quickly becoming the best way to run LLMs (and vision LLMs and increasingly audio models) on a Mac.

Here's a fun way to start interacting with it (this loads and runs Llama 3.2 3B in a terminal chat UI):

  uv run --isolated --with mlx-lm python -m mlx_lm.chat

esafak•3h ago

https://ml-explore.github.io/mlx/

masto•36m ago

Ran it and it crapped out with a huge backtrace. I spotted `./build_bundled.sh: line 21: cmake: command not found` in it, so I guessed I needed cmake installed. I ran `brew install cmake` and tried again. Then it crapped out with `Compatibility with CMake < 3.5 has been removed from CMake.` Then I gave up.

This is typical of what happens any time I try to run something written in Python. It may be easier than setting up an NVIDIA GPU, but that's a low bar.

simonw•19m ago

Which Python version was that? Could be that MLX have binary wheels for some versions but not others.

masto•15m ago

Adding `-p 3.12` made it work. Leaving that here in case it helps someone.
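
For anyone hitting the same backtrace, the fix above presumably amounts to the original command with the interpreter pinned (`-p` is uv's short form of `--python`). This is a minimal sketch, not a confirmed diagnosis: the variable names and the `RUN` switch are made up for illustration, and the idea that other Python versions fall back to a source build is only the guess raised earlier in this thread.

```shell
# Python version reported to work in this thread; other versions may
# trigger a source build that needs cmake (unconfirmed guess).
PY_VERSION="3.12"

# Same command as the one posted above, with the interpreter pinned.
CHAT_CMD="uv run --isolated -p $PY_VERSION --with mlx-lm python -m mlx_lm.chat"

# Print the command first; set RUN=1 to actually start the chat UI.
echo "$CHAT_CMD"
if [ "${RUN:-0}" = "1" ]; then
  $CHAT_CMD
fi
```

Run as-is it only prints the command; `RUN=1` in the environment launches the chat, which downloads the model on first use.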

fsiefken•3h ago

That's great. Like the Ryzen AI Max+ 395, Apple Silicon chips are also more energy efficient for LLMs (or gaming) than NVIDIA.

For 4-bit deepseek-r1-distill-llama-70b on a MacBook Pro M4 Max with the MLX version on LM Studio: 10.2 tok/sec on power and 4.2 tok/sec on battery / low power

For 4-bit gemma-3-27b-it-qat I get: 26.37 tok/sec on power and 9.7 tok/sec on battery / low power

It'd be nice to know all the possible power tweaks to get those numbers higher, and to get additional insight into how LLMs work and interact with the CPU and memory.

nico•3h ago

Thank you for the numbers

What have you used those models for, and how would you rate them in those tasks?

realo•1h ago

RPG prompts work very, very well with many of the models, but not the reasoning ones, because they end up thinking endlessly about how to be the absolute best game master possible...

nico•42m ago

Great use case. And very funny situation with the reasoning models! :)

robbru•3h ago

TinyLLM is very cool to see! I will def tinker with it. I've been using the MLX format for local LLMs as of late. Kinda amazing to see these models become cheaper and faster. Check out the MLX community on Hugging Face: https://huggingface.co/mlx-community

nico•3h ago

Great recommendation about the community

Any other resources like that you could share?

Also, what kind of models do you run with mlx and what do you use them for?

Lately I’ve been pretty happy with gemma3:12b for a wide range of things (generating stories, some light coding, image recognition). Sometimes I’ve been surprised by qwen2.5-coder:32b. And I’m really impressed by the speed and versatility, at such tiny size, of qwen2.5:0.5b (playing with fine tuning it to see if I can get it to generate some decent conversations roleplaying as a character)

simonw•2h ago

I've shared a bunch of notes on MLX over the past year, many of them with snippets of code I've used to try out models: https://simonwillison.net/tags/mlx/

I mainly use MLX for LLMs (with https://github.com/ml-explore/mlx-lm and my own https://github.com/simonw/llm-mlx which wraps that), vision LLMs (via https://github.com/Blaizzy/mlx-vlm) and running Whisper (https://github.com/ml-explore/mlx-examples/tree/main/whisper)

I haven't tried mlx-audio yet (which can synthesize speech) but it looks interesting too: https://github.com/Blaizzy/mlx-audio

The two best people to follow for MLX stuff are Apple's Awni Hannun - https://twitter.com/awnihannun and https://github.com/awni - and community member Prince Canuma who's responsible for both mlx-vlm and mlx-audio: https://twitter.com/Prince_Canuma and https://github.com/Blaizzy

nico•2h ago

Amazing. Thank you for the great resources!

gitroom•2h ago

dang, i've been messing with mlx too and it's blowing my mind how quick this stuff is getting on macs. feels like something's changing every time i blink
