Build your own Siri locally and on-device

https://thehyperplane.substack.com/p/build-your-own-siri-locally-on-device

185•andreeamiclaus•9mo ago

Comments

caust1c•9mo ago

So build your own crappy agent-assistant?

In earnest though, I'm certain we'll see a community replacement of Siri by end-of-year if the iPhone permissions model allows it or there's some workaround. IDK what the limitations are here but I'm eagerly awaiting the community to step in where Siri has failed.

Ancapistani•9mo ago

The assistant is only half the story here. This looks like a great, well-defined tutorial project to learn how to put this stuff together locally.

andrewmcwatters•9mo ago

Crappy? Dude, Siri at one point couldn't even tell you what today's date was. The bar is on the ground.

4ndrewl•9mo ago

Think Different

0cf8612b2e1e•9mo ago

I do not know if it is because I have been trained to make simple requests, but there are only a half dozen things I would verbally ask a robot.

- time of day

- calendar date

- weather

- set a timer

- simple math calculation

That’s 90% of the functionality right there.
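
For anyone who wants to see how small that core is, here is a minimal sketch of a rule-based router for most of the requests listed above. Everything in it is an assumption made for illustration: the `route` function, the regex phrasings, and the reply strings are hypothetical, and weather is left out because it needs a network call. A `None` result is the point where a local LLM would take over.

```python
import ast
import operator
import re
from datetime import datetime

# Arithmetic operators the calculator handler accepts.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

_UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600}


def _eval_node(node):
    """Evaluate a parsed arithmetic expression without calling eval()."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval_node(node.operand)
    raise ValueError("unsupported expression")


def _time(match, now):
    return now.strftime("%H:%M")


def _date(match, now):
    return now.strftime("%Y-%m-%d")


def _timer(match, now):
    seconds = int(match.group("n")) * _UNIT_SECONDS[match.group("unit")]
    return f"timer set for {seconds} seconds"


def _math(match, now):
    try:
        tree = ast.parse(match.group("expr").strip(), mode="eval")
        return str(_eval_node(tree.body))
    except (ValueError, SyntaxError, ZeroDivisionError):
        return None  # not something we can compute; let the fallback try


# (pattern, handler) pairs, tried in order. The phrasings are illustrative only.
_HANDLERS = [
    (re.compile(r"\bwhat time is it\b"), _time),
    (re.compile(r"\bwhat(?:'s| is) (?:the|today's) date\b"), _date),
    (re.compile(r"\bset a timer for (?P<n>\d+) (?P<unit>second|minute|hour)s?\b"), _timer),
    (re.compile(r"\bwhat(?:'s| is) (?P<expr>[\d\s.+\-*/()]+)\??$"), _math),
]


def route(text, now=None):
    """Return a reply for a recognised request, or None to fall back to an LLM."""
    now = now or datetime.now()
    text = text.lower().strip()
    for pattern, handler in _HANDLERS:
        match = pattern.search(text)
        if match:
            reply = handler(match, now)
            if reply is not None:
                return reply
    return None
```

Calling `route("Set a timer for 5 minutes")` returns a canned reply, while `route("play some jazz")` returns `None` and would be handed to the model.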

andrewmcwatters•9mo ago

Funny you mention this. My usage is the same. I suspect we’ve all been trained to expect little-to-nothing from these assistants.

jihadjihad•9mo ago

And incredibly it manages to get those wrong a non-trivial amount of the time.

sureIy•9mo ago

I'm sorry, something went wrong with your "what time is it" request.

Oh you asked "what time is it in Miami?" Because I already cut you off after "what time is it".

This happens on a daily basis when I'm not talking right into my phone.
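
One plausible cause (an assumption, nothing here is known about how Siri actually decides you are done) is an end-of-speech timeout that is too short for a pause in the middle of a sentence. A toy sketch, where `speech_flags` is a hypothetical per-frame voice-activity result and `hangover` is the number of silent frames tolerated before cutting off:

```python
def endpoint(speech_flags, hangover):
    """Return how many frames are kept before the utterance is declared over.

    The utterance ends once `hangover` consecutive non-speech frames follow
    some speech. Leading silence before any speech is ignored.
    """
    started = False
    silence_run = 0
    last_speech = 0  # exclusive index just past the latest speech frame
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            started = True
            silence_run = 0
            last_speech = i + 1
        elif started:
            silence_run += 1
            if silence_run >= hangover:
                return last_speech
    return last_speech


# 100 ms frames: "what time is it", a 400 ms pause, "in Miami", then silence.
flags = [True] * 10 + [False] * 4 + [True] * 10 + [False] * 10

short = endpoint(flags, hangover=3)  # gives up during the pause
long = endpoint(flags, hangover=6)   # waits through the pause
```

With the short timeout only the first ten frames survive, so the assistant never hears "in Miami". The longer timeout keeps the whole request but makes every reply feel slower, which is the trade-off being tuned.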

trinix912•9mo ago

It's even better asking it to play a playlist I have made and downloaded in Apple Music only for it to say "I'm going to need your permission to access your Spotify data" and play something completely random.

Even saying something like "play the playlist ____ in Apple Music" doesn't help, it cuts the "in Apple Music" part off.

wkat4242•9mo ago

For me I ask a lot of things like "How do I say <xxx> in Spanish". It's better than a google translate because it's not quite as literal, it will translate to proper colloquialisms if necessary.

dang•9mo ago

> So build your own crappy agent-assistant?

"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something." - https://news.ycombinator.com/newsguidelines.html

(Your comment would be fine without that first bit.)

andreeamiclaus•9mo ago

A better one* Siri can't even tell me the weather right, this assistant is an elevated version that works first, then performs great second

thedeep_mind•9mo ago

This is great, thanks for putting this together.

Haven't followed it through yet, but does this model run successfully on an iPhone?

My 9 year old ran a Qwen 0.6B model using ollama quite well, anything else was too slow to offer a good UX.

SparkyMcUnicorn•9mo ago

MLC[0] indicates that it can run models in the 8B range on iOS, but 1-3B sounds more reasonable to me.

[0] https://llm.mlc.ai/docs/deploy/ios.html#bring-your-own-model

parpfish•9mo ago

Oh, a nine year old PHONE.

I was thinking there was a fourth grader out there deploying models when at that age I was still learning multiplication tables.

NetOpWibby•9mo ago

My son just turned 9 today so I was like, "Wow! I wonder if my kid would be interested in doing this?"

mrcwinn•9mo ago

Cool project and nice write-up!

mystified5016•9mo ago

Does Apple even allow you to replace Siri with another assistant? For the longest time on android, all non-Google assistants were crippled by not being able to listen in the background or use the assistant hardkey, gestures, or shortcuts. I'm not sure if the Google assistant still has privileges others don't, but I wouldn't be surprised in the least.

jedisct1•9mo ago

More or less. This is what Perplexity does.

bronco21016•9mo ago

I saw an article about this and downloaded the Perplexity app but I was unable to figure out if this was true? Do I need a paid tier? I just quickly worked through the free sign up and couldn't sort it out. The demo looked really slick. Is it worth pursuing?

matthewfcarlson•9mo ago

Part of the problem is the wake word “hey siri” is actually handled by a separate coprocessor (AOP) with the model compiled into the firmware. While anything is technically possible, it isn’t as simple as just letting the google app run in the background since the AP is asleep when any of these gestures happen. You could probably set up the action button on the side to open an assistant, but that’s going to be a less pleasant experience (app might not be open, etc).

Details are listed below

https://machinelearning.apple.com/research/hey-siri

kimixa•9mo ago

Same with android phones - a super-specific hardcoded phrase is much easier to work in the power budgets required for an "always on" part of the device.

It's why a manufacturer (like Samsung) can change that sort of thing on their devices, but it's not realistically something an end user (or even an app) can customize in software. It's not some "arbitrary" limitation.

smurpy•9mo ago

Back in 1992 or so the NeXT could distinguish (was it 16 or) 64 fixed, trained, phrases. Point being, it doesn’t take too much compute with a finite vocabulary.

trinix912•9mo ago

But wouldn't adding your own phrases require a reflash of parts of firmware in this context?

layer8•9mo ago

I think people would be fine with having to call it Siri if only they could replace the actual assistant.

wkat4242•9mo ago

There's open solutions for that like openwakeword and microwakeword (the latter can even run on an esp32!)

The training is a lot of work though and requires a lot of material. For Home Assistant's voice preview model they had tens of thousands of volunteers record the "okay nabu" wakeword and even still it doesn't work quite as well as hey siri on Apple devices.

HnUser12•9mo ago

You can now set up Vocal Shortcuts[1] which can be used to run any shortcut or action with almost any trigger word and without saying "Siri". However, I'm not certain if it can wake the device from sleep or not.

[1] https://support.apple.com/en-in/guide/iphone/iph7f242ea2c/io...

dangus•9mo ago

I presume you could pretty easily use the new-ish action button to run a custom shortcut that brings up an alternative assistant app.

catapart•9mo ago

Man, I'd really love it if this were just a product/app I could download and use a UI to configure/teach.

But this guide gives me what I need to make that, I think, so a big thank you for this!

worldsayshi•9mo ago

I love the idea and I would like to build something like this. But the few attempts I have made using Whisper locally have so far been underwhelming. Has anyone gotten results with small Whisper models that are good enough for a use case like this?

Maybe I've just had a bad microphone.

codethief•9mo ago

> Maybe I've just had a bad microphone.

Yeah, I would definitely double-check your setup. At work we use Whisper to live-transcribe-and-translate all-hands meetings and it works exceptionally well.

s3p•9mo ago

+1 this. Whisper works insanely well. I've been using the medium model as it has yet to mistranscribe anything noticeable, and it's very lightweight. I even converted it to a Core ML model so it runs accelerated on Apple silicon. It doesn't run *that* much faster than before.. but it ran really fast to begin with. For anyone tinkering, I've had much success with whisper.cpp.

azinman2•9mo ago

What was the process of converting it like? I assume you then had to write all of the inference code as well?

tough•9mo ago

not the gp but found this https://github.com/ggml-org/whisper.cpp/blob/master/models/c...

Grimblewald•8mo ago

I'd agree with your experience. I simply sit my phone (~200 dollar Motorola, cheap phone) in the centre of the room, split the voice file into chunks using voice prints/IDs I get from a voice embedding model I trained, then feed labelled chunks through Whisper, and get a nice transcript of everything said. I combine that with my handwritten notes (as image, get a VLM to transcribe) and the agenda, and I get out really nice meeting minutes as a LaTeX document. Works a charm and has turned an hour or two of work per meeting into maybe 30 minutes (proofing what was written).
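
A rough sketch of the chunk-then-transcribe step described above, not the commenter's actual code. It assumes the speaker turns already exist (the `Turn` records stand in for the output of a voice embedding model) and takes the speech-to-text call as a `transcribe(start, end)` callback, so Whisper or anything else can be plugged in. All names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    start: float   # seconds from the start of the recording
    end: float
    speaker: str   # label assigned by the (assumed) speaker-ID step


def merge_turns(turns: List[Turn], max_gap: float = 0.5) -> List[Turn]:
    """Join consecutive turns by the same speaker separated by at most max_gap seconds."""
    merged: List[Turn] = []
    for turn in sorted(turns, key=lambda t: t.start):
        prev = merged[-1] if merged else None
        if prev and prev.speaker == turn.speaker and turn.start - prev.end <= max_gap:
            merged[-1] = Turn(prev.start, max(prev.end, turn.end), prev.speaker)
        else:
            merged.append(turn)
    return merged


def labelled_transcript(turns: List[Turn],
                        transcribe: Callable[[float, float], str]) -> str:
    """Transcribe each merged chunk and prefix it with its time range and speaker."""
    lines = []
    for turn in merge_turns(turns):
        text = transcribe(turn.start, turn.end).strip()
        if text:  # skip chunks where the recogniser heard nothing
            lines.append(f"[{turn.start:.1f}-{turn.end:.1f}] {turn.speaker}: {text}")
    return "\n".join(lines)
```

Merging short same-speaker turns first gives the recogniser longer chunks with more context, which tends to matter more than the exact gap threshold chosen here.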

wkat4242•9mo ago

Which model do you use? I use large usually, on a GPU. It's fast and works really well. Be aware though that it can only recognise one language at a time. It will autodetect if you don't specify one.

Of course the smaller models don't work nearly as well and they are often restricted to English. Large works great for me though it does require GPU hardware to be responsive enough, even with faster-whisper or insanely-fast-whisper.

jtr1•9mo ago

I’ve noticed recently (maybe I missed an announcement) that Siri now functions locally for at least some commands. Try putting an Apple Watch in airplane mode and asking it to set a timer or reminder.

swiftcoder•9mo ago

Siri has had limited offline functionality since at least iOS 15? Although I don't think most users noticed at the time, since most of Siri's command vocabulary is for things that require a network connection...

cadamsdotcom•9mo ago

Why haven’t Apple taken a look at the data then hardcoded handlers for the top ~1000 usages???

s3p•9mo ago

They are doing this, just at a mind-numbingly slow pace. They seem to add controls for brightness and power but don't make it clear what works when offline. It's not even worth trying because there's no guide or documentation on what commands would be available. You just have to go into airplane mode and try asking stuff. Awful UX

paul7986•9mo ago

Faithful year-and-a-half user of ChatGPT on my iPhone, which has made me loathe Siri for how dumb she is in every sense of the word!

When will OpenAI (with the help of Microsoft) release a GPT phone to compete with the iPhone? I'm so tired of the boring iPhone! Give me a GPT phone where from my lock screen GPT does everything for me. Fingers crossed :) it's secretly in the works!

rrr_oh_man•9mo ago

Why an LLM-written article, though?

norskeld•9mo ago

This summary-like style — with heavy formatting and every (!) paragraph as a bulleted list — drives me nuts tbqh. Especially in lengthy texts, it just looks... noisy, bland, and sometimes confusing.

andreeamiclaus•9mo ago

What's the format you would prefer? We're not using ChatGPT to write and we've experimented with this format. The other articles may have a better format?

rrr_oh_man•9mo ago

> We're not using ChatGPT to write

We can go into semantics, but that listicle has all the hallmarks of LLM-produced stuff, down to the "Generated Image"-tagged image.

andreeamiclaus•8mo ago

The image yes, it was generated and that's pretty clear.

We used to rely more on ChatGPT but that did not work out (figures!) so now it's a lot of authentic human speech.

So I'm curious when you say it has the hallmarks of LLMs since I also recognise a lot of them but not so much here
