Show HN: I built a sub-500ms latency voice agent from scratch

https://www.ntik.me/posts/voice-agent
101•nicktikhonov•3h ago
I built a voice agent from scratch that averages ~400ms end-to-end latency (phone stop → first syllable). That’s with full STT → LLM → TTS in the loop, clean barge-ins, and no precomputed responses.

What moved the needle:

Voice is a turn-taking problem, not a transcription problem. VAD alone fails; you need semantic end-of-turn detection.

The system reduces to one loop: speaking vs listening. The two transitions - cancel instantly on barge-in, respond instantly on end-of-turn - define the experience.

STT → LLM → TTS must stream. Sequential pipelines are dead on arrival for natural conversation.

TTFT dominates everything. In voice, the first token is the critical path. Groq’s ~80ms TTFT was the single biggest win.

Geography matters more than prompts. Colocate everything or you lose before you start.
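
The speaking-vs-listening loop described above can be sketched as a two-state machine. This is a minimal illustration, not the code from the repo: the event names ("end_of_turn", "barge_in", "playback_done") and the TurnLoop class are hypothetical, and the real system would start or cancel streaming tasks where this sketch only records an action.

```python
from enum import Enum, auto


class State(Enum):
    LISTENING = auto()
    SPEAKING = auto()


class TurnLoop:
    """Two-state loop: the agent is either listening or speaking."""

    def __init__(self):
        self.state = State.LISTENING
        # Actions taken so far; kept only so the sketch can be inspected.
        self.log = []

    def on_event(self, event: str) -> State:
        # Event names are assumptions made for this sketch.
        if self.state is State.LISTENING and event == "end_of_turn":
            # User finished their turn: start responding right away.
            self.log.append("start_response")
            self.state = State.SPEAKING
        elif self.state is State.SPEAKING and event == "barge_in":
            # User talked over the agent: cancel output right away.
            self.log.append("cancel_response")
            self.state = State.LISTENING
        elif self.state is State.SPEAKING and event == "playback_done":
            # Agent finished speaking without interruption.
            self.state = State.LISTENING
        # Any other (state, event) pair is ignored.
        return self.state
```

Feeding it "end_of_turn" followed by "barge_in" moves it to SPEAKING and back to LISTENING, with one start and one cancel recorded in log.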

GitHub Repo: https://github.com/NickTikhonov/shuo

Follow whatever I next tinker with: https://x.com/nick_tikhonov

Comments

MbBrainz•2h ago
Love it! Solving the latency problem is essential to making voice ai usable and comfortable. Your point on VAD is interesting - hadn't thought about that.
NickNaraghi•2h ago
Pretty exciting breakthrough. This actually mirrors the early days of game engine netcode evolution. Since latency is an orchestration problem (not a model problem) you can beat general-purpose frameworks by co-locating and pipelining aggressively.

Carmack's 2013 "Latency Mitigation Strategies" paper[0] made the same point for VR too: every millisecond hides in a different stage of the pipeline, and you only find them by tracing the full path yourself. Great find with the warm TTS websocket pool saving ~300ms, perfect example of this.

[0]: https://danluu.com/latency-mitigation/
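
The pipelining point can be illustrated with a toy asyncio sketch. It assumes stand-in stages (fake_llm and fake_tts are hypothetical, not real provider clients) and only shows the ordering: the first audio chunk is produced before the LLM has emitted its last token.

```python
import asyncio


async def fake_llm(events):
    # Stand-in for a streaming LLM: yields tokens one at a time.
    for tok in ["Hello", "there", "friend"]:
        await asyncio.sleep(0)  # hand control back to the event loop
        events.append(f"llm:{tok}")
        yield tok


async def fake_tts(tokens, events):
    # Stand-in for streaming TTS: synthesizes each token as it arrives
    # instead of waiting for the full response.
    async for tok in tokens:
        events.append(f"tts:{tok}")
        yield f"<audio {tok}>"


async def run_streaming():
    events = []
    audio = [chunk async for chunk in fake_tts(fake_llm(events), events)]
    return events, audio
```

Running asyncio.run(run_streaming()) returns the event order alongside the audio chunks; the entries alternate between llm and tts, which is the overlap a sequential pipeline gives up.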

jangletown•2h ago
impressive
lukax•2h ago
Or you could use Soniox Real-time (supports 60 languages) which natively supports endpoint detection - the model is trained to figure out when a user's turn ended. This always works better than VAD.

https://soniox.com/docs/stt/rt/endpoint-detection

Soniox also wins the independent benchmarks done by Daily, the company behind Pipecat.

https://www.daily.co/blog/benchmarking-stt-for-voice-agents/

You can try a demo on the home page:

https://soniox.com/

Disclaimer: I used to work for Soniox

Edit: I commented too soon. I only saw VAD and immediately thought of Soniox which was the first service to implement real time endpoint detection last year.

nicktikhonov•1h ago
If you read the post, you'll see that I used Deepgram's Flux. It also does endpointing and is a higher-level abstraction than VAD.
lukax•1h ago
Sorry, I commented too soon. Did you also try Soniox? Why did you decide to use Deepgram's Flux (English only)?
nicktikhonov•1h ago
I didn't try Soniox, but I made a note to check it out! I chose Flux because I was already using Deepgram for STT and just happened to discover it when I was doing research. It would definitely be a good follow-up to try out all the different endpointing solutions to see what would shave off additional latency and feel most natural.

Another good follow-up would be to try PersonaPlex, Nvidia's new model that would completely replace this architecture with a single model that does everything:

https://research.nvidia.com/labs/adlr/personaplex/

loevborg•1h ago
Nice write-up, thanks for sharing. How does your hand-vibed python program compare to frameworks like pipecat or livekit agents? Both are also written in python.
nicktikhonov•1h ago
I'm sure LiveKit or similar would be best to use in production. I'm sure these libraries handle a lot of edge cases, or at least let you configure things quite well out of the box. Though maybe that argument will become less and less potent over time. The results I got were genuinely impressive, and of course most of the credit goes to the LLM. I think it's worth building this stuff from scratch, just so that you can be sure you understand what you'll actually be running. I now know how every piece works and can configure/tune things more confidently.
perelin•1h ago
Great writeup! For VAD, did you use a headphone/mic combo or an open mic? If open, how did you deal with the agent interrupting itself?
nicktikhonov•1h ago
I was using Twilio, and as far as I'm aware they handle any echoes that may arise. I'm not sure where in the telephony stack this is handled, but luckily I didn't see any issues or have to solve this problem myself.
boznz•1h ago
"Voice is an orchestration problem" is basically correct. The two takeaways from this for me are

1. I wonder if it could be optimised more by just having a single language, and

2. How do we get around the problem of interference? Humans are good at conversation discrimination, i.e. listening while multiple conversations, TV, music, etc. are going on in the background. I've not had much success with voice in noisy environments.

modeless•1h ago
IMO STT -> LLM -> TTS is a dead end. The future is end-to-end. I played with this two years ago and even made a demo you can install locally on a gaming GPU: https://github.com/jdarpinian/chirpy, but concluded that making something worth using for real tasks would require training of end-to-end models. A really interesting problem I would love to tackle, but out of my budget for a side project.
nicktikhonov•1h ago
If you're of that opinion, you'll enjoy the new stuff coming out from nvidia:

https://research.nvidia.com/labs/adlr/personaplex/

woodson•1h ago
You mean Moshi (https://github.com/kyutai-labs/moshi)? PersonaPlex is just a fine-tuned Moshi model.
mountainriver•21m ago
Yeah except moshi doesn’t sound good at all
age123456gpg•1h ago
Hi all! Check out this Handy app https://github.com/cjpais/Handy - a free, open source, and extensible speech-to-text application that works completely offline.

I am using it daily to drive Claude and it works really well for me (much better than macOS dictation mode).

armcat•1h ago
This is an outstanding write-up, thank you! Regarding LLM latency, OpenAI recently introduced WebSockets in their Responses client, so it should be a bit faster. An alternative is to have a super small LLM running locally on your device. I built my own fully local pipeline and it had sub-second RTT, with no streaming or optimisations: https://github.com/acatovic/ova
nicktikhonov•1h ago
Very cool! Starred and on my reading list. Would love to chat and share notes, if you'd like.
docheinestages•1h ago
Does anyone know about a fully offline, open-source project like this voice agent (i.e. STT -> LLM -> TTS)?
nicktikhonov•1h ago
A friend built this, everything working in-browser:

https://ttslab.dev/voice-agent

CagedJean•1h ago
Do you have hot talk when you are alone in the shower with HER?
nicktikhonov•1h ago
Gross
shubh-chat•32m ago
This is superb, Nick! Thanks for this. Will try it out at some point for a project I am trying to build.
