Ask HN: What's the current best local/open speech-to-speech setup?

35•dsrtslnd23•13h ago
I’m trying to do the “voice assistant” thing fully locally: mic → model → speaker, low latency, ideally streaming + interruptible (barge-in).

Qwen3 Omni looks perfect on paper (“real-time”, speech-to-speech, etc). But I’ve been poking around and I can’t find a single reproducible “here’s how I got the open weights doing real speech-to-speech locally” writeup. Lots of “speech in → text out” or “audio out after the model finishes”, but not a usable realtime voice loop. Feels like either (a) the tooling isn’t there yet, or (b) I’m missing the secret sauce.

What are people actually using in 2026 if they want open + local voice?

Is anyone doing true end-to-end speech models locally (streaming audio out), or is the SOTA still “streaming ASR + LLM + streaming TTS” glued together?

If you did get Qwen3 Omni speech-to-speech working: what stack (transformers / vLLM-omni / something else), what hardware, and is it actually realtime?

What’s the most “works today” combo on a single GPU?

Bonus: rough numbers people see for mic → first audio back

Would love pointers to repos, configs, or “this is the one that finally worked for me” war stories.
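
For readers who have not built one, the "interruptible (barge-in)" requirement can be sketched as a playback task that is cancelled the moment the user starts talking. This is a minimal sketch, not a working voice loop: `fake_tts`, `speak`, `run_turn`, and `demo` are hypothetical names, the TTS is a stub that yields one word per "audio chunk", and the barge-in signal is a timer standing in for a real voice activity detector.

```python
import asyncio
from typing import AsyncIterator, List

REPLY = "one two three four five six seven eight nine ten"


async def fake_tts(text: str) -> AsyncIterator[str]:
    """Stand-in for a streaming TTS: yields one 'audio chunk' per word."""
    for word in text.split():
        await asyncio.sleep(0.02)  # pretend each chunk takes 20 ms to synthesize
        yield word


async def speak(text: str, played: List[str]) -> None:
    """Play chunks as they arrive; cancelling this task stops playback mid-reply."""
    async for chunk in fake_tts(text):
        played.append(chunk)


async def run_turn(reply: str, barge_in: asyncio.Event) -> List[str]:
    """Speak `reply`, but stop as soon as `barge_in` is set."""
    played: List[str] = []
    speaking = asyncio.create_task(speak(reply, played))
    interrupt = asyncio.create_task(barge_in.wait())
    # Whichever finishes first wins: the reply ends, or the user interrupts.
    _, pending = await asyncio.wait(
        {speaking, interrupt}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return played


async def demo() -> List[str]:
    barge_in = asyncio.Event()
    # Assumption: a real system would set this event from a VAD on the mic stream.
    asyncio.get_running_loop().call_later(0.05, barge_in.set)
    return await run_turn(REPLY, barge_in)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The point of the design is that playback is a cancellable task rather than a blocking call, so the same pattern applies whether the audio comes from a cascaded ASR + LLM + TTS stack or from an end-to-end speech model.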

Comments

jauntywundrkind•2h ago
It was a little annoying getting the old Qt5 tools installed, but I really enjoyed using dsnote / Speech Note. Huge model selection for my AMD GPU. Good tool. I haven't studied the options enough yet to suggest which model to go with. WhisperFlow is very popular.

Kyutai always does some very interesting work. Their delayed streams work is bleeding edge and sounds very promising, especially for low latency. Not sure why I haven't tried it yet, tbh. https://github.com/kyutai-labs/delayed-streams-modeling

There's also a really nice, elegant, simple app called Handy. It only supports Whisper and Parakeet V3, but it's a nice app and those are amazing models. https://github.com/cjpais/Handy

mpaepper•2h ago
You should look into the new Nvidia model: https://research.nvidia.com/labs/adlr/personaplex/

It has dual-channel input/output and a very permissive license.

dsrtslnd23•2h ago
oh - very interesting indeed! thanks
cbrews•37m ago
Thanks for sharing this! I'm going to put this on my list to play around with. I'm not really an expert in this tech (I come from an audio background), but I was recently playing around with streaming Speech-to-Text (using Whisper) / Text-to-Speech (using Kokoro at the time) on a local machine.

The most challenging part of my build was tuning the inference batch size. I was able to get Speech-to-Text working well with batches as small as 200ms. I even implemented a basic local agreement algorithm and it was still very fast (inference time, I think, was around 10-20ms?). You're basically limited by the minimum batch size, NOT inference time. Maybe that's the missing "secret sauce" suggested in the original post?
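
For anyone unfamiliar with the "local agreement" idea mentioned above, here is a minimal sketch of one common form of it: a word is committed only once two consecutive hypotheses for the growing audio buffer agree on it. This is not the commenter's code. The class and function names are hypothetical, and the hypotheses are plain word lists rather than real Whisper output.

```python
from typing import List


def common_prefix_len(a: List[str], b: List[str]) -> int:
    """Number of leading words on which two hypotheses agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class LocalAgreement:
    """Commit a word only after two consecutive hypotheses agree on it."""

    def __init__(self) -> None:
        self.committed: List[str] = []  # words already emitted downstream
        self.previous: List[str] = []   # hypothesis from the previous chunk

    def update(self, hypothesis: List[str]) -> List[str]:
        """Feed the full hypothesis for the audio so far; return newly committed words."""
        stable = common_prefix_len(self.previous, hypothesis)
        self.previous = hypothesis
        # Only words beyond what was already committed count as new.
        new_words = hypothesis[len(self.committed):stable]
        self.committed.extend(new_words)
        return new_words


if __name__ == "__main__":
    la = LocalAgreement()
    for hyp in ["turn on", "turn on the", "turn on the light"]:
        print(la.update(hyp.split()))
```

Under this policy a word has to show up in two successive hypotheses before it is emitted, so with 200ms chunks committed text trails the audio by at least one extra chunk regardless of how fast inference is. That fits the observation that chunk size, not inference time, sets the latency floor. Note that this sketch never retracts a committed word, even if a later hypothesis revises it.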

In the use case listed above, the TTS probably isn't a bottleneck as long as OP can generate tokens quickly.

All that being said, a wrapped model like this that can handle the hand-offs between these parts of the process sounds really useful, and I'm definitely interested in seeing how it performs.

Let me know if you guys play with this and find success.

Johnny_Bonk•56m ago
Is anyone using any reasonably good small open-source speech-to-text models?
garblegarble•53m ago
For my inputs, whisper distil-large-v3.5 is the best. I tried Parakeet 0.6 v3 last night but it has higher error rates than I'd like (but it is fast...)
Johnny_Bonk•50m ago
Nice, I'll try it. As of now, for my personal STT workflow I use the ElevenLabs API, which is pretty generous, but I'm curious to play around with other options.
garblegarble•39m ago
I assume that will be better than Whisper. I haven't benchmarked it against cloud models, because the project I'm working on cannot send data out to cloud models.
BiraIgnacio•33m ago
Oh, I've been looking into Whisper and Vosk over the last few days. I'll probably go with Whisper (via whisper.cpp), but has anyone compared it to the Vosk models?
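
One way to compare two STT models on your own recordings is word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the reference length. A minimal, dependency-free sketch follows. The function name is hypothetical and the normalization is an assumption (lowercasing and whitespace splitting only); real comparisons usually also strip punctuation and normalize numbers.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("turn on the kitchen light", "turn on a kitchen light"))
```

Running each model over the same set of clips and averaging this score gives a like-for-like number, which matters because published error rates rarely match your own microphone, accent, and vocabulary.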
amelius•51m ago
For the TTS part: https://github.com/supertone-inc/supertonic
hackomorespacko•10m ago
Just go out on the street and talk?
