On Linux, there's access to the latest Cohere Transcribe model and it works very, very well. Requires a GPU though. Larger local models generally shouldn't require a subordinate model for cleanup.
Have you compared WhisperKit to faster-whisper or similar? You might be able to run turbov3 successfully and negate the need for cleanup.
Incidentally, waiting for Apple to blow this all up with native STT any day now. :)
Have you ever considered using a foot-pedal for PTT?
Apple incidentally already has native STT, but for some reason they just don't use a decent model yet.
Apparently they do have a better model, they just haven't exposed it in their own OS yet!
https://developer.apple.com/documentation/speech/bringing-ad...
Wonder what's the hold up...
For footpedal:
Yes, conceptually it’s just another evdev-trigger source, assuming the pedal exposes usable key/button events.
Otherwise we’d bridge it into the existing external control interface. Either way, hooks are there. :)
Parakeet does both just fine.
Not sure how you're running it, via whichever "app thing", but...
On resource-limited machines: "continuous recording" mode emits text whenever silence is detected, via a configurable threshold.
It outputs as you speak, in more reasonable chunks; in aggregate it's the same output, just chunked efficiently.
Maybe you can try hackin' that up?
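The silence-threshold chunking above can be sketched in a few lines. This is a hedged illustration, not Handy's actual implementation: it assumes audio arrives as per-frame energy values (real code would compute RMS energy from samples), and the threshold and silence-frame count are made-up defaults.

```python
from dataclasses import dataclass, field

@dataclass
class SilenceChunker:
    """Accumulate frames; emit a chunk once enough quiet frames arrive."""
    energy_threshold: float = 0.01   # frames below this count as silence
    min_silence_frames: int = 2      # this much silence ends a chunk

    _buffer: list = field(default_factory=list)
    _quiet: int = 0

    def feed(self, frame_energy: float):
        """Feed one frame's energy; return a finished chunk or None."""
        self._buffer.append(frame_energy)
        if frame_energy < self.energy_threshold:
            self._quiet += 1
            # Only emit if the chunk contains something besides silence.
            if self._quiet >= self.min_silence_frames and len(self._buffer) > self._quiet:
                chunk, self._buffer, self._quiet = self._buffer, [], 0
                return chunk
        else:
            self._quiet = 0
        return None
```

Each emitted chunk would then go to the recognizer independently, so transcription keeps pace with speech instead of waiting for the whole recording.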
Extra bonus is that Handy lets you add an automatic LLM post-processor. This is very handy for the Parakeet V3 model, which can sometimes repeat words or make recognition errors, for example duplicating the recognition of a single word a dozen dozen dozen dozen dozen dozen dozen dozen times.
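For that particular failure mode you don't strictly need an LLM; a regex pass that collapses consecutive repeated words gets most of the way there. A hedged sketch (the pattern and the case-insensitive flag are my choices, not anything Handy ships):

```python
import re

def collapse_repeats(text: str) -> str:
    """Collapse runs of the same word: 'dozen dozen dozen' -> 'dozen'."""
    return re.sub(r"\b(\w+)(\s+\1\b)+", r"\1", text, flags=re.IGNORECASE)
```

The obvious caveat is false positives on legitimate doubled words ("had had", "that that"), which is exactly where an LLM post-processor earns its keep.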
Project repo: https://github.com/finnvoor/yap
EDIT: I see there is an open issue for that on GitHub
I've been using Parakeet v3, which is fantastic (and tiny). Confused to still see Whisper out there.
Also vibe coded a way to use parakeet from the same parakeet piper server on my grapheneos phone https://zach.codes/p/vibe-coding-a-wispr-clone-in-20-minutes
It's also in many flavours, from tiny to turbo, and so can fit many system profiles.
That's what makes it unique and hard to replace.
The macOS built-in STT (dictation) seems better than all the 3rd-party local apps I tried in the past that people raved about. I have tried several.
Is this better somehow?
If the 3rd party apps did streaming with typing in place and corrections within a reasonable window when they understand things better given more context, that would be cool. Theoretically, a custom model or UX could be "better" than what comes free built into macOS (more accurate or customizable).
But when I contacted the developer of my favorite one they said that would be pretty hard to implement due to having to go back and make corrections in the active field, etc.
I assume streaming STT in these utilities for Mac will get better at some point, but I haven't seen it yet (been waiting). It seems these tools generally are not streaming, e.g. they want you to finish speaking first before showing you anything. Which doesn't work for me when I'm dictating. I want to see what I've been saying lately, to jog my memory about what I've just said and help guide the next thing I'm about to say. I certainly don't want to split my attention by manually toggling the control (whether PTT or not) periodically to indicate "ok, you can render what I just said now".
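At the text level, the "go back and make corrections" problem the developer mentioned is just a diff: compare what's already rendered against the revised hypothesis, backspace over the divergent tail, and type the replacement. A minimal sketch of that idea (it deliberately ignores the genuinely hard part, which is injecting keystrokes into the active field):

```python
def revise(rendered: str, revised: str):
    """Return (backspaces_to_send, text_to_type) that turns the
    currently rendered text into the revised transcription."""
    # Find the length of the common prefix.
    i = 0
    while i < min(len(rendered), len(revised)) and rendered[i] == revised[i]:
        i += 1
    return len(rendered) - i, revised[i:]
```

For example, if the screen shows "I scream for" and the model, given more context, now believes "Ice cream for", the edit is 11 backspaces followed by typing "ce cream for".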
I guess "hold-to-talk" tools are for delivering discrete, fully formed messages, not for longer, running dictation.
AFAICT, TFA is focused on hold-to-talk as the differentiator, over double-tap to begin speaking and double-tap to end speaking?
You could hook it up to some workflow over the local API depending on how you want to dump the text, but the web UI is good too.
The Show HN by the author was at: https://news.ycombinator.com/item?id=44145564
E.g., if your name is `Donold` (pronounced like Donald), there isn't a transcription model in existence that will transcribe it correctly. That means forget dictating your name or email; it will never come out right.
Combine that with any subtleties of speech you have, or industry jargon you frequently use and you will have a much more useful tool.
We have a ton of options for "predict the most common word that matches this audio data" but I haven't found any "predict MY most common word" setups.
https://developers.openai.com/cookbook/examples/whisper_prom...
The button next to it pastes when I press it. If I press it again, it hits the enter command.
You can get a lot done with two buttons.
Would you consider making available a video showing someone using the app?
I have collected the best open-source voice typing tools categorized by platform in this awesome-style GitHub repo. Hope you all find this useful!
What makes the others vastly better?