Show HN: Whispering – Open-source, local-first dictation you can trust

https://github.com/epicenter-so/epicenter/tree/main/apps/whispering

591•braden-w•5mo ago

Hey HN! Braden here, creator of Whispering, an open-source speech-to-text app.

I really like dictation. For years, I relied on transcription tools that were almost good, but they were all closed-source. Even a lot of them that claimed to be “local” or “on-device” were still black boxes that left me wondering where my audio really went.

So I built Whispering. It’s open-source, local-first, and most importantly, transparent with your data. Your data is stored locally on your device, and your audio goes directly from your machine to a local provider (Whisper C++, Speaches, etc.) or your chosen cloud provider (Groq, OpenAI, ElevenLabs, etc.). For me, the features were good enough that I left my paid tools behind (I used Superwhisper and Wispr Flow before).

Productivity apps should be open-source and transparent with your data, but they also need to match the UX of paid, closed-source alternatives. I hope Whispering is near that point. I use it for several hours a day, from coding to thinking out loud while carrying pizza boxes back from the office.

Here’s an overview: https://www.youtube.com/watch?v=1jYgBMrfVZs, and here’s how I personally am using it with Claude Code these days: https://www.youtube.com/watch?v=tpix588SeiQ.

There are plenty of transcription apps out there, but I hope Whispering adds some extra competition from the OSS ecosystem (one of my other OSS favorites is Handy https://github.com/cjpais/Handy). Whispering has a few tricks up its sleeve, like a voice-activated mode for hands-free operation (no button holding), and customizable AI transformations with any prompt/model.

Whispering used to be in my personal GH repo, but I recently moved it as part of a larger project called Epicenter (https://github.com/epicenter-so/epicenter), which I should explain a bit...

I’m basically obsessed with local-first open-source software. I think there should be an open-source, local-first version of every app, and I would like them all to work together. The idea of Epicenter is to store your data in a folder of plaintext and SQLite, and build a suite of interoperable, local-first tools on top of this shared memory. Everything is totally transparent, so you can trust it.

Whispering is the first app in this effort. It’s not there yet regarding memory, but it’s getting there. I’ll probably write more about the bigger picture soon, but mainly I just want to make software and let it speak for itself (no pun intended in this case!), so this is my Show HN for now.

I just finished college and was about to move back with my parents and work on this instead of getting a job…and then I somehow got into YC. So my current plan is to cover my living expenses and use the YC funding to support maintainers, our dependencies, and people working on their own open-source local-first projects. More on that soon.

Would love your feedback, ideas, and roasts. If you would like to support the project, star it on GitHub here (https://github.com/epicenter-so/epicenter) and join the Discord here (https://go.epicenter.so/discord). Everything’s MIT licensed, so fork it, break it, ship your own version, copy whatever you want!

Comments

solarkraft•5mo ago

Cool! I just started becoming interested in local transcription myself.

If you add Deepgram listen API compatibility, you can do live transcription via either Deepgram (duh) or OWhisper: https://news.ycombinator.com/item?id=44901853

(I haven’t gotten the Deepgram JS SDK working with it yet, currently awaiting a response by the maintainers)

braden-w•5mo ago

Thank you for checking it out! Coincidentally, it's on the way:

https://github.com/epicenter-so/epicenter/pull/661

In the middle of a huge release that sets up FFMPEG integration (OWhisper needs very specifically formatted files), but hoping to add this after!
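For anyone curious what that kind of conversion step can look like, here is a minimal Python sketch that builds an FFmpeg command producing 16 kHz mono 16-bit PCM WAV, the format the whisper.cpp README suggests converting to. Whether OWhisper needs exactly this format is an assumption, and `build_ffmpeg_command` and `convert` are hypothetical helpers, not Whispering's actual code.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(src: Path, dst: Path) -> list[str]:
    """Build an FFmpeg command that converts `src` to 16 kHz mono 16-bit PCM WAV."""
    return [
        "ffmpeg",
        "-y",                 # overwrite the output file if it already exists
        "-i", str(src),       # input recording in any format FFmpeg can read
        "-ar", "16000",       # resample to 16 kHz
        "-ac", "1",           # downmix to a single (mono) channel
        "-c:a", "pcm_s16le",  # encode as signed 16-bit little-endian PCM
        str(dst),
    ]


def convert(src: Path, dst: Path) -> None:
    """Run the conversion; raises CalledProcessError if FFmpeg exits non-zero."""
    subprocess.run(build_ffmpeg_command(src, dst), check=True)
```

Keeping the command construction separate from the `subprocess.run` call makes the arguments easy to inspect or log without FFmpeg installed.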

braden-w•5mo ago

For those checking out the repo this morning, I'm in the middle of a release that adds Whisper C++ support!

https://github.com/epicenter-so/epicenter/pull/655

After this pushes, we'll have far more extensive local transcription support. Just fixing a few more small things :)

teiferer•5mo ago

You mentioned that you got into YC .. what is the road to profitability for your project(s) if everything is open source and local?

Johnny_Bonk•5mo ago

Great work! I've been using Willow Voice but I think I will migrate to this (much cheaper) but they do have a great UI or UX just by hitting a key to start recording and the context goes into whatever text input you want. I haven't installed whispering yet but will do so. P.S

braden-w•5mo ago

Amazing, thanks for giving it a try! Let me know how it goes and feel free to message me any time :) happy to add any features that you miss from closed-source alternatives!

newman314•5mo ago

Does Whispering support semantic correction? I was unable to find confirmation while doing a quick search.

braden-w•5mo ago

Hmm, we support prompts at both 1. the model level (Whisper supports a "prompt" parameter that sometimes works) and 2. the transformations level (inject the transcribed text into a prompt and get the output from an LLM of your choice). Unsure how else semantic correction can be implemented, but always open to expanding the feature set greatly over the next few weeks!
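To make the transformations idea concrete, here is a minimal Python sketch of injecting a transcript into a prompt template before sending it to an LLM. The `{{input}}` placeholder and the `build_transformation_prompt` helper are assumptions for illustration, not necessarily how Whispering does it.

```python
def build_transformation_prompt(template: str, transcript: str) -> str:
    """Inject the transcribed text into a user-defined prompt template."""
    placeholder = "{{input}}"  # hypothetical placeholder name
    if placeholder not in template:
        # Fall back to appending the transcript so it is never silently dropped.
        return f"{template}\n\n{transcript}"
    return template.replace(placeholder, transcript)


template = "Fix spelling and punctuation, keep the meaning unchanged:\n{{input}}"
prompt = build_transformation_prompt(template, "lets meat at for pm")
print(prompt)
```

The resulting string is what you would pass to whichever model you chose; the model call itself is left out because it depends on the provider.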

joshred•5mo ago

They might not know how Whisper works. I suspect that the answer to their question is 'yes' and the reason they can't find a straightforward answer through your project is that the answer is so obvious to you that it's hardly worth documenting.

Whisper for transcription tries to transform audio data into LLM output. The transcripts generally have proper casing, punctuation and can usually stick to a specific domain based on the surrounding context.

dumbmrblah•5mo ago

I’ve been using whispering for about a year now, it has really changed how I interact with the computer. I make sure to buy mice or keyboards that have programmable hotkeys so that I can use the shortcuts for whispering. I can’t go back to regular typing at this point, just feels super inefficient. Thanks again for all your hard work!

braden-w•5mo ago

Thank you so much for your support! It really means a lot :) Happy to hear that it's helped you, and keep in touch if you ever have any issues!

glial•5mo ago

This is wonderful, thank you for sharing!

Do you have any sense of whether this type of model would work with children's speech? There are plenty of educational applications that would value a privacy-first locally deployed model. But, my understanding is that Whisper performs pretty poorly with younger speakers.

braden-w•5mo ago

Thank you! And you’re right, I think Whisper struggles with younger voices. Haven’t tested Parakeet or other models for this yet, but that’s a great use case (especially since privacy matters in education). I would also shout out Hyprnote! (https://hyprnote.com/) They might be expanding their model options, as they have shown with OWhisper (https://docs.hyprnote.com/owhisper/what-is-this).

codybontecou•5mo ago

Now we just need text to speech so we can truly interact with our computers hands free.

PyWoody•5mo ago

If you're on Mac, you can use `say`, e.g.,

    say "This is a test message" --voice="Bubbles"

EDIT: I'm having way too much fun with this lol

    say "This is a test message" --voice="Organ"
    say "This is a test message" --voice="Good News"
    say "This is a test message" --voice="Bad News"
    say "This is a test message" --voice="Jester"

braden-w•5mo ago

LOL that's pretty funny, thank you for the share!

Aachen•5mo ago

    $ apt install espeak-ng
    $ espeak-ng 'Hello, World!'

It takes some adjustment and sounds a lot worse than what e.g. Google ships proprietarily on your phone, but after ~30 seconds of listening (if I haven't used it recently) I understand it just as well as I understand the TTS engine on my phone

If there's a more modern package that sounds more human that's a similar no-brainer to install, I'd be interested, but just to note that this part of the problem has been solved for many years now, even if the better-sounding models are usually not as openly licensed, orders of magnitude more resource-intensive, limited to a few languages, and often less reliable/predictable in their pronunciation of new or compound words (usually not all of these issues at once)

0xbadcafebee•5mo ago

  $ apt install festival
  $ echo "Hello, World!" | festival --tts

Not impressively better, but I find festival slightly more intelligible.

Aachen•5mo ago

Will give it a spin, thanks!

0xbadcafebee•5mo ago

I also just found something that sounds genuinely realistic: Piper (https://github.com/OHF-Voice/piper1-gpl/tree/main). It's slow but apparently you can run it as a daemon to be faster, and it integrates with Home Assistant and Speech Dispatcher.

  $ sudo apt update
  $ sudo apt install -y python3 python3-pip python3-venv libsndfile1 ffmpeg
  $ python3 -m venv ./venv/piper-tts
  $ ./venv/piper-tts/bin/pip install piper-tts
  $ ./venv/piper-tts/bin/python3 -m piper.download_voices en_US-lessac-medium
  $ ./venv/piper-tts/bin/piper -m en_US-lessac-medium -- 'This will play on your speakers.'

To manage the install graphically, you can use Pied (https://pied.mikeasoft.com/), which has a snap and a flatpak. That one's really cool because you can choose the voice graphically which makes it easy to try them out or switch voices. To play sound you just use "spd-say 'Hello, world!'"

More crazy: Home Assistant did a "Year of Voice" project (https://www.home-assistant.io/blog/2022/12/20/year-of-voice/) that culminated in a real open-source voice assistant product (https://www.home-assistant.io/voice-pe/) !!! And it's only $60??

Aachen•5mo ago

I've tried Piper using this app: https://f-droid.org/packages/org.woheller69.ttsengine

It has some mispronunciations in the texts I tried to listen to, besides using so much RAM that it kills basically all other apps to make space for this. Not really worth it when espeak is already understandable

I've tried Festival in the meantime and it doesn't support Dutch or German, two of the three languages I use regularly. I keep coming back to espeak as the only option that will simply always work xD

wkcheng•5mo ago

Does this support using the Parakeet model locally? I'm a MacWhisper user and I find that Parakeet is way better and faster than Whisper for on-device transcription. I've been using push-to-transcribe with MacWhisper through Parakeet for a while now and it's quite magical.

polo•5mo ago

+1 for MacWhisper. Very full featured, nice that it's a one time purchase, and the developer is constantly improving it.

daemonologist•5mo ago

Parakeet is amazing - 3000x real-time on an A100 and 5x real-time even on a laptop CPU, while being more accurate than whisper-large-v3 (https://huggingface.co/spaces/hf-audio/open_asr_leaderboard). NeMo is a little awkward though; I'm amazed it runs locally on Mac (for MacWhisper).
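To put those speedups in concrete terms, here is a quick back-of-the-envelope sketch in Python. The helper name is made up; the 3000x and 5x figures are the ones quoted above.

```python
def transcription_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Time to transcribe audio when the model runs at `realtime_factor` times real time."""
    return audio_seconds / realtime_factor


one_hour = 60 * 60  # seconds of audio

a100 = transcription_seconds(one_hour, 3000)  # 3600 / 3000 = 1.2 seconds
laptop = transcription_seconds(one_hour, 5)   # 3600 / 5 = 720 seconds (12 minutes)
print(f"A100: {a100:.1f} s, laptop CPU: {laptop / 60:.0f} min")
```

So an hour of audio would take roughly a second on an A100 and about twelve minutes on a laptop CPU at those rates.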

wkcheng•5mo ago

Yeah, Parakeet runs great locally on my M1 laptop (through MacWhisper). Transcription speed of recordings feels at least 10x faster than Whisper, and the accuracy is better as well. Push to talk for dictation is pretty seamless since the model is so fast. I've observed no downside to Parakeet if you're speaking English.

SebKba•5mo ago

Parakeet v3 supports many more languages. Works great with MacWhisper.

mark212•5mo ago

seems like "not yet" is the answer from other comments

braden-w•5mo ago

Not yet, but I want it too! Parakeet looks incredible (saw that leaderboard result). My current roadmap is: finish stabilizing whisper.cpp integration, then add Parakeet support. If anyone has bandwidth to PR the connector, I’d be thrilled to merge it.

Bolwin•5mo ago

Unfortunately, because it's Nvidia, parakeet doesn't work with Whisper.cpp as far as I'm aware. You need onnx

braden-w•5mo ago

Some lovely folks have left some other open-source projects that implement Parakeet. I would recommend checking those out! I'll also work on my own implementation in the meantime :D

warangal•5mo ago

A bit of a tangent about Parakeet and other Nvidia NeMo models: I never found actual architecture implementations as PyTorch/TF code. It seems like all such models are instantiated from a binary blob, making it difficult to experiment! Maybe I missed something; does anyone here have more experience with .nemo models to shed some more light on this?

satisfice•5mo ago

Windows Defender says it is infected.

sa-code•5mo ago

This needs to be higher, the installer on the README has a trojan.

fencepost•5mo ago

More details please? Which installer?

---7.3.0--- This release popped up just a few minutes ago, so VirusTotal results for the 7.3.0 EXE and MSI installers

EXE (still running behavior checks but Arctic Wolf says Unsafe and AVG & Avast say PUP): https://www.virustotal.com/gui/file/816b21b7435295d0ac86f6a8...

MSI nothing flags immediately, still running behavior checks (https://www.virustotal.com/gui/file/e022a018c4ac6f27696c145e...)

---7.2.2/7.2.1 below--- I do note one bit of weirdness, the Windows downloads show 7.2.2 but the download links themselves are 7.2.1. 7.2.1 is also what shows on the release from 3 days ago even though it's numbered 7.2.2.

I didn't check the Mac or Linux installers, but for Windows VirusTotal flags nothing on the 7.2.1/7.2.2 MSI (https://www.virustotal.com/gui/file/7a2d4fec05d1b24b7deda202...) and 3 flags on the EXE (ArcticWolf Unsafe, AVG & Avast PUP) (https://www.virustotal.com/gui/file/a30388127ad48ca8a42f9831...)

braden-w•5mo ago

Need to run a diff of the 7.2.2 tag against 7.3.0; I suspect the issue might be something related to an edit I made on `tauri.conf.json` or one of my Rust dependencies.

braden-w•5mo ago

We're actively tracking this issue here:

https://github.com/epicenter-so/epicenter/issues/440

Thank you again for bringing this to my attention! Need to step up my Windows development.

fencepost•5mo ago

Keep in mind that AVG and Avast are owned by the same company now so overlap is likely.

I'm not at my computer but IIRC there was mention of connecting out to a couple of initially unidentifiable domains but a little digging makes it seem they're poorly documented but related to visual studio analytics. I ignored after seeing that.

barryfandango•5mo ago

I'm no expert, but since it acts as a keyboard wedge it's likely to be unpopular with security software.

braden-w•5mo ago

Ahh that's unfortunate. This most likely is related to the Rust `enigo` crate, which we use to write text to the cursor. You can see the lines in question here: https://github.com/epicenter-so/epicenter/blob/60f172d193d88...

If it's still an issue, feel free to build it locally on your machine to ensure your supply chain is clean! I'll add more instructions in the README in the future.
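If you'd rather compare a downloaded installer against the VirusTotal reports than rebuild it, one approach is to hash the file locally and look that SHA-256 up on VirusTotal. A minimal Python sketch; `sha256_of_file` is a hypothetical helper and the installer file name is a placeholder:

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    installer = Path("Whispering-setup.exe")  # placeholder file name
    if installer.exists():
        print(sha256_of_file(installer))
```

A matching hash only tells you that you have the same file that was scanned; it says nothing on its own about whether the flags are false positives.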

hexfish•5mo ago

What does Virustotal say?

mrs6969•5mo ago

am I not getting it correctly; it says local is possible but can't find any information about how to run it without any api key?

I get the whispers models, and do what? how to run in a device without internet, no documentation about it...

rpdillon•5mo ago

The docs are pretty clear that you need to use speaches if you want entirely local operation.

https://speaches.ai/

yunohn•5mo ago

It’s not very clear, rather just a small mention. Given OP’s extensive diatribe about local-first, the fact that it prefers online providers is quite a big miss tbh.

braden-w•5mo ago

Yeah I agree, I neglected to update the docs and demo. This post was made anticipating the local transcription feature to drop earlier but it took some time due to some bugs. Before, the default option was using Groq for transcription, but that was admittedly before I figured out local transcription and wanted something to work in the meantime. Will be changing to local as the default strategy in the documentation.

mrs6969•5mo ago

Agreed.

On the other hand, kudos to developer, already working to make it happen!

rpdillon•5mo ago

Yeah, apologies, I did read through the vast majority of the readme before replying. I realize this is not common at all.

braden-w•5mo ago

Commented this earlier, but I'm in the middle of a release that adds Whisper C++ support! https://github.com/epicenter-so/epicenter/pull/655

After this pushes, we'll have far more extensive local transcription support. Just fixing a few more small things :)

ericd•5mo ago

Awesome, this takes Whispering from something I'd probably not bother with to something I'd consider integrating into my daily workflow. Thanks very much for the tool!

random3•5mo ago

are there any non-Whisper-based voice models/tech/APIs?

braden-w•5mo ago

Yes, we currently support OpenAI/ElevenLabs/Deepgram APIs that all use non-Whisper models (presumably) under the hood. Speaches also supports other models that are not Whisper. Hopefully adding Parakeet support later too!

michael-sumner•5mo ago

How does this compare to VoiceInk which is also open-source and been there much longer and supports all the features that you have? https://github.com/Beingpax/VoiceInk

phainopepla2•5mo ago

One thing that immediately stands out is VoiceInk is macOS only, while Whispering supports Linux and Windows in addition to macOS

oulipo•5mo ago

I really like VoiceInk!

For the Whispering dev: would it be possible to set "right shift" as a toggle? also do it like VoiceInk which is:

- either a short right shift press -> then it starts, and a short right shift press again to stop
- or a "long right shift press" (e.g. when pressed for at least 0.5s) -> then it starts and just waits for you to release right shift to stop

it's quite convenient
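A rough Python sketch of that short-press/long-press behavior as a small state machine. The class and method names are hypothetical, and one detail is an assumption that differs slightly from the description: recording starts on key down in both modes (so no audio is lost) and the mode is decided on release.

```python
from dataclasses import dataclass
from typing import Optional

LONG_PRESS_SECONDS = 0.5  # threshold taken from the suggestion above


@dataclass
class HotkeyRecorder:
    recording: bool = False
    latched: bool = False  # True while recording in toggle mode after a short press
    is_down: bool = False  # tracks the physical key state to ignore auto-repeat
    pressed_at: Optional[float] = None

    def key_down(self, now: float) -> None:
        if self.is_down:
            return  # ignore OS key auto-repeat events
        self.is_down = True
        if self.latched:
            # Second press in toggle mode: stop recording.
            self.recording = False
            self.latched = False
        else:
            # Start recording right away and remember when the press began.
            self.recording = True
            self.pressed_at = now

    def key_up(self, now: float) -> None:
        self.is_down = False
        if self.pressed_at is None:
            return  # this release belongs to the press that stopped a toggle recording
        held = now - self.pressed_at
        self.pressed_at = None
        if held >= LONG_PRESS_SECONDS:
            self.recording = False  # hold-to-talk: releasing the key ends the recording
        else:
            self.latched = True  # short press: keep recording until the next press
```

Timestamps are passed in rather than read from a clock, which keeps the logic easy to test and independent of whatever hotkey library delivers the events.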

another really cool stuff would be to have the same "mini-recorder" which pops-up on screen like VoiceInk when you record, and once you're done it would display the current transcript, and any of your "transformation" actions, and let you choose which one (or multiple) you want to apply, each time pasting the result in the clipboard

d4rkp4ttern•5mo ago

VoiceInk (one time payment) and WisprFlow (subscription) are currently my fav dictation apps. I just looked at Whispering and have to say VoiceInk is far superior to Whispering in terms of UX and clarity of settings, so I think VoiceInk deserves at least as much attention. There are several things that make a huge difference in dictation apps, besides the obvious speed and accuracy:

- allow flexible recording toggle shortcuts
- show a visual icon with waves etc showing recording
- how the clipboard is handled during recording (does it copy to clipboard? does it clear it after text output?)

VoiceInk is nearly there in terms of good behavior on these dimensions, and I hope to ditch my Wispr Flow sub soon.

ideashower•5mo ago

Is there speaker detection?

braden-w•5mo ago

Diarization is on the roadmap! Some providers support it, but some don't and the adapter for that could be tricky. Currently, for diarization I use the Elevenlabs Scribe API https://elevenlabs.io/app/speech-to-text, but there are surely other options

ideashower•5mo ago

Do you know if there's any kind of writing about the different types of diarization methods?

tummler•5mo ago

Related, just as a heads up. I've been using this for 100% local offline transcription for a while, works well: https://github.com/pluja/whishper

braden-w•5mo ago

Awesome, thank you so much for bringing this to my attention and including it in the thread! Always cool to see other open source projects :)

chrisweekly•5mo ago

> "I think there should be an open-source, local-first version of every app, and I would like them all to work together. The idea of Epicenter is to store your data in a folder of plaintext and SQLite, and build a suite of interoperable, local-first tools on top of this shared memory. Everything is totally transparent, so you can trust it."

Yes! This. I have almost no experience w/ tts, but if/when I explore the space, I'll start w/ Whispering -- because of Epicenter. Starred the repo, and will give some thought to other apps that might make sense to contribute there. Bravo, thanks for publishing these and sharing, and congrats on getting into YC! :)

spullara•5mo ago

IF you do want to then ALSO have a cloud version, you can just use the AgentDB API and upload them there and just change where the SQL runs.

sebastiennight•5mo ago

I think we're talking about STT (speech-to-text) here, not TTS.

chrisweekly•5mo ago

whoops! absolutely correct, that's what I meant.

braden-w•5mo ago

Thanks so much for the support! Really appreciate the feedback, and it’s great to hear the vision resonates. No worries on the STT/TTS experience; it’s just awesome to connect with someone who shares the values of open-source and owning our data :) I’m hoping my time in YC can be productive and, along the way, create more support for other OSS developers too. Keep in touch!

dev0p•5mo ago

That's a good idea... Just git repo your whole knowledge base and build on top of it.

marcodiego•5mo ago

> I’m basically obsessed with local-first open-source software.

We all should be.

braden-w•5mo ago

Agreed!

satvikpendem•5mo ago

Are these all just Whisper wrappers? I don't get it, the underlying model still isn't as good as paid custom models from companies. Is there an actual open source / open weights alternative to Whisper for speech to text? I know only of Parakeet.

sa-code•5mo ago

Voxtral mini is a bit bigger but their mixed language demos looked super impressive https://mistral.ai/news/voxtral

braden-w•5mo ago

We like Whisper because it's open-source :) but we also support OpenAI 4o-transcribe/ElevenLabs/Deepgram APIs that all use non-Whisper models (presumably) under the hood. Speaches also supports other models that are not Whisper. Hopefully adding Parakeet support later too!

ayushrodrigues•5mo ago

I've been interested in a tool like this for a while. I currently have tried whisprflow and aqua voice but wanted to use my API key and store more context locally. How does all the data get stored and how can I access it?

braden-w•5mo ago

The data is currently stored in IndexedDB, and you can currently only access it through the user interface (or digging into system files). However, I'm hoping in future updates, all of the transcriptions will instead be stored as markdown files in your local file system. More on that later!

dllthomas•5mo ago

Can it tell voices apart?

hephaes7us•5mo ago

Speaker diarization is the term you are looking for, and this is more difficult than simple transcription. I'm rather confident that someone probably has a good solution by now (if you want to pay for an API), but I haven't seen an open-source/open-weights tool for diarization/transcription. I looked a few months ago, but things move fast...

dllthomas•5mo ago

Thanks, that, yeah. I've looked occasionally but it's been a bit. Necessary feature in a house with a 9yo. I've been thinking about taking a swing at solving my problem without solving the general problem.

braden-w•5mo ago

Diarization is on the roadmap; some providers support it but some don't and the adapter for that could be tricky. Whispering is not meant for meeting notes for now; for something like that or diarization I would recommend trying Hyprnote: https://hyprnote.com or interfacing with the Elevenlabs Scribe API https://elevenlabs.io/app/speech-to-text

dllthomas•5mo ago

I'm not looking for attributed meeting notes, so much as making it harder for a passing child to inject content.

oulipo•5mo ago

Really nice!

For OsX there is also the great VoiceInk which is similar and open-source https://github.com/Beingpax/VoiceInk/

jiehong•5mo ago

Very similar and works well. It’s a bring your own API key if you want/need. Also with local whisper.

braden-w•5mo ago

Awesome, thank you so much for bringing this to my attention! Cool to see another open source project that has different implementations :) much to learn with their Parakeet implementation!

jnmandal•5mo ago

Looks like a really cool project. Do you have any opinions on which transcription models are the best, from a quality perspective? I have heard a lot of mixed opinions on this. Curious what you've found in your development process?

braden-w•5mo ago

I'm a huge fan of using Whisper hosted on Groq since the transcription is near instantaneous. ElevenLabs' Scribe model is also particularly great with accuracy, and I use it for high-quality transcriptions or manually upload files to their API to get diarization and timestamps (https://elevenlabs.io/app/speech-to-text). That being said, I'm not the biggest expert on models. In my day-to-day workflow, I usually swap between Whisper C++ for local transcription or Groq if I want the best balance of speed/performance, unless I'm working on something particularly sensitive.

jnmandal•5mo ago

Nice. Yeah, we are dogfooding some systems I built in my household. We use whisper.cpp and I haven't had any issues. I get told frequently I should be using eleven labs but I just have been too lazy to build a benchmark that would help me decide

hereme888•5mo ago

Earlier today I discovered Vibe: https://github.com/thewh1teagle/vibe

Local, using WhisperX. Precompiled binaries available.

I'm hoping to find and try a local-first version of an nvidia/canary-like model (like https://huggingface.co/nvidia/canary-qwen-2.5b) since it's almost twice as fast as Whisper with an even lower word error rate

icelancer•5mo ago

Been using WhisperX myself for years. The big factor is the diarization they offer through pyannote in the single package. I do like the software even if they make some weird choices and it has some configuration issues.

Allegedly Groq will be offering diarization with their cloud offering and super fast API which will be huge for those willing to go off-local.

braden-w•5mo ago

Awesome, thank you so much for bringing this to my attention! Always cool to see other open source projects that have better implementations :) much to learn!

Aachen•5mo ago

Wait, I'm confused. The text here says all data remains on device and emphasises how much you can trust that, that you're obsessed with local-first software, etc. Clicking on the demo video, step one is... configuring access tokens for external services? Are the services shown at 0:21 (Groq, OpenAI, Anthropic, Google, ElevenLabs) doing the actual transcription, listening to everything I say, and is only the resulting text that they give us subject to "it all stays on your device"? Because that's not at all what I expected after reading this description

IanCal•5mo ago

> All your data is stored locally on your device, and your audio goes directly from your machine to your chosen cloud provider (Groq, OpenAI, ElevenLabs, etc.) or local provider (Speaches, owhisper, etc.)

Their point is they aren’t a middleman with this, and you can use your preferred supplier or run something locally.

bangaladore•5mo ago

The issue is

> All your data is stored locally on your device,

is fundamentally incompatible with half of the following sentence.

I'd write it as

> All your data is stored locally on your device, unless you explicitly decide to use a cloud provider for dictation.

braden-w•5mo ago

Great correction, wish I could edit the post! Updated the README to reflect this.

Leftium•5mo ago

The local transcription feature via whisper.cpp was just released 2 hours ago: https://github.com/epicenter-so/epicenter/releases/tag/v7.3....

braden-w•5mo ago

Great catch Aachen, I should have clarified this better. The app supports both external APIs (Groq, OpenAI, etc.), and more recently local transcription (via whisper.cpp, OWhisper, Speaches, etc.), which never leaves your device.

Like Leftium said, the local-first Whisper C++ implementation just posted a few hours ago.

dang•5mo ago

We've edited the top text to make this clearer now. Thanks for pointing this out!

0xbadcafebee•5mo ago

Not a fan of high resource use or reliance on proprietary vendors/services. DeepSpeech/Vosk were pre-AI and still worked well on local devices, but they were a huge pain to set up and use. Anyone have better versions of those? Looks like one successor was Coqui STT, which then evolved into Coqui TTS which seems still maintained. Kaldi seems older but also still maintained.

edit: nvm, this overview explains the different options: https://www.gladia.io/blog/best-open-source-speech-to-text-m... and https://www.gladia.io/blog/thinking-of-using-open-source-whi...

braden-w•5mo ago

Sorry for the delayed response, thank you for sharing these articles! I agree. I hope that we get a lot better open-source STT options in the future.

hephaes7us•5mo ago

Thanks for sharing! Transcription suddenly became useful to me when LLMs started being able to generate somewhat useful code from natural language. (I don't think anybody wants to dictate code.) Now my workflow is similar to yours.

I have mixed feelings about OS-integration. I'm currently working on a project to use a foot-pedal for push-to-transcribe - it speaks USB-HID so it works anywhere without software, and it doesn't clobber my clipboard. That said, an app like yours really opens up some cool possibilities! For example, in a keyboard-emulation strategy like mine, I can't easily adjust the text prompt/hint for the transcription model.

With an application running on the host though, you can inject relevant context/prompts/hints (either for transcription, or during your post-transformations). These might be provided intentionally by the user, or, if they really trust your app, this context could even be scraped from what's currently on-screen (or which files are currently being worked on).

Another thing I've thought about doing is using a separate keybind (or button/pedal) that appends the transcription directly to a running notes file. I often want to make a note to reference later, but which I don't need immediately. It's a little extra friction to have to actually have my notes file open in a window somewhere.
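The notes-file idea could be as small as the following Python sketch, which appends each transcript as a timestamped line. The helper name, the list-item format, and the file location are all assumptions for illustration.

```python
from datetime import datetime
from pathlib import Path


def append_note(notes_file: Path, transcript: str) -> None:
    """Append one timestamped transcript line to a plain-text notes file."""
    text = transcript.strip()
    if not text:
        return  # skip empty transcriptions instead of writing blank entries
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M")
    notes_file.parent.mkdir(parents=True, exist_ok=True)
    # Open in append mode so existing notes are never overwritten.
    with notes_file.open("a", encoding="utf-8") as f:
        f.write(f"- [{stamp}] {text}\n")
```

Bound to its own key or pedal, something like `append_note(Path.home() / "notes" / "inbox.md", transcript)` would never touch the clipboard or need the notes file open in a window.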

Will keep an eye on epicenter, appreciate the ethos.

NDxTreme•5mo ago

If you want a rabbit hole to go down, look into cursorless, talonvoice and that whole sphere.

Actually dictating code, but they do it in a rather smart way.

braden-w•5mo ago

Thank you for the support, and agreed on OS-level integration. At least for me, I have trouble trusting any app unless they are open source and have a transparent codebase for audit :)

emacsen•5mo ago

Tried it with AppImage on Linux, attempted to download a model and "Failed to download model. An error occurred." but nothing that helps me track down the error :(

emacsen•5mo ago

Same with the deb. :(

braden-w•5mo ago

Thanks for flagging this, and sorry that this is happening! Does downloading the model manually work? I wonder if it's related to this:

https://github.com/epicenter-so/epicenter/issues/669

emacsen•5mo ago

I don't think it's the same error, but without a good error message I don't know.

I did manually download the models and associated them, which are great but then the audio didn't work. On the browser version, it never asks me for permission for an audio device, and on the native version, it makes a file of 0 length and then complains it can't read the contents.

My read is that the project looks very interesting, and I'd love a FLOSS replacement for Aqua Voice, but this software isn't ready for everyday use yet, at least not on Linux.

I'd love to help somehow, whether that's a donation or experimenting if you point me to somewhere.

Tmpod•5mo ago

I've been interested in dictation for a while, but I don't want to be sending any audio to a remote API, it all has to be local. Having tried just a couple of models (namely the one used by the FUTO Keyboard), I'm kinda feeling like we're not quite there yet.

My biggest gripe perhaps is not being able to get decent content out of a thought stream; the models can't properly filter out the pauses, "uuuuhmms", and much less so handle on the fly corrections to what I've been saying, like going back and repeating something with a slight variation and whatnot.

This is a challenging problem I'd love to see tackled well by open models I can run on my computer or phone. Are there new models more capable of this? Or is it not just a model thing, and am I missing a good app too?

In the meanwhile, I'll keep typing, even though it can be quite a bit less convenient to do; especially true for note taking on the go.

hephaes7us•5mo ago

Have you tried Whisper itself? It's open-weights.

One of the features of the project posted above is "transformations" that you can run on transcripts. They feed the text into an LLM to clean it up. If you're willing to pay for the tokens, I think you could not only remove filler-words, but could probably even get the semantically-aware editing (corrections) you're talking about.

braden-w•5mo ago

^Yep, unfortunately, the best option right now seems to be piping the output into another LLM to do some cleanup, which we try to help you do in Whispering. Recent transcription models don't have very good built-in inference/cleanup, with Whisper having only the very weak "prompt" parameter. It seems like this is probably by design to keep these models lean/specialized/performant in their task.

_345•5mo ago

By "try to help", do you mean that it currently does so, or that the functionality is otw?

Jarwain•5mo ago

Yes yes yes please so much yes.

I love the idea of epicenter. I love open source local-first software.

Something I've been hacking on for a minute would fit so well, if encryption wasn't a requirement for the profit model.

But uh yes thank you for making my life easier, and I hope to return the favor soon

braden-w•5mo ago

Thank you so much for the support! It really means a lot to me. And I can't wait to hear about what you're building. Feel free to DM me on Discord when the time comes :)

hn1986•5mo ago

excellent tool and easy to get started.

on win11, i installed ffmpeg using winget but it's not detecting it. running ffmpeg -version works but the app doesn't detect it.

one thing, how can we reduce the number of notifications received?

i like the system prompt option too.

braden-w•5mo ago

Thank you for the support! Sorry for the issues with FFmpeg. This is an active issue that we're tracking:

https://github.com/epicenter-so/epicenter/issues/674

We hope to fix notifications too. Thank you for the feedback, and happy to hear you liked the system prompt!

pabs3•5mo ago

Are there any speech-to-text models that are fully OSS for everything from training data/code to model weights?

https://salsa.debian.org/deeplearning-team/ml-policy

braden-w•5mo ago

Not that I know of. I think the two most prominent open-source models that we hear about are Whisper and Parakeet!

pabs3•5mo ago

Whisper doesn't list its training data (or code?), so can't be an open-source model, just an open weights model.

Parakeet does list its training data, and at least one of those is not FOSS, but some of them definitely are FOSS. I wonder if nVidia would create a fully FOSS model by retraining on only the open data.

https://huggingface.co/nvidia/parakeet-rnnt-1.1b#datasets https://catalog.ldc.upenn.edu/LDC2004T19 https://catalog.ldc.upenn.edu/license/ldc-non-members-agreem...

g48ywsJk6w48•5mo ago

Thank you for sharing such a great product. Last week, after getting fed up with a lot of slow commercial products, I wrote my own similar app that works locally in the loop and can record everything I say at the push of a button, transcribe it, and put it into the app itself. For me it was really important to create a second mode so I could speak everything I want in my mother tongue and have it translated into English automatically. Of course, it all works with formatting, with the placement of commas, quotes, etc. It is hard to believe that this hasn't been done in a native dictation app on macOS yet.

braden-w•5mo ago

Thank you so much for the support, really means a lot! Happy to hear that it has helped you with translation, and agreed, it's kinda crazy native dictation hasn't caught on yet. In the meantime, we have OSS to fill in the gaps.

Brajeshwar•5mo ago

I’m beginning to like the idea in this space — local first with a backup with your own tool. Recently, https://hyprnote.com was popular here on Hacker News and it is pretty good. They also do the same, works local-first but you can use your preferred tool too.

braden-w•5mo ago

Totally agreed, huge fan of Hyprnote as well. We work on two slightly different problems, but a lot of our tech has overlap, and our missions especially overlap :)

jryio•5mo ago

Does this functionality exist on iOS ? I'm looking for an iOS app that wraps Parakeet or whisper in a custom iOS keyboard.

That way I can switch to the dictation keyboard, press dictate, and have the transcription inserted in any application (first or third party).

MacWhisper is fantastic for macOS system dictation but the same abilities don't exist on iOS yet. The native iOS dictation is quite good but not as accurate with bespoke technical words / acronyms as Whisper cpp.

nchudleigh•5mo ago

superwhisper has that functionality.

jryio•5mo ago

Right but not running locally on device. No privacy

braden-w•5mo ago

I really want to run it locally on a phone, but as a developer it's scary to think about making a native mobile app and having to work with the iOS toolchain. I don't have bandwidth at the moment, but if anyone knows of any OSS mobile alternatives, feel free to drop them!

progx•5mo ago

Do additional scripts or other tools exist that can do the following:

Record the voice permanently (without a shortcut key), e.g. "run" compiles and runs a script, "code" switches back to the code editor.

Under Windows I use AutoHotKey2, but I would like to replace it with simple voice commands.

pstroqaty•5mo ago

If anyone's interested in a janky-but-works-great dictation setup on Linux, here's mine:

On key press, start recording microphone to /tmp/dictate.mp3:

  # Save up to 10 mins. Minimize buffering. Save pid
  ffmpeg -f pulse -i default -ar 16000 -ac 1 -t 600 -y -c:a libmp3lame -q:a 2 -flush_packets 1 -avioflags direct -loglevel quiet /tmp/dictate.mp3 &
  echo $! > /tmp/dictate.pid

On key release, stop recording, transcribe with whisper.cpp, trim whitespace and print to stdout:

  # Stop recording
  kill $(cat /tmp/dictate.pid)
  # Transcribe
  whisper-cli --language en --model $HOME/.local/share/whisper/ggml-large-v3-turbo-q8_0.bin --no-prints --no-timestamps /tmp/dictate.mp3 | tr -d '\n' | sed 's/^[[:space:]]*//;s/[[:space:]]*$//'

I keep these in a dictate.sh script and bind to press/release on a single key. A programmable keyboard helps here. I use https://git.sr.ht/~geb/dotool to turn the transcription into keystrokes. I've also tried ydotool and wtype, but they seem to swallow keystrokes.

  bindsym XF86Launch5 exec dictate.sh start
  bindsym --release XF86Launch5 exec echo "type $(dictate.sh stop)" | dotoolc

This gives a very functional push-to-talk setup.
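
For reference, the two snippets above could be folded into a single dictate.sh along these lines. This is a sketch under assumptions, not the actual script: the function names, the PID-file cleanup, and the wait loop are illustrative additions, while the ffmpeg and whisper-cli flags are copied unchanged from the snippets above.

```shell
#!/usr/bin/env bash
# Sketch of a combined dictate.sh (start/stop). Function names and the wait
# loop are illustrative assumptions; the ffmpeg and whisper-cli flags are
# copied unchanged from the snippets above.
AUDIO=/tmp/dictate.mp3
PIDFILE=/tmp/dictate.pid
MODEL="$HOME/.local/share/whisper/ggml-large-v3-turbo-q8_0.bin"

# Join lines and strip leading/trailing whitespace from stdin.
trim() {
  tr -d '\n' | sed 's/^[[:space:]]*//;s/[[:space:]]*$//'
}

start_recording() {
  # Save up to 10 mins, minimize buffering, remember the recorder's pid.
  ffmpeg -f pulse -i default -ar 16000 -ac 1 -t 600 -y -c:a libmp3lame \
    -q:a 2 -flush_packets 1 -avioflags direct -loglevel quiet "$AUDIO" &
  echo $! > "$PIDFILE"
}

stop_and_transcribe() {
  local pid
  pid="$(cat "$PIDFILE")" || return 1
  kill "$pid" 2>/dev/null
  # Assumption: wait until ffmpeg has exited so the file is fully written.
  while kill -0 "$pid" 2>/dev/null; do sleep 0.05; done
  rm -f "$PIDFILE"
  whisper-cli --language en --model "$MODEL" --no-prints --no-timestamps "$AUDIO" | trim
}

case "${1:-}" in
  start) start_recording ;;
  stop)  stop_and_transcribe ;;
esac
```

The start/stop split matches the two bindsym lines: press runs `dictate.sh start`, release runs `dictate.sh stop` and pipes the trimmed text on to dotoolc.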

I'm very impressed with https://github.com/ggml-org/whisper.cpp. Transcription quality with large-v3-turbo-q8_0 is excellent IMO and a Vulkan build is very fast on my 6600XT. It takes about 1s for an average sentence to appear after I release the hotkey.

I'm keeping an eye on the NVidia models, hopefully they work on ggml soon too. E.g. https://github.com/ggml-org/whisper.cpp/issues/3118.

genewitch•5mo ago

before i even bother opening that github: does it work on windows? so far, all of the whisper "clones" run poorly, if at all, on windows. I do have a 3060 and a 1070ti i could use just for whisper on linux, but i have this 3090 on my windows desktop that works "fine" for whisper, TTS, SD, LLM.

Whisper on windows, the openai-whisper, doesn't have these q8_0 models, it has like 8 models, and i always get an error about triton cores (something about timestamping i guess), which windows doesn't have. I've transcribed >1000 hours of audio with this setup, so i'm used to the workflow.

generalizations•5mo ago

If you want to stay near the bleeding edge with this stuff, you probably want to be on some kind of linux (or lacking that, Mac). Windows is where stuff just trickles down to eventually.

genewitch•5mo ago

you're not wrong; i really should try just running linux and seeing how good the steam gaming layer is these days. And if SDR# runs on linux under wine or whatever.

braden-w•5mo ago

Thanks for sharing a great alternative! It seems that that setup can go a long way for Linux users.

Y_Y•5mo ago

This is my favorite kind of software, to write and to use. I am reminded of Upton Sinclair's evergreen quote:

> It is difficult to get a man to understand something, when his salary depends upon his not understanding it!

in the special case where the thing to be understood is "your app doesn't need to be a Big Fucking Deal". Maybe it pleases some users to wrap this in layers of additional abstraction and chrome and clicky buttons and storefronts, but in the end the functionality is already there with a couple of FOSS projects glued together in a bash script.

I used to think the likes of Suckless were brutalist zealots, but more and more I think they (and the Unix patriarchs) were right and the path to enlightenment is expressed in plain text.

hn_throw2025•5mo ago

Thanks, looks like great work! Hope you continue to cater for those of us with Intel Macs who need the off-device capability…

divan•5mo ago

As many other people commented on similar projects, one of the issues of trying to use voice dictation instead of typing is the lack of real-time visual indication. When we write, we immediately see the text, which helps to keep the thought (especially in longer sentences/paragraphs). But with dictation, it either comes with a delay or only when dictation is over, and it doesn't feel as comfortable as writing. Tangentially, many people "think as they write" and dictation doesn't offer that experience.

I wonder if it changes with time for people who use dictation often.

archerx•5mo ago

I think there is still some use to dictation. For me it’s a great way to get screenplays on paper. I can type fast but I can think and speak faster. I just record a stream of thought of the story/video I want; even if I jump all over the place it doesn’t matter, it’s just a nice stream of consciousness. Afterwards I spend time editing, putting things in the right order, and cleaning up. I find this much faster than just writing.

I use whisperfile which is a multiplatform implementation of whisper that works really well.

https://huggingface.co/Mozilla/whisperfile

franga2000•5mo ago

There are many situations where dictation makes far more sense. Around here, all doctors dictate into a recorder (often with a foot pedal) that the nurse transcribes, because typing would be distracting and also unsanitary when examining the patient. Some have started using machine transcription, often in the cloud. This is terrible for privacy and security, even when it's "GDPR certified", whatever that means. Having a local option is amazing for that.

Similarly, I've used dictation when working on something physical, like reverse engineering some hardware, where my table is full of disassembled electronics, I might be carefully holding a probe or something like that, and having to put everything down just to write "X volts on probe Y" would slow me down.

mrgaro•5mo ago

I'd love to find a tool which could recognise a few different speakers so that I could automatically dictate 1:1 sessions. In addition, I definitely would want to feed that to an LLM to clean up the notes (to remove all "umm" and similar nonsense) and to do context-aware spell checking.

The LLM part should be very much doable, but I'm not sure if speaker recognition exists in a sufficiently working state?

torstenvl•5mo ago

Speaker "diarization" is what you're looking for, and currently the most popular solution is pyannote.audio.

Eventually I'm trying to get around to using it in conjunction with a fine-tuned whisper model to make transcriptions. Just haven't found the time yet.

ilyakaminsky•5mo ago

Shameless plug -- check out speechischeap.com

I spent three months perfecting the speaker diarization pipeline and I think you'll be quite pleased with the results.

diamondage•5mo ago

How well does it work with multiple languages?

okasaki•5mo ago

This is a cool project and I want to give it a go in my spare time.

However what gives me pause is the sheer number of possibly compromised microphones all around me (phones, tablets, laptops, tv etc) at all times, which makes spying much easier than if I use a keyboard.

PickledJesus•5mo ago

Great software. I've been using this since the start of this year and use it every day, initially out of frustration with ChatGPT and Claude not having proper voice support in their desktop versions, and then everywhere.

When you are in an environment where you can dictate, it really is a game changer. Not only is dictating much faster than typing, even if you're a fast typist, I find that you don't have the sticking problem of composing a message quite as much. It also makes my typing feel more like natural speech.

I have both the record and cancel actions bound to side buttons on my mouse, and paste to a third, the auto-paste feature is frustrating in my opinion.

I do miss having a taskbar icon to see if I'm recording or not. Sometimes I accidentally leave it running and sometimes the audio cues break until I restart it.

Transformations are great. Despite an extreme amount of prompt engineering, though, I can't seem to stop the transformation model from occasionally responding to my message rather than just transforming it.

braden-w•5mo ago

Thank you for the support! I'm glad to hear that it's been helping you since the start of the year. Totally agree on the transformation prompts. It's challenging to get the transformation model to not occasionally get short-circuited, especially when I end up having it format a dictated prompt. Instead of formatting, it executes the prompt.

Sorry to hear about the auto-paste feature and taskbar icons. We'll try to restore these in the future, and you can track taskbar here:

https://github.com/epicenter-so/epicenter/issues/607

blueboo•5mo ago

I used Whispering routinely last year; the value + glitches and ux failures drove me to gladly pay for Superwhisper; whose rough iPhone keyboard drove me to Wispr Flow (and tried otter too); whose poor transcriptions (oh THATS why they’re fast) drove me back to Superwhisper

Still lots of quality headroom in this space. I’ll def revisit whispering

jagermo•5mo ago

This earned an upvote for the fantastic readme / installation guide alone. Very well done.

mrbig0•5mo ago

Among all the offline transcription apps I've tried, my favorite remains https://whispernotes.app. High accuracy, one-time purchase, and genuinely offline. I love its clean UI.

Honestly, I'm getting tired of subscription-based apps. If it's truly offline, shouldn't it support a one-time purchase model? The whole point of local-first is that you're not dependent on ongoing cloud services, so why structure pricing like you are?

That said, will definitely give Whispering a try - always happy to see more open source alternatives in this space, especially with the local whisper.cpp integration that just landed.

braden-w•5mo ago

Thanks for shouting out some other great alternatives! The UI looks really clean.

Right now, the pricing is entirely free, and we are trying to expand our local model support to make it truly free. Subscriptions are up to the user right now.

Thanks for giving us a shot, and no pressure on using it! At the end of the day, I just want to build something that is open source and trustworthy, and hopefully will fit into the Epicenter ecosystem, the data layer that I talked about earlier in my post.

wahnfrieden•5mo ago

Do you not see the case for Patreon at all? IAP subscriptions can be a form of Patreon for ongoing maintenance and R&D, even for offline applications.

I understand the fatigue but not the outright indignation.

danmeier•5mo ago

Hot take: I think all these dictation tools are solving the wrong problem: they're optimizing for accurate transcription (and latency) when users actually need intelligent interpretation. For example: People don't speak in perfect emails. They speak in scattered thoughts and intentions that require contextual understanding.

fwip•5mo ago

Doesn't an accurate transcription make it easier to reach understanding?

braden-w•5mo ago

I totally agree with this hot take. Whispering is not there yet, but I eventually want it to store as many of the transcripts as plain text markdown, alongside your audio files, in a folder.

The idea is that as we add more local-first apps into the ecosystem (writing, etc.), they'll share this context. Transcription would benefit immensely if you also had a writing app that you could trust to store your data. To execute that vision, we needed a transcription app where we have control over how data is stored, and the best solution was to build our own.

jokethrowaway•5mo ago

I recommend adding support for nemo parakeet.

It's uncanny how good / fast it is

shinycode•5mo ago

It already exists with great execution :

https://github.com/kitlangton/Hex

It translates to proper language also

teiferer•5mo ago

Reposting here, maybe you missed it where I asked first:

You mentioned that you got into YC .. what is the road to profitability for your project(s) if everything is open source and local?

alnxdrawr•5mo ago

I would be very interested in a version of this that allows recording from both the microphone and audio at the same time. Then it could get plugged into WhisperX for diarization.

But even just having anything that's being said recorded would be outstanding

ajolly•5mo ago

The killer feature I'm still looking for is software that will do voice to text, but insert the text into the text box that was active when I started talking, not when I stopped.

That way I could click on a text box, start talking, and have the rest of my brain switch to doing a second task.
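
On X11, a rough approximation of this can be sketched with xdotool: remember the focused window when dictation starts, then send the text to that window afterwards. Everything here is an assumption for illustration (the helper names, the `TARGET_FILE` path, the use of xdotool at all). It targets the window rather than a specific text box, it does not work on Wayland, and some applications ignore the synthetic key events that `--window` sends.

```shell
#!/usr/bin/env bash
# Hypothetical X11-only sketch: remember the focused window when dictation
# starts, then send the transcript to that window when it ends.
TARGET_FILE="${TARGET_FILE:-/tmp/dictate.target}"

# Call when recording starts: store the id of the currently focused window.
remember_target() {
  xdotool getactivewindow > "$TARGET_FILE"
}

# Call when the transcript is ready: type it into the remembered window.
type_into_target() {
  local text="$1" win
  win="$(cat "$TARGET_FILE")" || return 1
  # --window sends synthetic key events straight to that window; some
  # applications ignore such events (see xdotool's SENDEVENT notes).
  xdotool type --window "$win" "$text"
}
```

A push-to-talk binding would call `remember_target` on key press and `type_into_target "$transcript"` once transcription finishes, regardless of which window has focus by then.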
