We built OWhisper for two reasons (also outlined at https://docs.hyprnote.com/owhisper/what-is-this):
(1) While working with on-device, realtime speech-to-text, we found there was no practical tooling for downloading and running the models.
(2) We also got frequent requests for a way to plug custom STT endpoints into the Hyprnote desktop app, just like you can with OpenAI-compatible LLM endpoints.
Part (2) is still kind of WIP, but we spent some time writing docs, so skimming them will give you a good idea of what it will look like.
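To make (2) concrete: the idea is that any server speaking the same streaming protocol could be swapped in. Here is a minimal client sketch in Python; since this part is WIP, the URL, close message, and response schema are all assumptions for illustration, not the actual API (the docs above are the source of truth):

```python
# Sketch only: the endpoint URL and message schema below are assumptions
# for illustration, not the real OWhisper/Hyprnote API.
import asyncio
import json

import websockets  # pip install websockets


async def stream_audio(url, chunks):
    """Stream raw PCM chunks to an STT endpoint, printing transcripts as they arrive."""
    async with websockets.connect(url) as ws:

        async def send():
            for chunk in chunks:  # chunks: iterable of raw audio bytes
                await ws.send(chunk)
            await ws.send(json.dumps({"type": "CloseStream"}))  # assumed close message

        async def receive():
            async for message in ws:
                result = json.loads(message)
                print(result.get("transcript", ""))  # assumed response shape

        await asyncio.gather(send(), receive())


# asyncio.run(stream_audio("ws://localhost:8080/v1/listen", my_chunks))
```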
For (1), you can try it now (https://docs.hyprnote.com/owhisper/cli/get-started):
```bash
brew tap fastrepl/hyprnote && brew install owhisper
owhisper pull whisper-cpp-base-q8-en
owhisper run whisper-cpp-base-q8-en
```
If you're tired of Whisper, we also support Moonshine :)
Give it a shot: `owhisper pull moonshine-onnx-base-q8`
We're here and looking forward to your comments!
yujonglee•5mo ago
This is the list of local models it supports:
- whisper-cpp-base-q8
- whisper-cpp-base-q8-en
- whisper-cpp-tiny-q8
- whisper-cpp-tiny-q8-en
- whisper-cpp-small-q8
- whisper-cpp-small-q8-en
- whisper-cpp-large-turbo-q8
- moonshine-onnx-tiny
- moonshine-onnx-tiny-q4
- moonshine-onnx-tiny-q8
- moonshine-onnx-base
- moonshine-onnx-base-q4
- moonshine-onnx-base-q8
phkahler•5mo ago
To me, STT should take a continuous audio stream and output a continuous text stream.
yujonglee•5mo ago
Whisper and Moonshine both work on chunks, but for Moonshine:
> Moonshine's compute requirements scale with the length of input audio. This means that shorter input audio is processed faster, unlike existing Whisper models that process everything as 30-second chunks. To give you an idea of the benefits: Moonshine processes 10-second audio segments 5x faster than Whisper while maintaining the same (or better!) WER.
Also, with Kyutai, you can feed continuous audio in and get continuous text out.
- https://github.com/moonshine-ai/moonshine
- https://docs.hyprnote.com/owhisper/configuration/providers/k...
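To make the difference concrete, here's a rough sketch of the two interfaces in Python. The `model.decode` and `model.feed` methods are hypothetical stand-ins, not real APIs from any of these libraries:

```python
from typing import Iterable, Iterator

# Hypothetical interfaces, for illustration only.


def transcribe_chunked(model, audio: Iterable[bytes], chunk_seconds: float = 10.0) -> Iterator[str]:
    """Chunk-based (Whisper/Moonshine style): buffer audio, decode one chunk at a time."""
    buffer = bytearray()
    bytes_per_chunk = int(16000 * 2 * chunk_seconds)  # 16 kHz, 16-bit mono PCM
    for frame in audio:
        buffer.extend(frame)
        while len(buffer) >= bytes_per_chunk:
            chunk, buffer = buffer[:bytes_per_chunk], buffer[bytes_per_chunk:]
            yield model.decode(bytes(chunk))  # one transcript per chunk


def transcribe_streaming(model, audio: Iterable[bytes]) -> Iterator[str]:
    """Streaming (Kyutai style): text comes out as frames go in."""
    for frame in audio:
        for token in model.feed(frame):  # may yield zero or more tokens per frame
            yield token
```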
mijoharas•5mo ago
(maybe with an `owhisper serve` somewhere else to start the model running or whatever.)
yujonglee•5mo ago
For just transcribing file/audio,
`owhisper run <MODEL> --file a.wav` or
`curl https://something.com/audio.wav | owhisper run <MODEL>`
might make sense.
mijoharas•5mo ago
yujonglee•5mo ago
https://github.com/fastrepl/hyprnote/blob/8bc7a5eeae0fe58625...
ctbellmar•5mo ago
https://github.com/bikemazzell/skald-go/
Just speech to text, CLI only, and it can paste into whatever app you have open.
mijoharas•5mo ago
What exactly does the silence detection mean? Does that mean it'll wait until a pause, then send the audio off to Whisper and return the output (and stop the process)? Same question with continuous: does that just mean it keeps going until Ctrl+C?
Nvm, answered my own question; looks like yes for both [0][1]. Cool, this seems pretty great actually.
[0] https://github.com/bikemazzell/skald-go/blob/main/pkg/skald/...
[1] https://github.com/bikemazzell/skald-go/blob/main/pkg/skald/...
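For anyone else curious, silence detection like this is typically just an energy threshold over incoming frames. A toy version in Python (the thresholds are made up, and skald-go's actual implementation is at the links above):

```python
import array
import math

SILENCE_THRESHOLD = 500  # RMS level counted as silence; tune per mic/environment
SILENCE_FRAMES = 30      # consecutive quiet frames that end an utterance


def rms(frame: bytes) -> float:
    """Root-mean-square level of a frame of 16-bit signed PCM."""
    samples = array.array("h", frame)
    return math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))


def utterances(frames):
    """Group raw PCM frames into utterances separated by silence."""
    buffered, quiet = [], 0
    for frame in frames:
        quiet = quiet + 1 if rms(frame) < SILENCE_THRESHOLD else 0
        buffered.append(frame)
        if quiet >= SILENCE_FRAMES and len(buffered) > quiet:
            yield b"".join(buffered)  # hand this off to the STT model
            buffered, quiet = [], 0
```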
zveyaeyv3sfye•5mo ago
The short duration effectively means that the transcription will start producing nonsense as soon as a sentence is cut in the middle.
alkh•5mo ago
yujonglee•5mo ago
shekhar101•5mo ago
But the base-q8 works (and works quite well!). The TUI is really nice. Speaker diarization would make it almost perfect for me. Thanks for building this.
yujonglee•5mo ago