That is ok for what it brings. Nice program. Very "handy".
CPU-optimized speech recognition with Parakeet models
While it's certainly not an ML project in the sense that I'm not training models, the inference stack is just as important. The application does do inference using ONNX and whisper.cpp.
Most of the audio/inference code is Rust, or bindings to libraries like whisper.cpp.
https://addons.mozilla.org/en-CA/firefox/addon/read-aloud/
Read Aloud allows you to select from a variety of text-to-speech voices, including those provided natively by the browser, as well as by text-to-speech cloud service providers such as Google Wavenet, Amazon Polly, IBM Watson, and Microsoft. Some of the cloud-based voices may require additional in-app purchase to enable.
...
the shortcut keys ALT-P, ALT-O, ALT-Comma, and ALT-Period can be used to Play/Pause, Stop, Rewind, and Forward, respectively.
Amazing what it can do with only 82M parameters
Thank you!
FYI
Frontend: React + TypeScript with Tailwind CSS for the settings UI
Backend: Rust for system integration, audio processing, and ML inference
It offers all the different sizes of the OpenAI models too.
The default recommendation is Parakeet (mainly because it runs fast on a lot more hardware), but I definitely think people should experiment with different models and see what is best for them. Personally, I found Whisper Medium to be far better than Turbo and Large for my speech, and Parakeet is about on par with Medium, but each has its own quirks.
I'll update the site soon!
great onboarding too, using it now.
Very handy, thanks!
I mean... why would I want this app instead of some other app? Just because it's written in the language of the week? If it said "20% faster than xyz" it would be much better marketing than saying it's written in Rust, even though more than half the code is TypeScript.
To that group, saying something is "made in Rust" is equivalent to saying "it's modern, fast, secure, and made by an expert programmer, not some pleb who can't keep up with the times".
Quite the opposite. You have to be more of an expert programmer to achieve those same goals in C. Rust lowers the skill bar.
Anyways, I agree that the editorialization here is silly.
But also, I am unashamed that "in Rust" does increase my interest in a piece of software, for several of the reasons you mentioned.
I stated my need for help on the about page as well
> This is my first Rust project, and it shows. There are bugs, rough edges, and architectural decisions that could be better. I’m documenting the known issues openly because I want everyone to understand what they are getting into, and encourage improvement in the project.
HN definitely likes it when it's used in the correct context. Using Rust in the title is a soft promise of better-than-average reliability and quality. But it starts to get controversial when Rust is no longer purely the controlling part of the software. Then people start to complain, because it can become misleading marketing built on the promise Rust offers.
That makes sense, but there are also benefits to the ecosystem from writing a desktop application backend in Rust.
For me, I do tend to prefer apps written in Rust/Go (/C/other compiled languages), as they are usually less problematic to install (quite often a single binary; less of a headache compared to Python stuff, for example) and most of the time less resource-hungry (than anything JS/Electron based)... in the end, a "convenient shortcut to convey aforementioned benefits" :)
In case you also have a problem with not using the original HN link: https://news.ycombinator.com/item?id=44302416
(I think the first link is easier to read (CSS/formatting/dark mode), slightly more compact, and contains a link to the original HN post. It's also simple to recreate the HN link manually by inspecting the ID.)
I guess there's no way for the AppImage to use GPU compute, right? Not that it matters much because parakeet is fast enough on CPU anyway.
(I'm unfamiliar with AppImage. Was the model included in the app image, or was there a download after selecting the model?)
Parakeet is currently CPU-only.
I wanted speech-to-text in arbitrary applications on my Linux laptop, and I realized that loading the model was one of the slowest parts. So a daemon process, which toggles recording on/off via SIGUSR2, records using `pw-record`, passes the data to an already-loaded whisper model, and finally types the text using `ydotool`, turned out to be a relatively simple application to build: ~200 lines in Go, or ~150 in Rust (check the history for the Rust version).
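The toggle mechanism described above can be sketched in a few lines of shell. This is a hypothetical sketch, not the commenter's Go/Rust code: unlike the real daemon, it runs the transcriber per utterance instead of keeping the model loaded, and the `pw-record`, `whisper-cli`, and ydotool `type --file -` invocations are assumptions.

```shell
#!/bin/sh
# Sketch: a long-running process that toggles recording on SIGUSR2.
# The recorder/transcriber/typer are overridable via variables so the
# toggle logic can be exercised without audio hardware.
WAV="${WAV:-${TMPDIR:-/tmp}/stt-$$.wav}"
MODEL="${MODEL:-/usr/share/whisper.cpp-model-tiny.en-q5_1/ggml-tiny.en-q5_1.bin}"

transcribe_wav() {
    # -np suppresses everything except the transcription results
    whisper-cli -np -m "$MODEL" -f "$WAV" 2>/dev/null
}
type_text() {
    # ydotool's `type --file -` reads the text to type from stdin (assumed flag)
    ydotool type --file -
}
RECORD="${RECORD:-pw-record}"
TRANSCRIBE="${TRANSCRIBE:-transcribe_wav}"
TYPE="${TYPE:-type_text}"
REC_PID=""

toggle() {
    if [ -z "$REC_PID" ]; then
        # first signal: start capturing the microphone in the background
        "$RECORD" "$WAV" &
        REC_PID=$!
    else
        # second signal: stop capture, transcribe, flatten whitespace, type
        kill "$REC_PID" 2>/dev/null
        wait "$REC_PID" 2>/dev/null
        REC_PID=""
        "$TRANSCRIBE" | tr '\n' ' ' | sed -e 's/^ *//' -e 's/ *$//' | "$TYPE"
    fi
}

# Run as a daemon with: ./stt-daemon.sh run   (hypothetical script name)
if [ "${1:-}" = "run" ]; then
    trap toggle USR2
    # idle loop; background the sleep so a pending SIGUSR2 is handled promptly
    while :; do sleep 1 & wait $!; done
fi
```

Bind `pkill -USR2 -f stt-daemon` to a hotkey to toggle it one-handed, in the same spirit as the sway binding described further down.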
https://extensions.gnome.org/extension/8238/gnome-speech2tex...
- it sets an icon on the menubar
- it displays a window where I can choose which model to use
That's it. 120 MB for doing nothing.
Why doesn't Excel appear instantly, and why is it 2.29 GB now when Excel 98 for Mac was 154.31 MB? Why is a LAN transfer between two computers still as slow as in 1999 (~10 MB/s), when both can simultaneously download at over 100 MB/s? Don't get me started on GB-memory-hoarding browser tabs; although, when you think about it, the browser is managed well as a whole, holding 700+ tabs without complaining.
And what about logs? This is a new branch of philosophy: open the Console and witness the era of hyperreal siloxal, where computational potential expands asymptotically while user experience flatlines into philosophical absurdity.
A piece of evidence supporting this hypothesis: rsync (a program written by people who know their craft) on macOS does essentially the same job as Time Machine, but the former is orders of magnitude faster than the latter.
While the UI is doing "nothing", most of the bloat is not from the UI.
I once optimized a solution to produce an over-500x improvement. I can't write about how it came about, but it was much easier than initially expected.
See also: Wirth's Law: https://en.wikipedia.org/wiki/Wirth%27s_law
# whisper-live.sh: run once and it listens (blocking); run again and it stops listening.
# The trick: ffmpeg's stdin is redirected from the sentinel file whisper.quit.
# A second invocation writes "q" into that file, so ffmpeg reads a 'q' from its
# stdin and stops recording.
if ! test -f whisper.quit ; then
    # first invocation: create the sentinel and start listening
    touch whisper.quit
    notify-send -a whisper "listening"
    m="/usr/share/whisper.cpp-model-tiny.en-q5_1/ggml-tiny.en-q5_1.bin"
    # record from the default PulseAudio source, pipe WAV into whisper-cli,
    # then flatten the transcript to a single trimmed line
    txt="$(ffmpeg -hide_banner -loglevel -8 -f pulse -i default -f wav pipe:1 < whisper.quit \
        | whisper-cli -np -m "$m" -f - -otxt -sns 2>/dev/null \
        | tr \\n " " | sed -e 's/^\s*//' -e 's/\s\s*$//')"
    rm -f whisper.quit
    notify-send -a whisper "done listening"
    # type the transcript into the focused window
    printf %s "$txt" | wtype -
else
    # second invocation: tell the running ffmpeg to quit
    printf %s q > whisper.quit
fi
You can trivially modify it to use wl-copy to copy to the clipboard instead, if you prefer that over immediately sending the text to the current window. I set up sway to run a script like this on $mod+Shift+w so it can be done one-handed: not push-to-listen, but the script itself toggles the listen state on each invocation, so push once to start, again to stop.

In theory, Handy could be developed by hand-rolling assembly. Maybe even binary machine code.
- It would probably be much faster, smaller and use less memory. But...
- It would probably not be cross-platform (Handy works on Linux, macOS, and Windows)
- It would probably take years or decades to develop (Handy was developed by a single dev in single digit months for the initial version)
- It would probably be more difficult to maintain. Instead of re-using general purpose libraries and frameworks, it would all be custom code with the single purpose of supporting Handy.
- Also, Handy uses a large neural model for transcription. Such models are known to require a lot of RAM to perform well, since all of the model's weights must be loaded into memory. So most of the RAM is probably being used by the transcription model, and a hand-rolled assembly version could still use a lot of RAM...
tempodox•4mo ago
How do you clear the history of recordings?