For text/srt subtitles, translation would probably be easier. There's a plugin for that already if you're okay with online translation services: https://github.com/nopium/vlc-trans-lua
If set, the transcription output will be sent to the specified file or URL
(use one of the FFmpeg AVIO protocols); otherwise, the output will be logged as info messages.
The output will also be set in the "lavfi.whisper.text" frame metadata.
If the destination is a file and it already exists, it will be overwritten.
@item format
The destination format string; it could be "text" (only the transcribed text will be sent to the destination), "srt" (subtitle format) or "json".
Default value: @code{"text"}
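In practice, putting those options together, a run along these lines should write subtitles straight to an .srt file (only a rough sketch of the newly merged filter; the model path is an assumption, so check the option names against your build's docs):

ffmpeg -i input.mp4 -vn -af "whisper=model=ggml-base.en.bin:language=en:destination=out.srt:format=srt" -f null -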
I don't know if this can embed the subtitles, but it does support generating accompanying srt files. Of course, you could already do that by just manually calling whisper on files, but now you don't need to export parts or transformed media files to feed into whisper.
It's up to the site admin to configure it that way, but it's possible some IP ranges/user agents are more often used by bots and therefore have an increased weight.
For old browsers there's also an option to use meta refresh instead of JS (https://anubis.techaro.lol/docs/admin/configuration/challeng...) but that's quite a recent addition and not enabled by default.
I'm currently roaming in Finland with a Spanish SIM so would have expected the opposite in that case.
This page loaded pretty much instantly (certainly in the time it took to switch to the background tab I loaded it in). But then ffmpeg is written by old school engineers with old school ways of working. Their social media accounts are a hilarity of trolling worthy of Slashdot at its peak.
https://web.archive.org/web/20250813104007/https://code.ffmp...
You can read it on one of these without having to pass that specific bot check
With the current broken default config my browser can't even run the JS challenge due to it using unsupported bleeding edge JS features.
Should they add Voice Activity Detection? Are these separate filters or just making the whisper filter more fancy?
https://en.wikipedia.org/wiki/Whisper_(speech_recognition_sy...
From the documentation:
> It runs automatic speech recognition using the OpenAI's Whisper model.
E.g., if I say "I scream", it sounds phonetically identical to "Ice cream".
Yet the transcription of "I scream is the best dessert" makes a lot less sense than "Ice cream is the best dessert".
Doing this seems necessary to get both low latency and high accuracy; things like transcription on Android do it, and you can see the guesses adjust as you talk.
queue
The maximum size that will be queued into the filter before processing the audio with whisper. Using a small value the audio stream will be processed more often, but the transcription quality will be lower and the required processing power will be higher. Using a large value (e.g. 10-20s) will produce more accurate results using less CPU (as using the whisper-cli tool), but the transcription latency will be higher, thus not useful to process real-time streams. Consider using the vad_model option associated with a large queue value. Default value: "3"
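In practice, for offline files that suggests a large queue plus a VAD model, trading latency for accuracy, along these lines (a sketch; both model file names here are assumptions):

ffmpeg -i lecture.mkv -vn -af "whisper=model=ggml-base.en.bin:queue=20:vad_model=ggml-silero-v5.1.2.bin:destination=lecture.srt:format=srt" -f null -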
I don't think other streaming transcription services have this issue since, whilst they do chunk up the input, past chunks can still be edited. They tend to use "best of N" decoding, so there are always N possible outputs, each with a probability assigned, and as soon as one word is the same in all N outputs then it becomes fixed.
The internal state of the decoder needs to be duplicated N times, but that typically isn't more than a few kilobytes of state so N can be hundreds to cover many combinations of ambiguities many words back.
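A minimal sketch of that commit-on-agreement idea (illustrative only, not how any particular engine implements it):

def committed_prefix(hypotheses):
    # hypotheses: N candidate transcripts, each a list of words
    fixed = []
    for words in zip(*hypotheses):
        if all(w == words[0] for w in words):
            fixed.append(words[0])   # all N agree: this word is final
        else:
            break                    # first disagreement: the rest stays tentative
    return fixed

beams = ["i like ice cream a lot".split(),
         "i like ice cream allot".split(),
         "i like i scream a lot".split()]
print(committed_prefix(beams))       # ['i', 'like'] -- everything after is still open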
E.g. do transcription every 3 seconds, but transcribe the most recent 15s of audio (or less if it's the beginning of the recording).
This would increase processing requirements significantly, though. You could probably get around some of that with clever use of caching, but I don't think any (open) implementation actually does that.
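A rough sketch of that sliding-window approach (pseudocode-ish; `transcribe` stands in for whatever model call you use):

import collections

SAMPLE_RATE = 16000
WINDOW = 15 * SAMPLE_RATE            # keep the most recent 15 s of audio
buffer = collections.deque(maxlen=WINDOW)

def on_chunk(samples, transcribe):
    # called every ~3 s with the newly captured samples
    buffer.extend(samples)
    # re-transcribe the whole window each time, so earlier guesses
    # can still be revised while they remain inside the window
    return transcribe(list(buffer))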
https://tomwh.uk/git/whisper-chunk.git/
I need to get around to cleaning it up but you can essentially alter the number of simultaneous overlapping whisper processes, the chunk length, and the chunk overlap fraction. I found that the `tiny.en` model is good enough with multiple simultaneous listeners to be able to have highly accurate live English transcription with 2-3s latency on a mid-range modern consumer CPU.
Unfortunately, you're only getting attention in 3 second chunks.
That said, I haven't run into the icecream problem with Whisper. Plenty of other systems fail but Whisper just seems to get lucky and guess the right words more than anything else.
The Google Meet/Android speech recognition is cool but terribly slow in my experience. It also has a tendency to over-correct for some reason, probably because of the "best of N" system you mention.
I used Whisper last week to transcribe a phone call. In the transcript, the name of the person I was speaking with (Gem) was alternately transcribed as either "Jim" or "Jem", but never "Gem."
I find that in languages I don't speak well, my ability to understand degrades much more quickly as the audio quality goes down. But in my native language, even with piss poor audio quality, my brain fills in the garbled words with its prior expectation of what those words should be, based on context.
Fortunately I think in English, and it's an ever-evolving language, expanding as the world does. That's compared to the majority of people where I'm from: English was a second language they had to learn, and the people who taught them weren't well equipped with the resources to do a good job.
│
└── Dey well; Be well
A surprising number of monolingual people think their own language is the most adaptable and modern language, but this is obviously untrue. All languages evolve to fit the needs of speakers.
Also, the idea that people "think in language X" is heavily disputed. One obvious counterargument is that most people have experienced the feeling of being unable to put what they are thinking into words -- if you truly did think in the language you speak, how could that happen? My personal experience is that I do not actively hear any language in my head unless I actively try to think about it (at least, not since I was a teenager).
(This is all ignoring the comments about ESL speakers that I struggle to read as anything but racism. As someone who speaks multiple languages, it astounds me how many people seem to think that struggling to express something in your non-native language means that you're struggling to think and are therefore stupid.)
As far as how it happens to me: either something closer to speech than raw thought reports back that the data in shared memory is invalid for the selected language, or I find that no text representation exists for what I am trying to say.
The "raw" thoughts work in the currently active language for me, so at least in my case the strong Sapir-Whorf hypothesis is not even a hypothesis, just a reasonable verbalization closely matching my own observations.
I don't get why people can't accept it, even in the age of LLMs. It is what it is, and that old guy was just never right, not even once.
(then there's also a feedback loop type of argument, that always happens when discussing any sort of perception-reality distinction, but let's ignore that for now)
At least for me, my brain is so bad and it's hard for me to truly hold a single thought in my head for a long time. Maybe it eventually settles into my subconscious but I don't really have a way to verify that.
I'm not familiar with Whisper in particular, but typically what happens in an ASR model is that the decoder, speaking loosely, sees "the future" (i.e. the audio after the chunk it's trying to decode) in a sentence like this, and also has the benefit of a language model guiding its decoding so that grammatical productions like "I like ice cream" are favored over "I like I scream".
"How to wreck a nice beach you sing calm incense"
(Agree that the title is awesome, by the way!)
"Threesomes, with and without blame"
https://dl.acm.org/doi/10.1145/1570506.1570511
(From a professor I worked with a bit in grad school)
Do those born profoundly deaf specifically study word sounds in order to understand/create puns, rhymes and such so they don't need assistance understanding narrative mishearings?
It must feel like a form of abstract mathematics without the experiential component... but then I suspect mathematicians manufacture an experiential phenomenon out of their abstractions, with their claims of a beauty like music... hmm!
The book "Feersum Endjinn" by Iain M. Banks uses something like this for one of its characters to quite good effect.
And when I'm watching subtitles in my own language (say because I want the volume low so I'm not disturbing others), I hate when the words I see don't match the words I hear. It's the quickest way I can imagine to get sucked out of the content and into awareness of the delivery of the content.
Sometimes they're edited down simply for space, because there wouldn't be time to easily read all the dialog otherwise. And sometimes repetition of words or phrases is removed, because it's clearer, and the emphasis is obvious from watching the moving image. And filler words like "uh" or "um" generally aren't included unless they were in the original script.
Most interestingly, swearing is sometimes toned down, just by skipping it -- removing an f-word in a sentence or similar. Not out of any kind of puritanism, but because swear words genuinely come across as more powerful in print than they do in speech. What sounds right when spoken can sometimes look like too much in print.
Subtitles are an art. Determining when to best time them, how to split up long sentences, how to handle different speakers, how to handle repetition, how to handle limited space. I used to want subtitles that were perfectly faithful to what was spoken. Then I actually got involved in making subtitles at one point, and was very surprised to discover that perfectly faithful subtitles didn't actually do the best job of communicating meaning.
Fictional subtitles aren't court transcripts. They serve the purpose of storytelling, which is the combination of a visible moving image full of emotion and action, and the subtitles. Their interplay is complex.
That's the thing though, subtitles aren't intended as full transcripts. They are intended to allow a wide variety of people to follow the content.
A lot of people read slower than they would hear speech. So subtitles often need to condense or rephrase speech to keep pace with the video. The goal is usually to convey meaning clearly within the time available on screen. Not to capture every single word.
If they tried to be fully verbatim, you'd either have subtitles disappearing before most viewers could finish reading them or large blocks of text covering the screen. Subtitlers also have to account for things like overlapping dialogue, filler words, and false starts, which can make exact transcriptions harder to read and more distracting in a visual medium.
I mean, yeah in your own native language I agree it sort of sucks if you can still hear the spoken words as well. But, to be frank, you are also the minority group here as far as subtitle target audiences go.
And to be honest, if they were fully verbatim, I'd wager you quickly would be annoyed as well. Simply because you will notice how much attention they then draw, making you less able to actually view the content.
If you are too slow at reading subtitles, you can either slow down the video or train yourself to read faster. Or you can just disable the subtitles.
Unless it was trained end-to-end on Dutch-subtitled English text?? Which might make the translation a somewhat inextricable part of the model..? Does anyone know?
That's anecdotally how I feel and how I interpret my own brain to work, so it could be different from how interpreters work or how actual human brains work, but as far as I can see, professional simultaneous interpreters don't seem to be agnostic to the relevant pair of languages at all.
"Madam, please believe me, maine homework kiya ha" [I did my homework].
I've seen professionally produced recordings on dry and technical subjects with good sound quality where they've decided to use distracting sub-titles with no way to disable them.
It seems so unnecessary if you're not making novelty videos about cats.
Also, local transcription allows for automatic translation, and again, overlaying subtitles on top of an existing burnt-in set is a really poor reading experience.
I don't understand why the problem seems so pervasive (I've seen it on Netflix, Viki, and Apple TV, at least) and so transient.
I think it's a toolkit thing where some sort of event or timer goes off at the wrong time and the subtitles get cleared when they shouldn't. And then if you rewind and replay, it doesn't happen again (because spurious event/timer issue).
I don't disagree, yet here we are. It's got race condition vibes.
I don't know if it's related to the TV OS (LG WebOS in our case) but I guess that would be the common factor since it happens across multiple apps and languages.
Anyway, it's quirky and occasionally annoying, but that's about it. :)
Must be a union thing.
It's also annoying that you have to pay for Netflix when you can get the same movies for free with less restrictions on a pirate site.
Those are still cool IMO
https://kyutai.org/next/stt is natively streaming STT.
I own a couple very old and as far as I'm aware never translated Japanese movies. I don't speak Japanese but I'd love to watch them.
A couple of years ago I had been negotiating with a guy on Fiverr to translate them. At his usual rate per minute of footage it would have cost thousands of dollars, but I'd negotiated him down to a couple hundred before he presumably got sick of me and ghosted me.
It's decent for classification but poor at transcription.
It also doesn't understand context, so it makes a lot of the errors you see in automatic translations of YouTube videos, for example.
I found an interesting article about trollsubs, which I guess are fansubs made with a contemptuous flair. https://neemblog.home.blog/2020/08/19/the-lost-art-of-fan-ma...
Tangent: I'm one of those people who watch movies with closed captions. Anime is difficult because the subtitle track is often the original Japanese-to-English subtitles and not closed captions, so the text does not match the English audio.
The conversion process from pronunciation to intended text is not deterministic either, so it probably can't be solved by "simply" generating all-pronunciation outputs. Maybe a multimodal LLM as ASR/STT, or a novel dual input as-spoken+estimated-text validation model could be made? I wouldn't know, though. It seemed like a semi-open question.
You can also transcribe it to Japanese and use a translator to convert to English. This can sometimes help for more semantically complex dialogue.
For example, using faster-whisper-xxl [1]:
Direct translation:
faster-whisper-xxl.exe --language English --model large-v2 --ff_vocal_extract mdx_kim2 --vad_method pyannote_v3 --standard <input>
Use Japanese, then translate:
faster-whisper-xxl.exe --language Japanese --task translate --model large-v2 --ff_vocal_extract mdx_kim2 --vad_method pyannote_v3 --standard <input>
1. https://github.com/Purfview/whisper-standalone-win
Another option is to use something like VideoToTextAI, which lets you transcribe quickly and then translate into 100+ languages, and then export the subtitle (SRT) file.
│
└── Dey well; Be well
https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/20022#issuecomme...
People should check out Subtitle Edit (and throw the dev some money) which is a great interface for experimenting with Whisper transcription. It's basically Aegisub 2.0, if you're old, like me.
HOWTO:
Drop a video or audio file to the right window, then go to Video > Audio to text (Whisper). I get the best results with Faster-Whisper-XXL. Use large-v2 if you can (v3 has some regressions), and you've got an easy transcription and translation workflow. The results aren't perfect, but Subtitle Edit is for cleaning up imperfect transcripts with features like Tools > Fix common errors.
EDIT: Oh, and if you're on the current gen of Nvidia card, you might have to add "--compute_type float32" to make the transcription run correctly. I think the error is about an empty file or output, something like that.
EDIT2: And if you get another error, possibly about whisper.exe, IIRC I had to reinstall the Torch libs from a specific index, with something along these lines (depending on whether you use pip or uv):
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
uv pip install --system torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
If you get the errors and the above fixes work, please type your error message in a reply with what worked, to help those who come after. Or at least the web crawlers, for those searching for help.
But transcribing and passably translating everything goes a long way too. Even if you can hear what's being said, it's still less straining when there are captions for it.
Obviously one important factor in the convenience is how fast your computer is at transcription or translation. I don't currently use these features in real time myself, although I'd like to if a great UX comes along in other software.
There's also a great podcast app opportunity here I hope someone seizes.
I also used whisper.cpp to transcribe all my hoarded podcast episodes. It took days of my poor old CPU working at 100% on all cores (and then a few shorter runs to transcribe new episodes I have downloaded since). It worked as well as I could possibly hope. Of course it gets the spelling of names wrong, but I don't expect anything (or anyone) to do much better. It is great to be able to run ripgrep to find old episodes on some topic, and sometimes now I read an episode instead of listening, or listen to it with mpv with subtitles.
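In case anyone wants to do the same, the whole batch job boils down to something like this (a sketch: whisper.cpp wants 16 kHz mono WAV, the model path is an assumption, the binary is whisper-cli in current builds and main in older ones, and -osrt writes an .srt next to each input):

for f in podcasts/*.mp3; do
    ffmpeg -i "$f" -ar 16000 -ac 1 "${f%.mp3}.wav"
    ./whisper-cli -m models/ggml-base.en.bin -f "${f%.mp3}.wav" -osrt
done
rg -i "some topic" podcasts/*.srt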
10 years ago you'd be searching through random databases to see if someone had synchronized subtitles for the exact copy of the video that you had. Or older lecture videos that don't have transcripts. Many courses had to, in order to comply with federal funding, but not all. And lots of international courses don't have this requirement at all (for example some great introductory CS/maths courses from German + Swiss institutions). Also think about taking this auto generated output and then generating summaries for lecture notes, reading recommendations - this sort of stuff is what LLMs are great at.
You can do some clever things like take the foreign sub, have Whisper also transcribe it and then ask a big model like Gemini to go line by line and check the translation to English. This can include accounting for common transcription errors or idiomatic differences between languages. I do it in Cursor to keep track of what the model has changed and for easy rollback. It's often good enough to correct mis-heard words that would be garbled through a cheaper model. And you can even query the model to ask why a particular translation was made and what would be a more natural way to say the same thing. Sometimes it even figures out jokes. It's not a fast or fully automatic process, but the quality can be extremely good if you put some time into reviewing.
Having 90% of this be possible offline/open access is also very impressive. I've not tried newer OSS models like Qwen3 but I imagine it'd do a decent job of the cleanup.
uv has a feature to get the correct version of torch based on your available cuda (and some non-cuda) drivers (though I suggest using a venv not the system Python):
> uv pip install torch torchvision torchaudio --torch-backend=auto
More details: https://docs.astral.sh/uv/guides/integration/pytorch/#automa...
This also means you can safely mix torch requirements with non-torch requirements as it will only pull the torch related things from the torch index and everything else from PyPI.
But when I hear about these kinds of extras, it makes me even more excited. Getting CUDA and torch to work together is something I have struggled with countless times.
The team at Astral should be nominated for a Nobel Peace Prize.
One life-changing thing I've been using `uv` for:
System python version is 3.12:
$ python3 --version
Python 3.12.3
A script that requires a library we don't have, and won't work on our local python:
$ cat test.py
#!/usr/bin/env python3
import sys
from rich import print
if sys.version_info < (3, 13):
print("This script will not work on Python 3.12")
else:
print(f"Hello world, this is python {sys.version}")
It fails:
$ python3 test.py
Traceback (most recent call last):
File "/tmp/tmp/test.py", line 10, in <module>
from rich import print
ModuleNotFoundError: No module named 'rich'
Tell `uv` what our requirements are:
$ uv add --script=test.py --python '3.13' rich
Updated `test.py`
`uv` updates the script:
$ cat test.py
#!/usr/bin/env python3
# /// script
# requires-python = ">=3.13"
# dependencies = [
# "rich",
# ]
# ///
import sys
from rich import print
if sys.version_info < (3, 13):
print("This script will not work on Python 3.12")
else:
print(f"Hello world, this is python {sys.version}")
`uv` runs the script, after installing packages and fetching Python 3.13:
$ uv run test.py
Downloading cpython-3.13.5-linux-x86_64-gnu (download) (33.8MiB)
Downloading cpython-3.13.5-linux-x86_64-gnu (download)
Installed 4 packages in 7ms
Hello world, this is python 3.13.5 (main, Jun 12 2025, 12:40:22) [Clang 20.1.4 ]
And if we run it with Python 3.12, we can see that it errors:
$ uv run --python 3.12 test.py
warning: The requested interpreter resolved to Python 3.12.3, which is incompatible with the script's Python requirement: `>=3.13`
Installed 4 packages in 7ms
This script will not work on Python 3.12
Works for any Python you're likely to want:
$ uv python list
cpython-3.14.0b2-linux-x86_64-gnu <download available>
cpython-3.14.0b2+freethreaded-linux-x86_64-gnu <download available>
cpython-3.13.5-linux-x86_64-gnu /home/dan/.local/share/uv/python/cpython-3.13.5-linux-x86_64-gnu/bin/python3.13
cpython-3.13.5+freethreaded-linux-x86_64-gnu <download available>
cpython-3.12.11-linux-x86_64-gnu <download available>
cpython-3.12.3-linux-x86_64-gnu /usr/bin/python3.12
cpython-3.12.3-linux-x86_64-gnu /usr/bin/python3 -> python3.12
cpython-3.11.13-linux-x86_64-gnu /home/dan/.local/share/uv/python/cpython-3.11.13-linux-x86_64-gnu/bin/python3.11
cpython-3.10.18-linux-x86_64-gnu /home/dan/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/bin/python3.10
cpython-3.9.23-linux-x86_64-gnu <download available>
cpython-3.8.20-linux-x86_64-gnu <download available>
pypy-3.11.11-linux-x86_64-gnu <download available>
pypy-3.10.16-linux-x86_64-gnu <download available>
pypy-3.9.19-linux-x86_64-gnu <download available>
pypy-3.8.16-linux-x86_64-gnu <download available>
graalpy-3.11.0-linux-x86_64-gnu <download available>
graalpy-3.10.0-linux-x86_64-gnu <download available>
graalpy-3.8.5-linux-x86_64-gnu <download available>
It enables dictation that actually works and it's as fast as you can think. I also have a set of scripts which just wait for voice commands and do things. I can pipe the results to an LLM, run commands, synthesize a voice with F5-TTS back and it's like having a local Jarvis.
The main limitation is that it's English-only.
winget install --id=Nikse.SubtitleEdit -e
Last I looked into it, the main options required API access to external services, which put me off. I think it was pyannote.audio[1].
whisperx input.mp3 --language en --diarize --output_format vtt --model large-v2
Works a treat for Zoom interviews. Diarization is sometimes a bit off, but generally it's correct.
Run Whisper audio transcriptions with one FFmpeg command
https://medium.com/@vpalmisano/run-whisper-audio-transcripti...
Posted here, with 0 comments: https://news.ycombinator.com/item?id=44869254
Reminds me of one of my own experiences with one of the Whisper models, where some random noise in the middle of the conversation was translated into "Don't forget to like and subscribe".
Really illustrates where the training data is coming from.
ffmpeg -f pulse -i "$(pactl get-default-source)" -t 5 -f wav -ar 16000 -ac 1 -c:a pcm_s16le - \
| ./main - \
| head -2 \
| tail -1 \
| cut -d] -f2 \
| awk '{$1=$1};1'
The reading-from-mic part (-f pulse, pactl...) is Linux-specific; the rest of it should be cross-platform. The `main` executable is the whisper.cpp executable (see the whisper.cpp GitHub readme; it's just the output of `make base.en` from that).
Edit: -t 5 controls the recording duration.
Oh and add 2>/dev/null to silence the debug output. I copied this from a pipe that further sends it into an LLM that then looks at the meaning and turns it into a variety of structured data (reminders, todo items, etc) which I then....
> which I then....
Yes, please, go on...
The LLM can screw up now and then and output absolute garbage. But I've got a knack now for figuring out what prompts it's gonna be hopeless on and I manually enter those.
Example:
Saying
Remove makhana from shopping list
Ends up running the command
gkeep items edit shopping_list --check makhana
There is a direct text interface too that skips the voice transcription.
The main thing is that it does this in a background window, without interrupting my screen or me needing to wait for whatever slow webpage to load. I had it do a few things on GitHub, like remind me when checks pass on PRs. You could potentially connect it to various things, like your Amazon account to check on your order, etc... As I write this I now realise I did what basically amounts to what folks do with MCP today. Maybe I should update it to use the protocol.
These days I have a little more idle time as a grad student than I did in a tech company, and I don't really need to manage home/cooking/... so I don't really use some of the more complicated features. I mostly just use it to schedule 1on1s with my guide and add reminders about assignments and TA work and talks and my music class.
Anyone found a way?
I could share a python script that is working pretty reliably for me.
https://code.ffmpeg.org/FFmpeg/FFmpeg/issues
I still see their old one too, but the Forgejo one is nice.
Basically a simple audio-to-text for personal use?
I tried several times to get this into a reasonable shape, but all have been failures. If anyone has pointers I really appreciate it.
Other than for the "live transcription" use case (which they made unnecessarily complicated), I don't see how this is any better than running whisper.cpp directly. Other people in this thread are basically saying "ffmpeg's interface is better understood" [2], but LLMs make that point moot since you can just ask them to do the drudgery for you.
[1] https://medium.com/@vpalmisano/run-whisper-audio-transcripti...
That said, I suppose I'm glad they're concentrating on making the ffmpeg code better rather than fixing bugs in the web interface for the development tracker. Having whisper integrated will be really useful. I'm already imagining automatic subtitle generation... imagining because I can't read the page or the code to know what it is.
1. git clone whisper.cpp
2. Make sure they have all dependencies for `that` library
3. Hope the build passes
4. Download the actual model
AND only then be able to use the `-af "whisper=model=..."` filter.
If they try to use the filter without all the prereqs they'll fail and it'll create frustration.
It'd be better to natively create a Whisper avfilter and only require the user to download the model -- I feel like this would streamline the whole process and actually make people use it much more.
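For reference, the current dance looks roughly like this (a sketch; exact build commands vary, and it assumes an FFmpeg already built with the whisper filter enabled):

git clone https://github.com/ggml-org/whisper.cpp
cd whisper.cpp
cmake -B build && cmake --build build        # steps 2-3: dependencies, hope it passes
sh ./models/download-ggml-model.sh base.en   # step 4: fetch the model
ffmpeg -i input.mp4 -af "whisper=model=models/ggml-base.en.bin:destination=out.srt:format=srt" -f null -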
brew install uv
uv tool install openai-whisper
then add ~/.local/bin/ to $PATH
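After that it's a one-liner per file, e.g. (the model and output format here are just one choice):

whisper interview.mp3 --model small --output_format srt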
https://developer.apple.com/documentation/speech/speechtrans...
https://developer.apple.com/documentation/speech/speechanaly...
https://www.macstories.net/stories/hands-on-how-apples-new-s...