@item destination
If set, the transcription output will be sent to the specified file or URL
(use one of the FFmpeg AVIO protocols); otherwise, the output will be logged as info messages.
The output will also be set in the "lavfi.whisper.text" frame metadata.
If the destination is a file and it already exists, it will be overwritten.
@item format
The destination format string; it can be "text" (only the transcribed text will be sent to the destination), "srt" (SubRip subtitle format) or "json".
Default value: @code{"text"}
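For what it's worth, a minimal sketch of how the destination/format options might be used, driving ffmpeg from Python. It assumes an FFmpeg build with the whisper filter enabled; the `model` option name and the model filename are assumptions not shown in the quoted docs:

```python
# Illustrative only: run FFmpeg's whisper audio filter and write an SRT file.
# Assumes an FFmpeg build with the whisper filter; "destination"/"format" come from
# the docs above, while the "model" option name and filename are assumptions.
import subprocess

filtergraph = (
    "whisper="
    "model=ggml-base.en.bin:"  # whisper.cpp model path (assumed option/filename)
    "destination=out.srt:"     # send output to a file instead of the info log
    "format=srt"               # "text", "srt" or "json" per the docs above
)

subprocess.run(
    ["ffmpeg", "-i", "input.mp4", "-vn", "-af", filtergraph, "-f", "null", "-"],
    check=True,
)
```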
I don't know if this can embed the subtitles, but it does support generating accompanying SRT files. Of course, you could already do that by manually calling whisper on files, but now you don't need to export parts or transformed media files to feed into whisper.
It's up to the site admin to configure it that way, but it's possible some IP ranges/user agents are more often used by bots and therefore have an increased weight.
For old browsers there's also an option to use meta refresh instead of JS (https://anubis.techaro.lol/docs/admin/configuration/challeng...) but that's quite a recent addition and not enabled by default.
I'm currently roaming in Finland with a Spanish SIM so would have expected the opposite in that case.
https://web.archive.org/web/20250813104007/https://code.ffmp...
You can read it on one of these without having to pass that specific bot check.
Should they add Voice Activity Detection? Would these be separate filters, or would it just make the whisper filter fancier?
https://en.wikipedia.org/wiki/Whisper_(speech_recognition_sy...
From the documentation:
> It runs automatic speech recognition using the OpenAI's Whisper model.
E.g. if I say "I scream", it sounds phonetically identical to "Ice cream".
Yet the transcription of "I scream is the best dessert" makes a lot less sense than "Ice cream is the best dessert".
Doing this seems necessary to get both low latency and high accuracy; things like live transcription on Android do it, and you can see the guesses adjust as you talk.
@item queue
The maximum size that will be queued into the filter before processing the audio with whisper. With a small value, the audio stream is processed more often, but transcription quality is lower and more processing power is required. A large value (e.g. 10-20 seconds) produces more accurate results with less CPU (similar to using the whisper-cli tool), but transcription latency is higher, making it unsuitable for real-time streams. Consider combining the vad_model option with a large queue value. Default value: @code{"3"}
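In the same hypothetical Python wrapper as above, tuning for offline accuracy along these lines might look like the following; the VAD model filename and any option spellings beyond those quoted are assumptions:

```python
# Same illustrative invocation as before, tuned for offline accuracy: a large queue
# plus a VAD model, as the docs suggest. Filenames and the "model" option are assumed.
import subprocess

filtergraph = (
    "whisper=model=ggml-base.en.bin"   # whisper.cpp model (assumed)
    ":vad_model=ggml-silero-vad.bin"   # VAD model to pair with a large queue (filename assumed)
    ":queue=20"                        # buffer ~20 s of audio before each transcription pass
    ":destination=out.srt:format=srt"
)
subprocess.run(["ffmpeg", "-i", "input.mp4", "-vn", "-af", filtergraph, "-f", "null", "-"], check=True)
```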
I don't think other streaming transcription services have this issue since, whilst they do chunk up the input, past chunks can still be edited. They tend to use "best of N" decoding, so there are always N possible outputs, each with a probability assigned, and as soon as a word is the same in all N outputs it becomes fixed.
The internal state of the decoder needs to be duplicated N times, but that typically isn't more than a few kilobytes of state so N can be hundreds to cover many combinations of ambiguities many words back.
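A rough Python sketch of that "fix a word once all N hypotheses agree on it" idea (purely illustrative; the hypotheses themselves would come from beam search or N-best decoding):

```python
# Sketch: keep N running hypotheses and only commit the prefix they all agree on.
# Purely illustrative; generating the hypotheses is left to the decoder.

def committed_prefix(hypotheses: list[list[str]]) -> list[str]:
    """Longest common prefix of words shared by every hypothesis."""
    prefix = []
    for words in zip(*hypotheses):
        if all(w == words[0] for w in words):
            prefix.append(words[0])
        else:
            break
    return prefix

# Example: three beam outputs for the audio received so far.
beams = [
    ["i", "like", "ice", "cream", "a", "lot"],
    ["i", "like", "ice", "cream", "allot"],
    ["i", "like", "i", "scream", "a", "lot"],
]
print(committed_prefix(beams))  # ['i', 'like'] -- "ice cream" vs "i scream" is still open
```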
E.g. do transcription every 3 seconds, but transcribe the most recent 15s of audio (or less if it's the beginning of the recording).
This would increase processing requirements significantly, though. You could probably get around some of that with clever use of caching, but I don't think any (open) implementation actually does that.
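A sketch of that sliding-window approach, with a placeholder transcribe() standing in for the actual model call (window and step sizes taken from the example above):

```python
# Sketch: every STEP seconds, re-transcribe the most recent WINDOW seconds of audio,
# replacing the not-yet-final tail of the transcript. transcribe() is a placeholder.
SAMPLE_RATE = 16_000
STEP, WINDOW = 3, 15  # seconds, per the example above

def transcribe(samples: list[float]) -> str:
    # Placeholder for a real Whisper call on the windowed audio.
    return f"<text for {len(samples) / SAMPLE_RATE:.1f}s of audio>"

def stream(chunks):
    """chunks: iterable of STEP-second sample blocks; yields the latest tail guess."""
    buffer: list[float] = []
    for chunk in chunks:
        buffer.extend(chunk)
        window = buffer[-WINDOW * SAMPLE_RATE:]  # last 15 s, or less at the start
        yield transcribe(window)                 # text before the window is treated as final
```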
https://tomwh.uk/git/whisper-chunk.git/
I need to get around to cleaning it up, but you can essentially alter the number of simultaneous overlapping whisper processes, the chunk length, and the chunk overlap fraction. I found that with multiple simultaneous listeners, the `tiny.en` model is good enough for highly accurate live English transcription with 2-3s latency on a mid-range modern consumer CPU.
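Roughly, the chunk scheduling those parameters imply looks like this (not the repo's actual code, just the arithmetic):

```python
# Sketch of overlapping-chunk scheduling: with chunk_len seconds per chunk and an
# overlap fraction, consecutive chunks start every chunk_len * (1 - overlap) seconds.
# Not taken from the repo above; just the arithmetic its parameters imply.

def chunk_spans(total_len: float, chunk_len: float, overlap: float):
    step = chunk_len * (1.0 - overlap)
    t = 0.0
    while t < total_len:
        yield (t, min(t + chunk_len, total_len))
        t += step

print(list(chunk_spans(10.0, 4.0, 0.5)))
# [(0.0, 4.0), (2.0, 6.0), (4.0, 8.0), (6.0, 10.0), (8.0, 10.0)]
```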
Unfortunately, you're only getting attention in 3 second chunks.
I find that in languages I don't speak well, my ability to understand degrades much more quickly as the audio quality goes down. But in my native language, even with piss poor audio quality, my brain fills in the garbled words with its prior expectation of what those words should be, based on context.
Fortunately, I think in English, and it's an ever-evolving language, expanding as the world does. That's compared to the majority of people where I'm from: English was a second language they had to learn, and the people who taught them weren't well equipped with the resources to do a good job.
│
└── Dey well; Be well
I'm not familiar with Whisper in particular, but typically what happens in an ASR model is that the decoder, speaking loosely, sees "the future" (i.e. the audio after the chunk it's trying to decode) in a sentence like this, and also has the benefit of a language model guiding its decoding so that grammatical productions like "I like ice cream" are favored over "I like I scream".
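Loosely, the "language model guiding the decoding" part comes down to combining scores; a toy example with made-up numbers, just to show why the grammatical reading wins:

```python
# Toy shallow-fusion-style scoring: the acoustic scores for the two homophones are
# nearly tied, but the language-model score breaks the tie. All numbers are made up.
candidates = {
    "i like ice cream": {"acoustic": -4.1, "lm": -6.0},
    "i like i scream":  {"acoustic": -4.0, "lm": -12.5},
}
LM_WEIGHT = 0.5

best = max(candidates, key=lambda s: candidates[s]["acoustic"] + LM_WEIGHT * candidates[s]["lm"])
print(best)  # -> "i like ice cream"
```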
Unless it was trained end-to-end on Dutch-subtitled English audio? That might make the translation a somewhat inextricable part of the model. Does anyone know?
I've seen professionally produced recordings on dry and technical subjects with good sound quality where they've decided to use distracting subtitles with no way to disable them.
It seems so unnecessary if you're not making novelty videos about cats.
Also, local transcription allows for automatic translation, and again, overlaying subtitles on top of an existing burnt-in set is a really poor reading experience.
I don't understand why the problem seems so pervasive (I've seen it on Netflix, Viki, and Apple TV, at least) and so transient.
https://kyutai.org/next/stt is natively streaming STT.
I own a couple of very old and, as far as I'm aware, never-translated Japanese movies. I don't speak Japanese but I'd love to watch them.
A couple of years ago I was negotiating with a guy on Fiverr to translate them. At his usual rate per minute of footage it would have cost thousands of dollars, but I'd negotiated him down to a couple hundred before he presumably got sick of me and ghosted me.
It's decent for classification but poor at transcription.
It also doesn't understand context, so it makes a lot of the errors you see in automatic translations of YouTube videos, for example.
│
└── Dey well; Be well