I must say, speaker diarization is surprisingly tricky to do. The most common approach seems to be to use pyannote, but the quality is not amazing...
yt-dlp --write-auto-subs --skip-download "https://www.youtube.com/watch?v=7xTGNNLPyMI"
(with that said, I do not want to diminish OP's work in any way; great job! "What I cannot build, I do not understand" - Feynman)
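The `--write-auto-subs` flag above saves auto-generated captions as a WebVTT file, and YouTube's rolling captions repeat lines across cues, so the raw file needs a dedupe pass before it reads like a transcript. A minimal sketch of that cleanup (the helper and sample are mine, assuming standard WebVTT layout):

```python
import re

def vtt_to_text(vtt: str) -> str:
    """Collapse a WebVTT auto-caption file into deduplicated plain text."""
    lines = []
    for line in vtt.splitlines():
        line = line.strip()
        # Skip the header, cue timing lines, cue numbers, and blanks.
        if not line or line == "WEBVTT" or "-->" in line or line.isdigit():
            continue
        # Strip inline word-timing tags like <00:00:01.000><c>word</c>.
        line = re.sub(r"<[^>]+>", "", line).strip()
        # Rolling captions repeat the previous line; keep only new ones.
        if line and (not lines or lines[-1] != line):
            lines.append(line)
    return " ".join(lines)

sample = """WEBVTT

00:00:00.000 --> 00:00:02.000
hello everyone

00:00:02.000 --> 00:00:04.000
hello everyone
welcome to the channel
"""
print(vtt_to_text(sample))  # → hello everyone welcome to the channel
```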
I ended up doing the same as this person: downloading the MP4s and then transcribing them myself. I assumed it was some sort of anti-LLM-scraper measure they had put in place.
Has anyone used this --write-auto-subs flag and not been flagged after doing 20 or so videos?
My startup relies on YouTube transcriptions, so we just subscribe to a YouTube transcript API hosted on RapidAPI that downloads subtitles. $1 per 1,000 requests. Pretty cheap.
I wrote this webapp that uses this method: it calls Gemini in the background to polish the raw transcript and produce a much better version with punctuation and paragraphs.
https://www.appblit.com/scribe
Open source with code to see how to fetch from YouTube servers from the browser https://ldenoue.github.io/readabletranscripts/
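Polishing a long raw transcript with an LLM usually means chunking it first, since even long-context models do better on bounded pieces with overlap for continuity. A sketch of one way to do the chunking (the sizes and the helper are my assumption, not necessarily how the app above works):

```python
def chunk_transcript(text: str, max_chars: int = 6000, overlap: int = 200):
    """Split a raw transcript into overlapping chunks on word boundaries."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Back up to the nearest space so we don't cut a word in half.
            space = text.rfind(" ", start, end)
            if space > start:
                end = space
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break
        # Overlap the next chunk so the LLM sees the seam context twice.
        start = max(end - overlap, start + 1)
    return chunks
```

Each chunk then goes to the model with an instruction along the lines of "add punctuation and paragraph breaks, change nothing else," and the polished chunks are concatenated back together.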
systemctl start tor
yt-dlp --proxy socks5://127.0.0.1:9050 --write-subs --write-auto-subs --skip-download [URL]
See: https://github.com/noobpk/auto-change-tor-ip

I have been tackling this while building VideoToBe.com. My current pipeline is Download Video -> Whisper transcription with diarization -> Replace speaker tags with AI-generated speaker IDs, with a human fallback.
Reliable ML speaker identification is still surprisingly hard. For podcast summarization, speaker ID is a game-changer vs basic YT transcripts.
(I'm using it in https://butter.sonnet.io)
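For the "replace speaker tags" step, one common trick is to label each ASR segment with whichever diarization turn overlaps it the most. A rough sketch, assuming segments and turns come in as (start, end) pairs in seconds (a toy representation of mine, not VideoToBe's actual code):

```python
def label_segments(segments, turns):
    """Attach a speaker label to each transcript segment.

    segments: list of (start, end, text) tuples from the ASR model
    turns:    list of (start, end, speaker) tuples from the diarizer
    Each segment gets the speaker whose turn overlaps it the most.
    """
    labeled = []
    for seg_start, seg_end, text in segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for turn_start, turn_end, speaker in turns:
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((best_speaker, text))
    return labeled

segments = [(0.0, 4.0, "Hi, welcome back."), (4.0, 9.0, "Thanks for having me.")]
turns = [(0.0, 4.2, "SPEAKER_00"), (4.2, 9.0, "SPEAKER_01")]
print(label_segments(segments, turns))
```

The "AI-generated speaker ID" pass then just maps labels like SPEAKER_00 to real names, which is where the human fallback earns its keep.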
And unlike your one-off tool's future support prospects, thousands of users make sure yt-dlp keeps working as Google keeps changing the site (currently 1,459 contributors).
YouTube also blocks transcript exports for some services, like https://youtubetranscript.com/

Retranscribing is a necessary and important part of the creator toolset.
- This python one is more amenable to modding into your own custom tool: https://hw.leftium.com/#/item/44353447
- Another bash script: https://hw.leftium.com/#/item/41473379
---
They all seem to be built on top of:
- yt-dlp to download video
- whisper for transcription
- ffmpeg for audio/video extraction/processing
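A minimal glue script over those three tools mostly just builds command lines. The binaries, formats, and model name below are placeholders, and whether you need a separate ffmpeg step depends on whether you ask yt-dlp for audio directly (it can invoke ffmpeg itself):

```python
import shlex

def yt_dlp_cmd(url: str, out: str = "audio.%(ext)s"):
    # Ask yt-dlp for bestaudio only and extract to mp3 (uses ffmpeg internally).
    return ["yt-dlp", "-f", "bestaudio", "-x", "--audio-format", "mp3",
            "-o", out, url]

def ffmpeg_cmd(src: str, dst: str = "audio16k.wav"):
    # Whisper resamples internally, but 16 kHz mono keeps files small.
    return ["ffmpeg", "-i", src, "-ar", "16000", "-ac", "1", dst]

def whisper_cmd(audio: str, model: str = "medium"):
    return ["whisper", audio, "--model", model]

cmd = yt_dlp_cmd("https://www.youtube.com/watch?v=1Q-5eIBfBDQ")
print(shlex.join(cmd))
```

Piping these through `subprocess.run(cmd, check=True)` in sequence is essentially the whole pipeline; everything after that is transcript post-processing.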
https://github.com/Dicklesworthstone/bulk_transcribe_youtube...
I ended up turning it into a beefed-up version that produces polished written documents from the raw transcript; you can try it at
https://en.m.wikipedia.org/wiki/Specht_v._Netscape_Communica...
“We also note that in order to be guilty of accessing ‘without authorization, or in excess of authorization’ under New Jersey law, the Government needed to prove that Auernheimer or Spitler circumvented a code- or password-based barrier to access... The account slurper simply accessed the publicly facing portion of the login screen and scraped information that AT&T unintentionally published.”
https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2
For Apple Silicon (MLX) https://huggingface.co/senstella/parakeet-tdt-0.6b-v2-mlx
I'd be really curious to see some sort of benchmark / evaluation of these context resources against the same coding tasks. Right now, the instructions all sound so prescriptive and authoritative, yet it is really hard to evaluate their effectiveness.
Fetching video metadata...
Downloading from YouTube...
Generating transcript using medium model...
=== System Information ===
CPU Cores: 10
CPU Threads: 10
Memory: 15.8GB
PyTorch version: 2.7.1+cpu
PyTorch CUDA available: False
MPS available: False
MPS built: False
Falling back to CPU only
Model stored in: /home/app/.cache/whisper
Loading medium model into CPU...
100%|██████████| 1.42G/1.42G [02:05<00:00, 12.2MiB/s]
Model loaded, transcribing...
Model size: 1457.2MB
Transcription completed in 468.70 seconds
=== Video Metadata ===
Title: 厨师长教你:“酱油炒饭”的家常做法,里面满满的小技巧,包你学会炒饭的最香做法,粒粒分明! ("The chef teaches you the home-style way to make soy sauce fried rice, packed with little tips; you're guaranteed to learn the most fragrant fried rice, every grain distinct!")
Channel: Chef Wang 美食作家王刚
Upload Date: 20190918
Duration: 5:41
URL: https://www.youtube.com/watch?v=1Q-5eIBfBDQ
=== Transcript ===
哈喽大家好我是王刚本期视频我跟大家分享... ("Hello everyone, I'm Wang Gang. In this video I'll share with you...")

Patient: “Doctor, it hurts when I do this.”
Doctor: “don’t do that”
Doctor: do this
Patient: I tried doing this and it's not good
Doctor: actually you need a device for $5000 lol
Uses yt-dlp, whisper, and an LLM (Gemini is hardcoded because it handles long contexts well, but it's easy to switch) as the summarizer.
I dislike podcasts as a format (the signal-to-noise ratio is way too low for my taste), so I use this whenever I want a tl;dr of an episode.
I should check out the SOTA models and improve the summarization prompt, but I'm not in a hurry, as this works pretty well for my needs already.
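The hardcoded-but-easy-to-switch part is simple to keep honest: inject the LLM call as a plain callable and do map-reduce summarization over chunks. A sketch with a stubbed-out model call (the prompt wording and function names are mine, not from the script above):

```python
def summarize_transcript(transcript: str, llm, chunk_chars: int = 8000) -> str:
    """Map-reduce summarization: summarize chunks, then summarize the summaries.

    llm: any callable prompt -> completion (Gemini, or whatever you switch to).
    """
    chunks = [transcript[i:i + chunk_chars]
              for i in range(0, len(transcript), chunk_chars)]
    partials = [llm(f"Summarize this podcast excerpt:\n\n{c}") for c in chunks]
    if len(partials) == 1:
        return partials[0]
    joined = "\n".join(partials)
    return llm(f"Combine these partial summaries into one tl;dr:\n\n{joined}")

# Stub model for demonstration; swap in a real client here.
fake_llm = lambda prompt: f"[summary of {len(prompt)} chars]"
print(summarize_transcript("blah " * 5000, fake_llm))
```

Because the model is just a parameter, swapping Gemini for a local model is a one-line change at the call site.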
Bluestein•6mo ago
And, yes, indeed, AI coding is having an order-of-magnitude effect along the lines of what "low-code" was treading toward ...
... also, for less-capable or "borderline" coders, the effort/benefit equation has radically shifted.
sannysanoff•6mo ago
https://old.reddit.com/r/ChatGPTCoding/comments/1lusr07/self...
Gonna be lots of posts of selfware like that soon.