There is also prob a way to send a smaller sampler of audio at diff speeds and compare them to get a speed optimization with no quality loss unique for each clip.
Nice. Any blog post, twitter comment or anything pointing to that?
And if someone had this idea and pitched it to Claude (the model this project was vibe coded with) it would be like "what a great idea!"
I read a transcript + summary of that exact talk. I thought it was fine, but uninteresting, I moved on.
Later I saw it had been put on youtube and I was on the train, so I watched the whole thing at normal speed. I had a huge number of different ideas, thoughts and decisions, sparked by watching the whole thing.
This happens to me in other areas too. Watching a conference talk in person is far more useful to me than watching it online with other distractions. Watching it online is more useful again than reading a summary.
Going for a walk to think about something deeply beats a 10 minute session to "solve" the problem and forget it.
Slower is usually better for thinking.
Reading is a pleasure. Watching a lecture or a talk and feeling the pieces fall into place is great. Having your brain work out the meaning of things is surely something that defines us as a species. We're willingly heading for such stupidity, I don't get it. I don't get how we can all be so blind at what this is going to create.
Your doomerism and superiority doesn't follow from your initial "I like many hackers don't like one size fits all".
This is literally offering you MANY sizes and you have the freedom to choose. Somehow you're pretending pushed down uniformity.
Consume it however you want and come up with actual criticisms next time?
Audiobooks before speed tools were the worst (are they trying to speak extra slow?) But when I can speed things up comprehension is just fine.
"This specific knowledge format doesnt work for me, so I'm asking OpenAI to convert this knowledge into a format that is easier for me to digest" is exactly what this is about.
I'm not quite sure what you're upset about? Unless you're referring to "one size fits all knowledge" as simplified topics, so you can tackle things at a surface level? I love having surface level knowledge about a LOT of things. I certainly don't have time to have go deep on every topic out there. But if this is a topic I find I am interested in, the full talk is still available.
Breadth and depth are both important, and well summarized talks are important for breadth, but not helpful at all for depth, and that's ok.
There is just so much content out there. And context is everything. If the person sharing it had led with some specific ideas or thoughts I might have taken the time to watch and looked for those ideas. But in the context it was received—a quick link with no additional context—I really just wanted the "gist" to know what I was even potentially responding to.
In this case, for me, it was worth it. I can go back and decide if I want to watch it. Your comment has intrigued me so I very well might!
++ to "Slower is usually better for thinking"
By understanding the outline and themes of a book (or lecture, I suppose), it makes it easier to piece together thoughts as you delve deeper into the full content.
Yeah, I see people talking about listening to podcasts or audiobooks on 2x or 3x.
Sometimes I set mine to 0.8x. I find you get time to absorb and think. Am I an outlier?
also means the longer you talk, the more you pay even if the actual info density is the same. so if your voice has longer pauses or you speak slow, you maybe subsidizing inefficiency.
makes me think maybe the next big compression is in delivery cadence. just auto-optimize voice tone and pacing before sending it to LLM. feed it synthetic fast speech with no emotion, just high density words. you lose human warmth but gain 40% cost savings
> I don’t know—I didn’t watch it, lol. That was the whole point. And if that answer makes you uncomfortable, buckle-up for this future we're hurtling toward. Boy, howdy.
This is a great bit of work, and the author accurately summarizes my discomfort
This kind of transformation has always come with flaws, and I think that will continue to be expected implicitly. Far more worrying is the public's trust in _interpretations_ and claims of _fact_ produced by gen AI services, or at least the popular idea that "AI" is more trustworthy/unbiased than humans, journalists, experts, etc.
In the idea of making more of an OpenAI minute, don't send it any silence.
E.g.
ffmpeg -i video-audio.m4a \
-af "silenceremove=start_periods=1:start_duration=0:start_threshold=-50dB:\
stop_periods=-1:stop_duration=0.02:stop_threshold=-50dB,\
apad=pad_dur=0.02" \
-c:a aac -b:a 128k output_minpause.m4a -y
will cut the talk down from 39m31s to 31m34s, by replacing any silence (with a -50dB threshold) longer than 20ms by a 20ms pause. And to keep with the spirit of your post, I measured only that the input file got shorter, I didn't look at all at the quality of the transcription by feeding it the shorter version.Good god. You couldn't make that any more convoluted and hard-to-grasp if you wanted to. You gotta love ffmpeg!
I now think this might be a good solution:
ffmpeg -i video-audio.m4a \
-af "silenceremove=start_periods=1:stop_periods=-1:stop_duration=0.15:stop_threshold=-40dB:detection=rms" \
-c:a aac -b:a 128k output.m4a -y
I wonder if there's a way to automatically detect how "fast" a person talks in an audio file. I know it's subjective and different people talk at different paces in an audio, but it'd be cool to kinda know when OP's trick fails (they mention x4 ruined the output; maybe for karpathy that would happen at x2).
Stupid heuristic: take a segment of video, transcribe text, count number of words per utterance duration. If you need speaker diarization, handle speaker utterance durations independently. You can further slice, such as syllable count, etc.
Apparently human language conveys information at around 39 bits/s. You could use a similar technique as that paper to determine the information rate of a speaker and then correct it to 39 bits/s by changing the speed of the video.
I wonder if there is negative side effects of this though, do you notice when interacting with people who speak slower require a greater deal of patience?
javascript:void%20function(){document.querySelector(%22video,audio%22).playbackRate=parseFloat(prompt(%22Set%20the%20playback rate%22))}();
Could use an “auctioneer” voice to playback text at 10x speed.
I understand 4-6x speakers fairly well but don't enjoy listening at that pace. If I lose focus for a couple of seconds I effectively miss a paragraph of context and my brain can't fill in the missing details.
Transcribe it locally using whisper and output tokens/sec?
Hilbert transform and FFT to get phoneme rate would work.
guys how hard is it to toss both versions into like diffchecker or something haha youre just comparing text
ffmpeg \ -f lavfi \ -i color=c=black:s=1920x1080:r=5 \ -i file_you_want_transcripted.wav \ -c:v libx264 \ -preset medium \ -tune stillimage \ -crf 28 \ -c:a aac \ -b:a 192k \ -pix_fmt yuv420p \ -shortest \ file_you_upload_to_youtube_for_free_transcripts.mp4
This works VERY well for my needs.
I'm confused because I read in various places that the YouTube API doesn't provide access to transcripts ... so how do all these YouTube transcript extractor services do it?
I want to build my own YouTube summarizer app. Any advice and info on this topic greatly appreciated!
https://github.com/jdepoix/youtube-transcript-api
For our internal tool that transcribes local city council meetings on YouTube (often 1-3 hours long), we found that these automatic ones were never available though.
(Our tool usually 'processes' the videos within ~5-30 mins of being uploaded, so that's also why none are probably available 'officially' yet.)
So we use yt-dlp to download the highest quality audio and then process them with whisper via Groq, which is way cheaper (~$0.02-0.04/hr with Groq compared to $0.36/hr via OpenAI's API.) Sometimes groq errors out so there's built-in support for Replicate and Deepgram as well.
We run yt-dlp on our remote Linode server and I have a Python script I created that will automatically login to YouTube with a "clean" account and extract the proper cookies.txt file, and we also generate a 'po token' using another tool:
https://github.com/iv-org/youtube-trusted-session-generator
Both cookies.txt and the "po token" get passed to yt-dlp when running on the Linode server and I haven't had to re-generate anything in over a month. Runs smoothly every day.
(Note that I don't use cookies/po_token when running locally at home, it usually works fine there.)
It's frustrating to have to jump through all these hoops just to extract transcripts when the YouTube Data API already gives reasonable limits to free API calls ... would be nice if they allowed transcripts too.
Do you think the various YouTube transcript extractor services all follow a similar method as yours?
With faster-whisper (int8, batch=8) you can transcripe 13 minutes of audio in 51 seconds on CPU.
Groq is ~$0.02/hr with distil-large-v3, or ~$0.04/hr with whisper-large-v3-turbo. I believe OpenAI comes out to like ~$0.36/hr.
We do this internally with our tool that automatically transcribes local government council meetings right when they get uploaded to YouTube. It uses Groq by default, but I also added support for Replicate and Deepgram as backups because sometimes Groq errors out.
> We do this internally with our tool that automatically transcribes local government council meetings right when they get uploaded to YouTube
Doesn't YouTube do this for you automatically these days within a day or so?
Oh yeah, we do a check first and use youtube-transcript-api if there's an automatic one available:
https://github.com/jdepoix/youtube-transcript-api
The tool usually detects them within like ~5 mins of being uploaded though, so usually none are available yet. Then it'll send the summaries to our internal Slack channel for our editors, in case there's anything interesting to 'follow up on' from the meeting.
Probably would be a good idea to add a delay to it and wait for the automatic ones though :)
At this point you'll need to at least check how much running ffmpeg costs. Probably less than $0.01 per hour of audio (approximate savings) but still.
Last time I checked, I think the Google auto-captions were noticeably worse quality than whisper, but maybe that has changed.
https://developers.cloudflare.com/workers-ai/models/whisper-...
I don't think a simple diff is the way to go, at least for what I'm interested in. What I care about more is the overall accuracy of the summary—not the word-for-word transcription.
The test I want to setup is using LLMs to evaluate the summarized output and see if the primary themes/topics persist. That's more interesting and useful to me for this exercise.
The last thing in the world I want to do is listen or watch presidential social media posts, but, on the other hand, sometimes enormously stupid things are said which move the SP500 up or down $60 in a session. So this feature queries for new posts every minute, does ORC image to text and transcribe video audio to text locally, sends the post with text for analysis, all in the background inside a Chrome extension before notify me of anything economically significant.
[0] https://github.com/huggingface/transformers.js/tree/main/exa...
Speed your audio up 2–3× with ffmpeg before sending it to OpenAI’s gpt-4o-transcribe: the shorter file uses fewer input-tokens, cuts costs by roughly a third, and processes faster with little quality loss (4× is too fast). A sample yt-dlp → ffmpeg → curl script shows the workflow.
;)
I have been thinking for a while how do you make good use of the short space in those places.
LLM did well here.
(Thanks for your good sense of humor)
It's not my intention to bloat information or delivery but I also don't super know how to follow this format especially in this kind of talk. Because it's not so much about relaying specific information (like your final script here), but more as a collection of prompts back to the audience as things to think about.
My companion tweet to this video on X had a brief TLDR/Summary included where I tried, but I didn't super think it was very reflective of the talk, it was more about topics covered.
Anyway, I am overall a big fan of doing more compute at the "creation time" to compress other people's time during "consumption time" and I think it's the respectful and kind thing to do.
LLMs as the operating system, the way you interface with vibe-coding (smaller chunks) and the idea that maybe we haven't found the "GUI for AI" yet are all things I've pondered and discussed with people. You articulated them well.
I think some formats, like a talk, don't lend themselves easily to meaningful summaries. It's about giving the audience things to think about, to your point. It's the sum of storytelling that's more than the whole and why we still do it.
My post is, at the end of the day, really more about a neat trick to optimize transcriptions. This particular video might be a great example of why you may not always want to do that :)
Anyway, thanks for the time and thanks for the talk!
Love this! I wish more authors follow this approach. So many articles keep going all over the place before 'the point' appears.
If trying, perhaps some 50% of the authors may realize that they don't _have_ a point.
georgemandis•7h ago
Felt like a fun trick worth sharing. There’s a full script and cost breakdown.
bravesoul2•5h ago
behnamoh•4h ago
4b11b4•4h ago
hn8726•3h ago
ilyakaminsky•1h ago
[1] https://speechischeap.com