OpenAI Charges by the Minute, So Make the Minutes Shorter

https://george.mand.is/2025/06/openai-charges-by-the-minute-so-make-the-minutes-shorter/

213•georgemandis•4h ago

Comments

georgemandis•4h ago

I was trying to summarize a 40-minute talk with OpenAI’s transcription API, but it was too long. So I sped it up with ffmpeg to fit within the 25-minute cap. It worked quite well (Up to 3x speeds) and was cheaper and faster, so I wrote about it.

Felt like a fun trick worth sharing. There’s a full script and cost breakdown.

bravesoul2•2h ago

You could have kept quiet and started a cheaper than openai transcription business :)

behnamoh•1h ago

Sure, but now the world is a better place because he shared something useful!

4b11b4•1h ago

Pre-processing of the audio still a valid biz, multiple types of pre-processing might be valid

ada1981•2h ago

We discovered this last month.

There is also prob a way to send a smaller sampler of audio at diff speeds and compare them to get a speed optimization with no quality loss unique for each clip.

moralestapia•2h ago

>We discovered this last month.

Nice. Any blog post, twitter comment or anything pointing to that?

babuloseo•1h ago

source?

brendanfinan•2h ago

would this also work for my video consisting of 10,000 PDFs?

https://news.ycombinator.com/item?id=44125598

jasonjmcghee•2h ago

I can't tell if this is a meme or not.

And if someone had this idea and pitched it to Claude (the model this project was vibe coded with) it would be like "what a great idea!"

mcc1ane•2h ago

Longer*

simonw•2h ago

There was a similar trick which worked with Gemini versions prior to Gemini 2.0: they charged a flat rate of 258 tokens for an image, and it turns out you could fit more than 258 tokens of text in an image of text and use that for a discount!

heeton•2h ago

A point on skimming vs taking the time to read something properly.

I read a transcript + summary of that exact talk. I thought it was fine, but uninteresting, I moved on.

Later I saw it had been put on youtube and I was on the train, so I watched the whole thing at normal speed. I had a huge number of different ideas, thoughts and decisions, sparked by watching the whole thing.

This happens to me in other areas too. Watching a conference talk in person is far more useful to me than watching it online with other distractions. Watching it online is more useful again than reading a summary.

Going for a walk to think about something deeply beats a 10 minute session to "solve" the problem and forget it.

Slower is usually better for thinking.

pluc•2h ago

Seriously this is bonkers to me. I, like many hackers, hated school because they just threw one-size-fits-all knowledge at you and here we are, paying for the privilege to have that in every facet of our lives.

Reading is a pleasure. Watching a lecture or a talk and feeling the pieces fall into place is great. Having your brain work out the meaning of things is surely something that defines us as a species. We're willingly heading for such stupidity, I don't get it. I don't get how we can all be so blind at what this is going to create.

hooverd•1h ago

If you're not listening to summaries of different audiobooks at 2x speed in each ear you're not contentmaxing.

isaacremuant•35m ago

> We're willingly heading for such stupidity, I don't get it. I don't get how we can all be so blind at what this is going to create.

Your doomerism and superiority doesn't follow from your initial "I like many hackers don't like one size fits all".

This is literally offering you MANY sizes and you have the freedom to choose. Somehow you're pretending pushed down uniformity.

Consume it however you want and come up with actual criticisms next time?

georgemandis•1h ago

For what it's worth, I completely agree with you, for all the reasons you're saying. With talks in particular I think it's seldom about the raw content and ideas presented and more about the ancillary ideas they provoke and inspire, like you're describing.

There is just so much content out there. And context is everything. If the person sharing it had led with some specific ideas or thoughts I might have taken the time to watch and looked for those ideas. But in the context it was received—a quick link with no additional context—I really just wanted the "gist" to know what I was even potentially responding to.

In this case, for me, it was worth it. I can go back and decide if I want to watch it. Your comment has intrigued me so I very well might!

++ to "Slower is usually better for thinking"

mutagen•53m ago

Not to discount slower speeds for thinking but I wonder if there is also value in dipping into a talk or a subject and then revisiting (re-watching) with the time to ponder on the thoughts a little more deeply.

tass•40m ago

This is similar to strategies in “how to read a book” (Adler).

By understanding the outline and themes of a book (or lecture, I suppose), it makes it easier to piece together thoughts as you delve deeper into the full content.

conradev•44m ago

Was it the speed or the additional information vended by the audio and video? If someone is a compelling speaker, the same message will be way more effective in an audiovisual format. The audio has emphasis on certain parts of the content, for example, which is missing from the transcript or summary entirely. Video has gestural and facial cues, also often utilized to make a point.

bongodongobob•7m ago

You'd love where I work. Everything is needlessly long bloviating power point meetings that could easily be ingested in a 5 minute email.

b0a04gl•2h ago

it's still decoding every frame and matching phonemes either way, but speeding it up reduces how many seconds they bill you for. so you may hack their billing logic more than the model itself.

also means the longer you talk, the more you pay even if the actual info density is the same. so if your voice has longer pauses or you speak slow, you maybe subsidizing inefficiency.

makes me think maybe the next big compression is in delivery cadence. just auto-optimize voice tone and pacing before sending it to LLM. feed it synthetic fast speech with no emotion, just high density words. you lose human warmth but gain 40% cost savings

timerol•2h ago

> Is It Accurate?

> I don’t know—I didn’t watch it, lol. That was the whole point. And if that answer makes you uncomfortable, buckle-up for this future we're hurtling toward. Boy, howdy.

This is a great bit of work, and the author accurately summarizes my discomfort

jasonjmcghee•2h ago

Heads up, the token cost breakdown tables look white on white to me. I'm in dark mode on iOS using Brave.

georgemandis•1h ago

Should be fixed now. Thank you!

w-m•2h ago

With transcribing a talk by Andrej, you already picked the most challenging case possible, speed-wise. His natural talking speed is already >=1.5x that of a normal human. One of the people you absolutely have to set your YouTube speed back down to 1x when listening to follow what's going on.

In the idea of making more of an OpenAI minute, don't send it any silence.

E.g.

    ffmpeg -i video-audio.m4a \
      -af "silenceremove=start_periods=1:start_duration=0:start_threshold=-50dB:\
                         stop_periods=-1:stop_duration=0.02:stop_threshold=-50dB,\
                         apad=pad_dur=0.02" \
      -c:a aac -b:a 128k output_minpause.m4a -y

will cut the talk down from 39m31s to 31m34s, by replacing any silence (with a -50dB threshold) longer than 20ms by a 20ms pause. And to keep with the spirit of your post, I measured only that the input file got shorter, I didn't look at all at the quality of the transcription by feeding it the shorter version.

georgemandis•1h ago

Oooh fun! I had a feeling there was more ffmpeg wizardry I could be leaning into here. I'll have to try this later—thanks for the idea!

w-m•1h ago

In the meantime I realized that the apad part is nonsensical - it pads the end of the stream, not at each silence-removed cut. I wanted to get angry at o3 for proposing this, but then I had a look at the silenceremove= documentation myself: https://ffmpeg.org/ffmpeg-filters.html#silenceremove

Good god. You couldn't make that any more convoluted and hard-to-grasp if you wanted to. You gotta love ffmpeg!

I now think this might be a good solution:

    ffmpeg -i video-audio.m4a \
           -af "silenceremove=start_periods=1:stop_periods=-1:stop_duration=0.15:stop_threshold=-40dB:detection=rms" \
           -c:a aac -b:a 128k output.m4a -y

snickerdoodle12•31m ago

I love ffmpeg but the documentation is often close to incomprehensible.

pragmatic•1h ago

No not really? The talk where he babbles about OSes and everyone is somehow impressed?

behnamoh•1h ago

> His natural talking speed is already >=1.5x that of a normal human. One of the people you absolutely have to set your YouTube speed back down to 1x when listening to follow what's going on.

I wonder if there's a way to automatically detect how "fast" a person talks in an audio file. I know it's subjective and different people talk at different paces in an audio, but it'd be cool to kinda know when OP's trick fails (they mention x4 ruined the output; maybe for karpathy that would happen at x2).

echelon•1h ago

> I wonder if there's a way to automatically detect how "fast" a person talks in an audio file.

Stupid heuristic: take a segment of video, transcribe text, count number of words per utterance duration. If you need speaker diarization, handle speaker utterance durations independently. You can further slice, such as syllable count, etc.

nand4011•54m ago

https://www.science.org/doi/10.1126/sciadv.aaw2594

Apparently human language conveys information at around 39 bits/s. You could use a similar technique as that paper to determine the information rate of a speaker and then correct it to 39 bits/s by changing the speed of the video.

varispeed•33m ago

It's a shame platforms don't generally support speeds greater than 2x. One of my "superpowers" or a curse is that I cannot stand normal speaking pace. When I watch lectures, I always go for maximum speed and that still is too slow for me. I wish platforms have included 4x but done properly (with minimal artefacts).

lofaszvanitt•29m ago

Robot in a human body identified :D.

mrmuagi•27m ago

All audiobooks are like this for me. I tried it for lectures but if I'm taking handwritten notes, I can't keep up my writing.

I wonder if there is negative side effects of this though, do you notice when interacting with people who speak slower require a greater deal of patience?

dpcx•21m ago

https://github.com/codebicycle/videospeed has been a wonderful addition for me.

btown•27m ago

Even a last-decade transcription model could be used to detect a rough number of syllables per unit time, and the accuracy of that model could be used to guide speed-up and dead-time detection before sending to a more expensive model. As with all things, it's a question of whether the cost savings justify the engineering work.

babuloseo•2h ago

I use the youtube trick, will share it here, but upload to youtube and use their built in transcription service to translate to text for you, and than use gemini pro 2.5 to rebuild the transcript.

ffmpeg \ -f lavfi \ -i color=c=black:s=1920x1080:r=5 \ -i file_you_want_transcripted.wav \ -c:v libx264 \ -preset medium \ -tune stillimage \ -crf 28 \ -c:a aac \ -b:a 192k \ -pix_fmt yuv420p \ -shortest \ file_you_upload_to_youtube_for_free_transcripts.mp4

This works VERY well for my needs.

KTibow•2h ago

This is really interesting, although the cheapest route is still to use an alternative audio-compatible LLM (Gemini 2.0 Flash Lite, Phi 4 Multimodal) or an alternative host for Whisper (Deepinfra, Fal).

fallinditch•1h ago

When extracting transcripts from YouTube videos, can anyone give advice on the best (cost effective, quick, accurate) way to do this?

I'm confused because I read in various places that the YouTube API doesn't provide access to transcripts ... so how do all these YouTube transcript extractor services do it?

I want to build my own YouTube summarizer app. Any advice and info on this topic greatly appreciated!

vjerancrnjak•1h ago

If YouTube placed autogenerated captions you can download them free of charge with yt-dlp.

rob•58m ago

There's a tool that uses YouTube's unofficial APIs to get them if they're available:

https://github.com/jdepoix/youtube-transcript-api

For our internal tool that transcribes local city council meetings on YouTube (often 1-3 hours long), we found that these automatic ones were never available though.

(Our tool usually 'processes' the videos within ~5-30 mins of being uploaded, so that's also why none are probably available 'officially' yet.)

So we use yt-dlp to download the highest quality audio and then process them with whisper via Groq, which is way cheaper (~$0.02-0.04/hr with Groq compared to $0.36/hr via OpenAI's API.) Sometimes groq errors out so there's built-in support for Replicate and Deepgram as well.

We run yt-dlp on our remote Linode server and I have a Python script I created that will automatically login to YouTube with a "clean" account and extract the proper cookies.txt file, and we also generate a 'po token' using another tool:

https://github.com/iv-org/youtube-trusted-session-generator

Both cookies.txt and the "po token" get passed to yt-dlp when running on the Linode server and I haven't had to re-generate anything in over a month. Runs smoothly every day.

(Note that I don't use cookies/po_token when running locally at home, it usually works fine there.)

fallinditch•5m ago

Very useful, thanks. So does this mean that every month or so you have to create a new 'clean' YouTube account and use that to create new po_token/cookies?

It's frustrating to have to jump through all these hoops just to extract transcripts when the YouTube Data API already gives reasonable limits to free API calls ... would be nice if they allowed transcripts too.

Do you think the various YouTube transcript extractor services all follow a similar method as yours?

topaz0•1h ago

I have a way that is (all but) free -- just watch the video if you care about it, or decide not to if you don't, and move on with your life.

Tepix•1h ago

Why would you give up your privacy by sending what interests you to OpenAI when whisper doesn't need that much computer in the first place?

With faster-whisper (int8, batch=8) you can transcripe 13 minutes of audio in 51 seconds on CPU.

anigbrowl•6m ago

[delayed]

pimlottc•1h ago

Appreciated the concise summary + code snippet upfront, followed by more detail and background for those interested. More articles should be written this way!

rob•1h ago

For anybody trying to do this in bulk, instead of using OpenAI's whisper via their API, you can also use Groq [0] which is much cheaper:

[0] https://groq.com/pricing/

Groq is ~$0.02/hr with distil-large-v3, or ~$0.04/hr with whisper-large-v3-turbo. I believe OpenAI comes out to like ~$0.36/hr.

We do this internally with our tool that automatically transcribes local government council meetings right when they get uploaded to YouTube. It uses Groq by default, but I also added support for Replicate and Deepgram as backups because sometimes Groq errors out.

georgemandis•55m ago

Interesting! At $0.02 to $0.04 an hour I don't suspect you've been hunting for optimizations, but I wonder if this "speed up the audio" trick would save you even more.

> We do this internally with our tool that automatically transcribes local government council meetings right when they get uploaded to YouTube

Doesn't YouTube do this for you automatically these days within a day or so?

rob•43m ago

> Doesn't YouTube do this for you automatically these days within a day or so?

Oh yeah, we do a check first and use youtube-transcript-api if there's an automatic one available:

https://github.com/jdepoix/youtube-transcript-api

The tool usually detects them within like ~5 mins of being uploaded though, so usually none are available yet. Then it'll send the summaries to our internal Slack channel for our editors, in case there's anything interesting to 'follow up on' from the meeting.

Probably would be a good idea to add a delay to it and wait for the automatic ones though :)

stogot•1h ago

Love this idea but the accuracy section is lacking. Couldnt you do a simple diff of the outputs and see how many differences there are? .5% or 5%?

georgemandis•58m ago

Yeah, I'd like to do a more formal analysis of the outputs if I can carve out the time.

I don't think a simple diff is the way to go, at least for what I'm interested in. What I care about more is the overall accuracy of the summary—not the word-for-word transcription.

The test I want to setup is using LLMs to evaluate the summarized output and see if the primary themes/topics persist. That's more interesting and useful to me for this exercise.

tmaly•46m ago

The whisper model weights are free. You could save even more by just using them locally.

55555•37m ago

This seems like a good place for me to complain about the fact that the automatically generated subtitle files Youtube creates are horribly malformed. Every sentence is repeated twice. In many subtitle files, the subtitle timestamp ranges overlap one another while also repeating every sentence twice in two different ranges. It's absolutely bizarre and has been like this for years or possibly forever. Here's an example - I apologize that it's not in English. I don't know if this issue affects English. https://pastebin.com/raw/LTBps80F

amelius•28m ago

Solution: charge by number of characters generated.

dataviz1000•19m ago

I built a Chrome extension with one feature that transcribes audio to text in the browser using huggingface/transformers.js running the OpenAI Whisper model with WebGPU. It works perfect! Here is a list of examples of all the things you can do in the browser with webgpu for free. [0]

The last thing in the world I want to do is listen or watch presidential social media posts, but, on the hand, sometimes enormously stupid things are said which move the SP500 up or down $60 in a session. So this feature queries for new posts every minute, does ORC image to text and transcribe video audio to text locally, sends the post with text for analysis, all in the background inside a Chrome extension before notify me of anything economically significant.

[0] https://github.com/huggingface/transformers.js/tree/main/exa...

[1] https://github.com/adam-s/doomberg-terminal

karpathy•7m ago

Omg long post. TLDR from an LLM for anyone interested

Speed your audio up 2–3× with ffmpeg before sending it to OpenAI’s gpt-4o-transcribe: the shorter file uses fewer input-tokens, cuts costs by roughly a third, and processes faster with little quality loss (4× is too fast). A sample yt-dlp → ffmpeg → curl script shows the workflow.

;)

Asynchronous Functional Programming – Handling HTTP

Cloudflare Sandboxes

LM Studio is now an MCP Host

Getty drops key copyright claims against Stability AI, but UK lawsuit continues

Sam Altman takes his 'io' trademark battle public

iPhone 17 Pro: A closer look at the new 'camera bar' design

DeepSpeech Is Discontinued

Will matcha's popularity be its downfall?

Build and Host AI-Powered Apps with Claude – No Deployment Needed

America's Prison Population Is in Serious Decline

India forcibly sterilised 8M men: One village remembers, 50 years later

The World is a Fractal – a mental model for navigating the depth of knowledge

I Stopped Worrying About Costs and Learned to Love Kubernetes

Cloudflare/actors library – SDK for Durable Objects in beta

Microsoft opens a free tier for Windows 10 extended updates

Phantoscopes, Radiovision, and the Dawn of TV

Inworld TTS: 20x cheaper, state-of-the-art, text-to-speech

Claude Code Is Magic

What Problems to Solve – By Richard Feynman

Public Bet

We cataloged 200 ways you're wasting money in the cloud

Why Do So Many Parents Think Kids Need Their Own Bedroom?

Show HN: Mind blow by NotebookLM generating podcast on LLM Sparsity

Karpathy: "context engineering" over "prompt engineering"

At Amazon's Biggest Data Center, Everything Is Supersized for A.I

Frequent-Flyer Miles as Shadow Currency: How Inflation Is Built-In

The Backtester's Edge: How Code and AI Transform Your Strategy Game

GitHub CEO: manual coding remains key despite AI boom

Apple's push to take over the dashboard resisted by car makers

Mysteries of Plant "Intelligence"

OpenAI Charges by the Minute, So Make the Minutes Shorter

Comments

Asynchronous Functional Programming – Handling HTTP

Cloudflare Sandboxes

LM Studio is now an MCP Host

Getty drops key copyright claims against Stability AI, but UK lawsuit continues

Sam Altman takes his 'io' trademark battle public

iPhone 17 Pro: A closer look at the new 'camera bar' design

DeepSpeech Is Discontinued

Will matcha's popularity be its downfall?

Build and Host AI-Powered Apps with Claude – No Deployment Needed

America's Prison Population Is in Serious Decline

India forcibly sterilised 8M men: One village remembers, 50 years later

The World is a Fractal – a mental model for navigating the depth of knowledge

I Stopped Worrying About Costs and Learned to Love Kubernetes

Cloudflare/actors library – SDK for Durable Objects in beta

Microsoft opens a free tier for Windows 10 extended updates

Phantoscopes, Radiovision, and the Dawn of TV

Inworld TTS: 20x cheaper, state-of-the-art, text-to-speech

Claude Code Is Magic

What Problems to Solve – By Richard Feynman

Public Bet

We cataloged 200 ways you're wasting money in the cloud

Why Do So Many Parents Think Kids Need Their Own Bedroom?

Show HN: Mind blow by NotebookLM generating podcast on LLM Sparsity

Karpathy: "context engineering" over "prompt engineering"

At Amazon's Biggest Data Center, Everything Is Supersized for A.I

Frequent-Flyer Miles as Shadow Currency: How Inflation Is Built-In

The Backtester's Edge: How Code and AI Transform Your Strategy Game

GitHub CEO: manual coding remains key despite AI boom

Apple's push to take over the dashboard resisted by car makers

Mysteries of Plant "Intelligence"