Motivation: I love podcasts, especially multi-hour ones that go deep on niche topics. One thing that puts me off some podcasts is having the flow interrupted, sometimes mid-sentence, by dynamically inserted ads. Last year this led me down a rabbit hole of experimenting with removing ads from the podcasts I listen to.
Experimentation: At first I tried using Whisper to generate a transcript of an episode, then feeding this to ChatGPT and asking it to find the ad timestamps. This worked surprisingly well. I had a proof of concept working, but I wanted something I could actually use on my phone.
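Once the LLM has named the ad timestamps, the remaining step is turning them into the audio ranges to keep. A minimal sketch of that post-processing, assuming the LLM's answer has already been parsed into (start_sec, end_sec) pairs (the function names and segment format here are illustrative, not my actual schema):

```python
# Sketch: turn LLM-reported ad spans into the audio ranges worth keeping.
# Assumes ad spans were already parsed into (start_sec, end_sec) tuples.

def merge_segments(segments):
    """Merge overlapping or touching ad spans into disjoint spans."""
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous span: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def keep_ranges(ad_segments, episode_len):
    """Invert ad spans into the non-ad ranges of the episode."""
    keep, cursor = [], 0.0
    for start, end in merge_segments(ad_segments):
        if start > cursor:
            keep.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < episode_len:
        keep.append((cursor, episode_len))
    return keep
```

For example, two overlapping ad reads at 2:00-3:00 and 2:50-3:20 in a one-hour episode collapse into a single cut: `keep_ranges([(120, 180), (170, 200)], 3600)` leaves `[(0.0, 120), (200, 3600)]` to splice back together.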
Productionization: Next I wondered: if I were to productionize my prototype, how would I do it? From my experiments, the main issue would be the volume of audio transcription required to satisfy a moderate-to-heavy podcast listener, which I estimated at ~100 hours of audio per user per month. In testing I used OpenAI-hosted Whisper, charged at $0.006/min. That sounds quite cheap, but what would it come to for 100 hours? $0.006/min -> $0.36/hour -> $36.00/100 hours. S%#t. $36/user/month is way too expensive. If you were to turn this into a business at that rate you'd probably need to charge at least $50/month. No one's going to pay that.
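The back-of-the-envelope math, as a quick sanity check:

```python
# Cost of hosted Whisper at $0.006 per minute of audio, for a listener
# getting through ~100 hours of podcasts per month (my estimate above).
PRICE_PER_MIN = 0.006            # OpenAI-hosted Whisper, USD per audio minute
hours_per_month = 100

cost_per_hour = PRICE_PER_MIN * 60
monthly_cost = cost_per_hour * hours_per_month
print(f"${cost_per_hour:.2f}/hour -> ${monthly_cost:.2f}/user/month")
# -> $0.36/hour -> $36.00/user/month
```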
What if we did everything on device? iOS 26 ships APIs for an on-device LLM and speech-to-text. I got the podcast audio -> transcript -> LLM-detected ad segments pipeline working. Excellent! The next problem is that an iPhone is not a data-center-grade GPU. The pipeline was significantly slower than my first attempt: before, it would take <= ~4 mins, while the iPhone pipeline could take up to 10 minutes for multi-hour podcasts. The on-device approach would be too slow for a good UX. Not to mention that every test run of the iPhone pipeline made my phone really hot and drained the battery.
Back to square one. The only other approach (at least that I could think of) would be to manage the transcription infra myself. Given this is just a side project, I wanted simple infra. Ideally I would have used something like AWS Lambda with GPUs (which does not exist, I checked). My research showed GCP offers serverless Cloud Run with a GPU option. Now we were starting to cook. I built a spike on GCP and had the ad detection working. Just as I was getting excited, a load test on Cloud Run revealed a new problem.
GPUs are in hot demand. Who knew? GCP is (or at least was) limiting the number of GPUs per customer. My account was only allotted ~3 GPUs (I tried raising a support ticket for a higher limit, but no luck). This was a huge bottleneck: transcribing an episode saturates one GPU, so 3 GPUs means a pitiful 3 episodes being transcribed concurrently :(
Further into the rabbit hole, my research led me to Runpod, which offers serverless GPUs. The low-end GPUs go for ~$0.50/hour (that's an hour of GPU time, not of audio transcribed), depending on the GPU used. With more reliable access to enough GPUs, I could run a load test again. It worked out to ~$0.02/hour, or ~$2.00/100 hours of audio transcribed. At $2 per user per month this looks a lot more reasonable: a 94% decrease compared to the $36 for the OpenAI API. To be fair to OpenAI, the transcripts I got from their API were more accurate. When tuning the Runpod implementation I was optimizing for speed and low cost, and I found that a slightly less accurate transcript didn't matter much when getting the LLM to pick out the ad segments, so trading accuracy for speed + cost made sense here.
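Double-checking the comparison (the constants are my approximate load-test figures, so treat the exact numbers as rough):

```python
# Rough per-user monthly cost comparison at ~100 hours of audio/month.
AUDIO_HOURS = 100
openai_cost = 0.006 * 60 * AUDIO_HOURS   # hosted Whisper at $0.006/min
runpod_cost = 0.02 * AUDIO_HOURS         # ~$0.02 per audio hour, self-hosted

saving = (openai_cost - runpod_cost) / openai_cost
print(f"${openai_cost:.0f} vs ${runpod_cost:.0f}: {saving:.0%} cheaper")
# -> $36 vs $2: 94% cheaper

# The $0.02/audio-hour figure also implies each $0.50/hour GPU chews
# through roughly 25 hours of audio per GPU-hour.
audio_hours_per_gpu_hour = 0.50 / 0.02   # -> 25.0
```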
Anyway, that's my story of building the Octopoddy ad-detection pipeline. Please try it out; I'd love to hear what you think. I'd be happy to provide more details on any of this in the comments :)