The pipeline is straightforward at a high level: transcribe episodes with faster_whisper running locally on an RTX 3060, run the text through GPT-5-mini to pull out structured book mentions, and store everything in Azure SQL and Blob storage. The frontend is a small Flask app using HTMX, Tailwind, and some D3 for the visualizations.
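The transcription step looks roughly like this (model size, options, and the helper name are illustrative, not the exact config):

```python
# Minimal sketch of the transcription step with faster_whisper
# (model size and options here are illustrative).
from faster_whisper import WhisperModel

# Runs locally on the GPU; fp16 keeps VRAM usage manageable.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

def transcribe_episode(audio_path: str) -> str:
    # transcribe() returns a lazy generator of segments plus language/duration info
    segments, _info = model.transcribe(audio_path, vad_filter=True)
    return " ".join(segment.text.strip() for segment in segments)
```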
The part that turned out to be far more time consuming than expected was deduping. Everything else scaled nicely, but normalizing book titles is still the one piece I can’t fully automate without quality drifting. Fuzzy matching gets you most of the way, but the long tail of book titles is huge. I ended up building a tiny internal Flask UI just to confirm or split fuzzy matches by hand; it also lets me review the surrounding context for each mention to check accuracy. It's the only place in the system where a human is still in the loop.
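To give a sense of the first pass, it's essentially this kind of scoring, with the review UI sitting on top of the candidates it produces (rapidfuzz here is a stand-in, and the cutoff is illustrative):

```python
# Sketch of first-pass candidate matching before human review
# (library choice, scorer, and cutoff are illustrative).
from rapidfuzz import fuzz, process, utils

def candidate_matches(new_title: str, known_titles: list[str], cutoff: float = 90.0):
    """Return (title, score, index) tuples that plausibly refer to the same book."""
    return process.extract(
        new_title,
        known_titles,
        scorer=fuzz.token_sort_ratio,
        processor=utils.default_process,  # lowercases and strips punctuation consistently
        score_cutoff=cutoff,
        limit=5,
    )
```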
A few other unexpected issues came up: some podcast RSS feeds randomly duplicate or link to broken episodes, CUDA can crash if I’m not careful with garbage collection between Whisper runs, and LLM extraction occasionally fails if the model doesn’t return exactly the JSON shape I expect.
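For the JSON issue, the guard amounts to shape-checking the response before anything touches the database; something along these lines (the expected fields are illustrative, not the real schema):

```python
# Sketch of the shape check on the LLM's extraction output
# (required fields here are illustrative).
import json

REQUIRED_FIELDS = {"title", "author", "context"}

def parse_mentions(raw_response: str) -> list[dict] | None:
    """Return the list of mention dicts, or None if the response isn't the expected shape."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, list):
        return None
    for item in data:
        if not isinstance(item, dict) or not REQUIRED_FIELDS.issubset(item):
            return None
    return data
```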
One surprising pattern emerged: the long tail is enormous. A handful of books are mentioned constantly, but thousands more appear exactly once.
If you want to see the current state of it, the reports and visualizations are here: https://www.mavensignal.com
Happy to answer anything about the pipeline, LLM prompting, dedupe logic, or the stack in general.