It’s super interesting to me how much processing it takes to make audio/video fully searchable. Like, extracting the audio and video streams, transcribing the audio, chunking the video into 15-sec scenes and describing each one visually, etc.
I wonder if, as a test, you could take the video descriptions, run them as prompts through something like Veo, then stitch the generated clips together into something close to the original. Wild.
mkauffman23•6h ago
Here's a TLDR:

- Built a full pipeline that processes audio/video → transcription + vision descriptions → chunking → indexing
- Audio: faster-whisper with large-v3-turbo (4x faster than vanilla Whisper)
- Video: chose Vision LLM descriptions over native multimodal embeddings (2x faster, 6x cheaper, better results). 15-second video chunks hit the sweet spot for detail vs. context
- Source attribution with direct links to exact timestamps
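For anyone curious what the audio leg looks like, here's a minimal sketch with faster-whisper and large-v3-turbo (the model name is from the TLDR above; the file path and decode options are just placeholders, not our exact config):

```python
# Minimal sketch of the transcription step: faster-whisper + large-v3-turbo.
# Path and options are illustrative; swap in whatever your pipeline uses.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="auto", compute_type="auto")

# transcribe() yields segments with start/end times, which is what lets the
# index link back to exact timestamps for source attribution.
segments, info = model.transcribe("talk.mp3")
for seg in segments:
    print(f"[{seg.start:7.2f} -> {seg.end:7.2f}] {seg.text.strip()}")
```

And a rough sketch of the video leg under the 15-second chunking. ffmpeg's segment muxer is one standard way to cut chunks (not necessarily the exact command we run), and `describe_chunk` is a hypothetical stand-in for the Vision LLM call:

```python
# Sketch of the video leg: cut ~15-second chunks, describe each with a
# Vision LLM. With stream copy, ffmpeg snaps cuts to keyframes, so chunks
# are only approximately 15s. describe_chunk() is a placeholder.
import subprocess
from pathlib import Path

CHUNK_SECONDS = 15  # the detail-vs-context sweet spot from the TLDR

def split_video(src: str, out_dir: str) -> list[Path]:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", src, "-c", "copy", "-f", "segment",
         "-segment_time", str(CHUNK_SECONDS), "-reset_timestamps", "1",
         str(out / "chunk_%04d.mp4")],
        check=True,
    )
    return sorted(out.glob("chunk_*.mp4"))

def describe_chunk(chunk: Path) -> str:
    """Placeholder: sample frames from the chunk, send to a Vision LLM."""
    raise NotImplementedError("plug in your Vision LLM call here")

for i, chunk in enumerate(split_video("talk.mp4", "chunks")):
    start_s = i * CHUNK_SECONDS  # chunk offset -> timestamp for attribution
    print(start_s, describe_chunk(chunk))
```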
Happy to answer any further questions folks might have!
bobremeika•6h ago