It uses the transformers library with a device-aware backend: it prioritizes CUDA, then MPS (for Apple Silicon Macs), and finally falls back to CPU. I've found that Qwen 2.5-1.5B strikes a good balance between speed and summary quality for this specific task.
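The device-fallback order described above could look roughly like this (a minimal sketch; `pick_device` is an illustrative name, not from the project):

```python
# Hypothetical device-selection helper mirroring the described fallback order:
# CUDA first, then Apple MPS, then CPU. Also degrades to "cpu" if torch
# is not installed at all.
def pick_device() -> str:
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```

The returned string can then be passed straight to `transformers` (e.g. as the `device` argument of `pipeline`).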
How it works:
- Extracts the transcript via yt-dlp.
- Performs extractive compression if the text exceeds the model's context window.
- Summarizes via local inference with streaming output.
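For the compression step, a simple frequency-based extractive approach could work as a sketch (the function name, scoring scheme, and word budget here are all illustrative assumptions, not the project's actual implementation):

```python
# Hypothetical extractive-compression sketch: when a transcript is too long,
# keep the highest-scoring sentences (scored by average word frequency)
# in their original order, up to a rough word budget.
import re
from collections import Counter

def compress(text: str, max_words: int = 100) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence: str) -> float:
        # Average frequency of the sentence's words across the whole text.
        tokens = re.findall(r"\w+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    kept, budget = set(), 0
    for i in ranked:
        n = len(re.findall(r"\w+", sentences[i]))
        if budget + n > max_words:
            continue
        kept.add(i)
        budget += n
    # Re-emit the kept sentences in their original order.
    return " ".join(sentences[i] for i in sorted(kept))
```

In practice you'd likely budget in model tokens rather than words, but the shape of the step is the same.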
I'd appreciate any feedback, especially on optimization!