I’m Ashu, founder of VideoDB. I’ve spent a big chunk of my life building video infrastructure. Not video creation. Video plumbing.
The stuff you only learn after production breaks: timebases, VFR, keyframes, audio sync drift, container quirks, partial uploads, live streams, retries, backpressure, codecs, ffmpeg flags, cost blowups, and “why is this clip unseekable on one player but fine on another”.
This week we released VideoDB Skills, a skill pack that lets AI agents call those infra primitives directly, instead of wiring pipelines out of screenshots and FFmpeg glue.
Repo: https://github.com/video-db/skills
What it enables (infra level):
- Ingest videos and live streams
- Index and search moments
- Return playable evidence links
- Run server-side edits and transforms
- Trigger automations from video events
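To make "playable evidence links" concrete: a search hit with start/end offsets can become a URL that seeks straight to the moment, e.g. via W3C Media Fragments (`#t=start,end`). This helper is a hypothetical sketch of the idea, not the actual VideoDB API.

```python
# Hypothetical sketch: turn a (stream_url, start, end) search hit into a
# playable "evidence link" using W3C Media Fragments (#t=start,end).
# Illustrative only -- not how VideoDB necessarily constructs its links.

def evidence_link(stream_url: str, start_s: float, end_s: float) -> str:
    """Append a temporal media fragment so players seek to the moment."""
    return f"{stream_url}#t={start_s:g},{end_s:g}"

print(evidence_link("https://example.com/stream.m3u8", 12.5, 34.0))
# -> https://example.com/stream.m3u8#t=12.5,34
```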
Why this matters for agents:
Agents can reason, write code, and browse. But continuous media is still mostly invisible to them. In an agentic world, perception needs to be a first-class interface, not a manual workflow.
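Here is one way a minimal perception interface might be typed. The names and shape are mine, to make the abstraction question below concrete; they are not VideoDB's API.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol, runtime_checkable

# A hypothetical sketch of "perception as a first-class interface".
# All names here are illustrative assumptions, not the VideoDB API.

@dataclass
class Moment:
    source_id: str      # which video/stream the moment came from
    start_s: float      # offset into the media, in seconds
    end_s: float
    label: str          # what was detected or matched
    evidence_url: str   # playable link an agent (or human) can verify

@runtime_checkable
class Perception(Protocol):
    def ingest(self, url: str) -> str: ...                        # returns a source_id
    def search(self, query: str) -> Iterable[Moment]: ...         # indexed moments
    def subscribe(self, source_id: str) -> Iterable[Moment]: ...  # live events
```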
Try it quickly:
npx skills add video-db/skills
Then inside your agent: /videodb setup
A few prompts to test:
1. “Upload this URL and give me a playable stream link”
2. “Search this folder for scenes with <keyword> and return clips”
3. “Capture my screen for 2 minutes and give me a structured summary”
4. “Monitor this RTSP feed and log events with timestamps”
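For prompt 4, a structured, timestamped log is what makes the output useful downstream. A sketch of what one logged event could look like; this schema is an assumption, not VideoDB's actual format.

```python
import json
from dataclasses import asdict, dataclass

# Hypothetical shape of one event from "monitor this RTSP feed and log
# events with timestamps". An illustrative assumption, not VideoDB's schema.

@dataclass
class FeedEvent:
    ts: str        # wall-clock timestamp, ISO 8601
    feed: str      # RTSP source URL
    event: str     # what was detected
    clip_url: str  # playable evidence link for the moment

e = FeedEvent("2025-01-01T12:00:00Z", "rtsp://cam1/stream",
              "person_entered", "https://example.com/clip/abc")
print(json.dumps(asdict(e)))
```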
What I’m looking for from HN:
1. Does this feel like the right abstraction layer for perception in agent stacks?
2. What would you consider the minimum viable “perception API”?
3. Where do you think this fails in the real world: latency, cost, privacy, reliability?
If you try it and it breaks, tell me the agent, OS, and the error output. I’ll fix it.