I spent the last few weeks exploring whether AI systems could benefit from generating video predictions before making decisions—like how humans mentally simulate "what happens if I pour this coffee?" before acting.
The idea: Show an AI an image, ask "what happens if I push this?", have it generate a video prediction, then compare that prediction to reality. If the prediction looks wrong, maybe the AI could catch its own mistakes.
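For concreteness, here's a minimal sketch of the loop I mean. Every name in it is a placeholder made up for this post: predict_video stands in for the video generator, perceptual_distance stands in for LPIPS, and the flag threshold is arbitrary.

```python
import numpy as np

def predict_video(image: np.ndarray, action: str, n_frames: int = 8) -> np.ndarray:
    # Stand-in generator: naively repeats the current frame n_frames times.
    # A real system would condition a video model on the image and the action.
    return np.stack([image] * n_frames)

def perceptual_distance(pred: np.ndarray, real: np.ndarray) -> float:
    # Stand-in for a perceptual metric like LPIPS: plain mean squared error.
    return float(np.mean((pred.astype(np.float32) - real.astype(np.float32)) ** 2))

def simulate_then_check(image: np.ndarray, action: str,
                        observed_video: np.ndarray, threshold: float = 0.5) -> dict:
    """Predict the outcome of an action as video, compare it to what actually
    happened, and flag the attempt if the prediction looks too far off."""
    predicted = predict_video(image, action, n_frames=observed_video.shape[0])
    gap = perceptual_distance(predicted, observed_video)
    return {"gap": gap, "self_flagged": gap > threshold}
```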
The result: Current models can't do this. But I learned some interesting things along the way.
What I tested:
- 7 different architectures for predicting future video frames from VLM latent space (a rough sketch of the simplest variant follows this list)
- Whether perceptual similarity (LPIPS) between predicted and actual video correlates with correctness
- Self-correction loops where the model gets feedback on its predictions
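On the first bullet: the simplest shape such an architecture can take is a small residual adapter that maps the VLM's image latent to a predicted future-frame latent. The sketch below is illustrative only; the dimensions, layer sizes, and residual formulation are placeholder assumptions, not the seven variants in the repo.

```python
import torch
import torch.nn as nn

class LatentAdapter(nn.Module):
    """Illustrative adapter: map a VLM image latent to a predicted latent for a
    future frame. Sizes here are placeholders, not the 10M/100M configs tested."""
    def __init__(self, vlm_dim: int = 1024, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vlm_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, vlm_dim),
        )

    def forward(self, current_latent: torch.Tensor) -> torch.Tensor:
        # Predict the *change* on top of the current latent, so the identity
        # ("nothing changes") is trivially representable.
        return current_latent + self.net(current_latent)

def copy_frame_baseline(current_latent: torch.Tensor) -> torch.Tensor:
    # The baseline every learned predictor has to beat: assume the future
    # frame is identical to the current one.
    return current_latent.clone()

if __name__ == "__main__":
    latent = torch.randn(1, 1024)  # fake VLM image latent
    print(LatentAdapter()(latent).shape)                     # torch.Size([1, 1024])
    print(torch.equal(copy_frame_baseline(latent), latent))  # True
```

The residual form matters for the comparison below: if the adapter learns nothing useful, its fallback is predicting "no change," which is exactly the copy-frame baseline.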
Key findings:
1. VLMs can't predict the future – Every architecture I tried performed worse than just copying the current frame as the "prediction." The model understands what's in an image but can't predict what will change.
2. Visual similarity ≠ semantic correctness – This one surprised me. Wrong predictions often looked MORE similar to reality than correct ones (LPIPS correlation: 0.106). You can't use "does it look right?" to catch mistakes. (There's a sketch of this check right after the list.)
3. Some things worked – Hybrid encoders (DINOv2 + VLM) preserve spatial information that VLMs lose. VLMs understand generated video well (93% semantic retention). Small adapters (10M params) work better than large ones (100M).
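For finding #2, the check boils down to "does perceptual distance predict correctness?" A point-biserial correlation between per-example LPIPS and a binary correctness label is one way to compute it; the numbers below are made up, just to show the shape of the check.

```python
import numpy as np
from scipy.stats import pointbiserialr

# Per-example LPIPS between the predicted video and reality (lower = more
# visually similar), and whether the prediction was semantically correct.
# These values are fabricated for illustration.
lpips = np.array([0.31, 0.45, 0.28, 0.52, 0.40, 0.33, 0.47, 0.36])
correct = np.array([1, 0, 0, 1, 0, 1, 0, 1])

# Point-biserial correlation between a binary label and a continuous score.
# A value near zero means "looks like reality" tells you almost nothing
# about "was actually right".
r, p = pointbiserialr(correct, lpips)
print(f"correlation = {r:.3f}, p = {p:.3f}")
```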
I'm releasing this as a benchmark proposal. Video generation is improving fast—capabilities that don't exist today might emerge in future models. Seems worth tracking.
Links:
- Demo video: https://youtu.be/YJxDt_zCrUI
- Code + paper: https://github.com/a1j9o94/foresight
- Live demo: https://foresight-demo-kappa.vercel.app
Built with Qwen2.5-VL, LTX-Video, Modal (GPUs), and the Something-Something v2 dataset.
Happy to answer questions about the experiments or methodology.