I spent the last few weeks exploring whether AI systems could benefit from generating video predictions before making decisions—like how humans mentally simulate "what happens if I pour this coffee?" before acting.
The idea: Show an AI an image, ask "what happens if I push this?", have it generate a video prediction, then compare that prediction to reality. If the prediction looks wrong, maybe the AI could catch its own mistakes.
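For concreteness, here's a minimal sketch of the loop I mean. Every name in it is a placeholder made up for this post: predict_video stands in for the video generator, perceptual_distance stands in for LPIPS, and the flag threshold is arbitrary.

```python
import numpy as np

def predict_video(image: np.ndarray, action: str, n_frames: int = 8) -> np.ndarray:
    # Stand-in generator: naively repeats the current frame n_frames times.
    # A real system would condition a video model on the image and the action.
    return np.stack([image] * n_frames)

def perceptual_distance(pred: np.ndarray, real: np.ndarray) -> float:
    # Stand-in for a perceptual metric like LPIPS: plain mean squared error.
    return float(np.mean((pred.astype(np.float32) - real.astype(np.float32)) ** 2))

def simulate_then_check(image: np.ndarray, action: str,
                        observed_video: np.ndarray, threshold: float = 0.5) -> dict:
    """Predict the outcome of an action as video, compare it to what actually
    happened, and flag the attempt if the prediction looks too far off."""
    predicted = predict_video(image, action, n_frames=observed_video.shape[0])
    gap = perceptual_distance(predicted, observed_video)
    return {"gap": gap, "self_flagged": gap > threshold}
```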
The result: Current models can't do this. But I learned some interesting things along the way.
What I tested:
- 7 different architectures for predicting future video frames from VLM latent space (a rough sketch of the simplest variant follows this list)
- Whether perceptual similarity (LPIPS) between predicted and actual video correlates with correctness
- Self-correction loops where the model gets feedback on its predictions
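On the first bullet: the simplest shape such an architecture can take is a small residual adapter that maps the VLM's image latent to a predicted future-frame latent. The sketch below is illustrative only; the dimensions, layer sizes, and residual formulation are placeholder assumptions, not the seven variants in the repo.

```python
import torch
import torch.nn as nn

class LatentAdapter(nn.Module):
    """Illustrative adapter: map a VLM image latent to a predicted latent for a
    future frame. Sizes here are placeholders, not the 10M/100M configs tested."""
    def __init__(self, vlm_dim: int = 1024, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vlm_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, vlm_dim),
        )

    def forward(self, current_latent: torch.Tensor) -> torch.Tensor:
        # Predict the *change* on top of the current latent, so the identity
        # ("nothing changes") is trivially representable.
        return current_latent + self.net(current_latent)

def copy_frame_baseline(current_latent: torch.Tensor) -> torch.Tensor:
    # The baseline every learned predictor has to beat: assume the future
    # frame is identical to the current one.
    return current_latent.clone()

if __name__ == "__main__":
    latent = torch.randn(1, 1024)  # fake VLM image latent
    print(LatentAdapter()(latent).shape)                     # torch.Size([1, 1024])
    print(torch.equal(copy_frame_baseline(latent), latent))  # True
```

The residual form matters for the comparison below: if the adapter learns nothing useful, its fallback is predicting "no change," which is exactly the copy-frame baseline.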
Key findings:
1. VLMs can't predict the future – Every architecture I tried performed worse than just copying the current frame as the "prediction." The model understands what's in an image but can't predict what will change.
2. Visual similarity ≠ semantic correctness – This one surprised me. Wrong predictions often looked MORE similar to reality than correct ones (LPIPS correlation: 0.106). You can't use "does it look right?" to catch mistakes. (There's a sketch of this check right after the list.)
3. Some things worked – Hybrid encoders (DINOv2 + VLM) preserve spatial information that VLMs lose. VLMs understand generated video well (93% semantic retention). Small adapters (10M params) work better than large ones (100M).
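For finding #2, the check boils down to "does perceptual distance predict correctness?" A point-biserial correlation between per-example LPIPS and a binary correctness label is one way to compute it; the numbers below are made up, just to show the shape of the check.

```python
import numpy as np
from scipy.stats import pointbiserialr

# Per-example LPIPS between the predicted video and reality (lower = more
# visually similar), and whether the prediction was semantically correct.
# These values are fabricated for illustration.
lpips = np.array([0.31, 0.45, 0.28, 0.52, 0.40, 0.33, 0.47, 0.36])
correct = np.array([1, 0, 0, 1, 0, 1, 0, 1])

# Point-biserial correlation between a binary label and a continuous score.
# A value near zero means "looks like reality" tells you almost nothing
# about "was actually right".
r, p = pointbiserialr(correct, lpips)
print(f"correlation = {r:.3f}, p = {p:.3f}")
```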
I'm releasing this as a benchmark proposal. Video generation is improving fast—capabilities that don't exist today might emerge in future models. Seems worth tracking.
Links:
- Demo video: https://youtu.be/YJxDt_zCrUI
- Code + paper: https://github.com/a1j9o94/foresight
- Live demo: https://foresight-demo-kappa.vercel.app
Built with Qwen2.5-VL, LTX-Video, Modal (GPUs), and the Something-Something v2 dataset.
Happy to answer questions about the experiments or methodology.