The dawn of a world simulator

https://odyssey.ml/the-dawn-of-a-world-simulator

80•olivercameron•1mo ago

Comments

pedalpete•1mo ago

This looks interesting, but can someone explain to me how this is different from video generators using the previous frames as inputs to expand on the next frame?

Is this more than recursive video? If so, how?

smusamashah•1mo ago

See the demo on their homepage. Calling it a world simulator is a marketing gimmick. It's a worse video generator but you can interact with it in real time and direct the video a little bit. Next version of this thing will be worth looking, this one isnt.

nowittyusername•1mo ago

There is soo much marketing bs around these things it drives me nuts. and it doesn't help that the large labs and credible individuals like denis use these terms. "world models" are video generator with contextual memory but that term is soo misplaced. when one thinks of a "world model" you expect the thing to be at least be physics engine driven from its foundation, not the other way around where everything is generated and assumed at best.

netsharc•1mo ago

It's plato's cave, but in color! https://en.wikipedia.org/wiki/Allegory_of_the_cave / https://www.youtube.com/watch?v=1RWOpQXTltA&t=56s

Animats•1mo ago

> Calling it a world simulator is a marketing gimmick.

Yes, it should be called an AI Metaverse.

It does do a nice job of short term prediction. That's useful as a component of common sense.

vrighter•1mo ago

why would you assume anything about "the next version"?

smusamashah•1mo ago

Based it on other video models, all the ones I have seen keep improving. This one should too. Infact, Google is doing it already with their Genie (IIRC). That one is high quality and interactive.

superb_dev•1mo ago

None of these examples videos seem like the kind of “experiments” that they’re talking about simulating with these models.

I was expecting them to test a simple hypothesis and compare the model results to a real world test

ainiriand•1mo ago

It is not a world simulator, looks like a world fantasy.

rmnclmnt•1mo ago

For a minute I was like (spoiler alert) « wow the creepy sci-fi theories from the DEVS tv show is taking place »… then I looked up the video and that’s just video generation at this point

qingcharles•1mo ago

That's where this is headed, though. That's the end game.

rmnclmnt•1mo ago

This should be interesting then: we’ll finally be able to assert whether time is deterministic and the future and past can be modelled/predicted (if you’ve seen the show you know what I mean)

jaggederest•1mo ago

I think that's actually already provably false if you're bloody-minded enough. I think the proof lies somewhere like Cantor's diagonalization but applied to reality, something like "if you could produce a model sufficiently complex enough to model the future perfectly it wouldn't fit into this current reality because it would require more than this reality's information"

I'm not saying it couldn't be locally violated, but it seems straightforward philosophically that each nesting doll of simulated reality must be imperfect by being less complicated.

nylonstrung•1mo ago

I can't wait for companies like this to run out of money

anigbrowl•1mo ago

This appears to be a simulator that produces only nice things.

01HNNWZ0MV43FF•1mo ago

Only SFW, too

arminiusreturns•1mo ago

I'm doing a metasim in full 3D with physics, I just keep seeing the limitations of the video format too much, but it is amazing when done right. The other biggest concern is licensing of output.

godelski•1mo ago

As a machine learning researcher, I don't get why these are called world models.

Visually, they are stunning. But it's nowhere near physical. I mean look at that video with the girl and lion. The tail teleports between legs and then becomes attached to the girl instead of the tiger.

Just because the visuals are high quality doesn't mean it's a world model or has learned physics. I feel like we're conflating these things. I'm much happier to call something a world model if its visual quality is dogshit but it is consistent with its world. And I say its world because it doesn't need to be consistent with ours

nurettin•1mo ago

> Visually, they are stunning.

The input images are stunning, model's result is another disappointing trip to uncanny valley. But we feel Ok as long as the sequence doesn't horribly contradict the original image or sound. That is the world model.

godelski•1mo ago

  > But we feel Ok as long as the sequence doesn't horribly contradict the original image or sound.

Is the error I pointed out not "horribly contradicting"?

  > That is the world model.

I would say that if it is non-physical[0] then it's hard to call it a /world/ model. A world is consistent and has a set of rules that must be followed.

I've yet to see a claimed world model that actually captures this behavior. Yet it's something every game engine[1] gets very well. We'd call it a bad physics engine if they made the same mistakes we see even the most advanced "world models" do.

This is part of why I'm trying to explain that visual quality is actually orthogonal. Even old Atari games have consistent world models despite being pixelated. Or think about Mario on the original NES. Even the physics breaking in that game are more edge cases and not the norm. But here, things like the lion's tail is not consistent even to a 2D world. I've never bought the explanation that teleporting in front of and behind the leg is an artifact of embedding 3D into 2D[2] because the issue is actually the model not understanding collision and occlusion. It does not understand how the sections relate to one another in the image.

The major problem with these systems is that they just hope that the physics is recovered through enough examples of videos. Yet if one studied physics (beyond your basic college courses) you'd understand the naïveté of that. It took a long time to develop physics due to these specific limitations. These models don't even have the advantage of being able to interact with the environment. They have no mechanisms to form beliefs and certainly no means to test them. It's essentially impossible to develop physics through observation alone

[0] with respect to the physics of the world being simulated. I want you distinguish real world physics from /a physics/

[1] a game physics engine is a world model. Which, as in stressing in [0], does not necessarily need follow real world physics. Mistakes happen of course but things are generally consistent.

[2] no video and almost no game is purely 2D. They tend to have backgrounds which places some layering but we'll say 2D for convenience and since we have a shared understanding

kgeist•1mo ago

>A world is consistent and has a set of rules that must be followed.

Large language models are mostly consistent, but they have mistakes even in grammar too, from time to time. And it's usually called a "hallucination". Can't we say physics errors are a kind of "hallucination" too, in a world model? I guess the question is, what hallucination rate are we willing to tolerate.

godelski•1mo ago

It's not about making no mistakes, it's about the categorical type of mistakes.

Let's consider language as a world, in some abstract sense. Lies may (or may not) be consistent here. Do they make sense linguistically? But then think about the category of errors where they start mixing languages and sound entirely nonsensical. That's rare with current LLMs in standard usage but you can still get them to have full on meltdowns.

This is the class of mistakes these models are making, not the failing to recite truth class of mistakes.

(Not a perfect translation but I hope this explanation helps)

nurettin•1mo ago

> Yet if one studied physics (beyond your basic college courses) you'd understand the naïveté of that.

I studied enough physics to get a mech. eng. diploma. And I still understand the naivete. Observational physics can be derived with ml, and I have derived them, but not with neural nets. Or if you do it with neural nets, you can't alpha zero it, you need to cheat.

throwup238•1mo ago

> The major problem with these systems is that they just hope that the physics is recovered through enough examples of videos. Yet if one studied physics (beyond your basic college courses) you'd understand the naïveté of that. It took a long time to develop physics due to these specific limitations. These models don't even have the advantage of being able to interact with the environment. They have no mechanisms to form beliefs and certainly no means to test them. It's essentially impossible to develop physics through observation alone

Sounds like these world models are speed running from Platonic ideals to empiricism.

IAmGraydon•1mo ago

>As a machine learning researcher, I don't get why these are called world models.

It's called "world models" because it's a grift. An out-in-the-open, shameless grift. Investors, pile on.

godelski•1mo ago

I'm just trying to be a bit more political as it can be hard to communicate the issues. My first degree is actually in physics and I'll just say... over there "world model" implies something very different.

Edit: I said a bit more in the reply to the sibling comment. But we're probably on a similar page.

maplethorpe•1mo ago

The tail teleports and reattaches because that is the sort of thing that happens in this special AI world. Even though it looks like a bug, it's actually a physical process being modelled accurately.

godelski•1mo ago

I'll remind you I am a ML researcher.

So, you need to say more. Or at least give me some reason to believe you rather than state something as an objective truth and "just trust me". In the long response to a sibling I state more precisely why I have never bought this common conjecture. Because that's what it is, conjecture.

So give me at least some reason to believe you. Because you have neither logos nor ethos. Your answer is in the form of ethos, but without the critical requisites.

keithluu•1mo ago

I think they're joking.

godelski•1mo ago

If so, I misread and sorry. Their sarcasm is too on point, mimicking claims I've heard made in earnest.

KaiserPro•1mo ago

I think the issue is that "world models" are poorly defined.

With this kind of image gen, you can sorta plan robot interactions, but its super slow. I need to find the paper that deepmind produced, but basically they took the current camera input, used a text prompt like "robot arm picks up the ball", the video generated the arm motion, then the robot arm moved as it did in the video.

The problem is that its not really a world model, its just image gen. Its not like the model outputs a simulation that you can interact with (without generating more video) Its not like it creates a bunch of rough geo that you can then run physics on (ie you imagine a setup, draw it out and then run calcs on it.)

There is lots of work on making splats editable and semantically labeled, but again thats not like you can run physics on them so simulation is still very expensive. Also the properties are dependent on running the "world model" rather than querying the output at a point in time

godelski•1mo ago

  > poorly defined.

Poorly defined is not the same as undefined. There are bounds and we have a decent understanding of what this means. Not having the details all worked out is not the same. Though that lack of precision is being used to get away with more slop.

  > I need to find the paper that deepmind produced

I've seen that paper and the results pretty close to the action. I've even personally talked with people that worked on that paper. It very frequently "forgets" what is outside its view and it very frequently performs non-physically consistent actions. When you evaluate those models don't just try standard things, do weird things. Like keep trying to extend the grabber arm and it shouldn't jump to other parts of the screen.

  > The problem is that its not really a world model, its just image gen.

Yes, that was my point. Since you agree I'm not sure why you're disagreeing.

KaiserPro•1mo ago

I don't think I'm disagreeing, just adding more colour.

> It very frequently "forgets" what is outside its view

This was the observations that I saw when we were testing it. My former lab was late to pivoting to robotics, so we were looking at the current state of play to see what machine perception stuff is out there for robotics.

godelski•1mo ago

Ah, thanks for the clarification. It can be hard to interpret on these forums sometimes.

mycall•1mo ago

Have you looked at Titan and MIRAS where they use online/updating associative memory that happens to be read out via next-token prediction?

https://research.google/blog/titans-miras-helping-ai-have-lo...

https://arxiv.org/abs/2501.00663

https://arxiv.org/pdf/2504.13173

Much research is going into these directions, but I'm more interested in mind-wandering tangents, involving both attentional control and additional mechanisms (memory retrieval, self-referential processing).

KaiserPro•1mo ago

Memory in world models is interesting. But I think the main issue is that its holding everything in pixel space (its not, but it feels like that) rather than concept space. Thats why its hard for it to synthesise consistently.

However I am not qualified really to make that assertion.

andy12_•1mo ago

You just have to extrapolate the improvements in consistency in image model from the last couple of years and apply it to these kinds of video models. When in a couple of years they can generate consistent videos of many physical phenomena such that they are nearly indistinguishably from reality, you'll se why they are called "world models".

nl•1mo ago

The reason they are called "world models" is because the internal representation of what they display represents a "world" instead of a video frame or image. The model needs to "understand" geometry and physics to output a video.

Just because there are errors in this doesn't mean it isn't significant. If a machine learning model understands how physical objects interact with each other that is very useful.

PunchyHamster•1mo ago

I think the reason is "those words look nice on promo material". It is absolutely build to trigger hype from the clueless

godelski•1mo ago

  > what they display represents a "world" instead of a video frame or image.

Do they?

I'm unconvinced. The tiger and girl video is the clearest example. Nothing about that seems world representing

slashdave•1mo ago

> The model needs to "understand" geometry and physics to output a video.

No it doesn't. It merely needs to mimic.

IAmGraydon•1mo ago

Correct. The fact that AI is a black box means we can easily imagine anything we want happening within that box. Or perhaps the more accurate way to say it - AI companies can convince investors of amazing magic happening within that box. With LLMs, we anthropomorphize and imagine it’s thinking. With video models, they’re now trying to convince us that it understands the world. None of these things are true. It’s all an illusion.

slashdave•1mo ago

It's worse than that. It's not a black box. We know how the architecture is constructed. We can read the weights.

in-silico•1mo ago

Here's a recent paper showing that models trained to generate videos develop strong geometric representations and understanding:

https://arxiv.org/abs/2512.19949

LarsDu88•1mo ago

I feel like there's a bit if a disconnect with the cool video demos demonstrated here and say, the type of world models someone like Yann Lecunn is talking about.

A proper world model like Jepa should be predicting in latent space where the representation of what is going on is highly abstract.

Video generation models by definition are either predicting in noise or pixel space (latent noise if the diffuser is diffusing in a variational encoders latent space)

It seems like what this lab is doing is quite vanilla, and I'm wondering if they are doing any sort of research in less demo sexy joint embedding predictive spaces.

There was a recent paper, LeJepa from LeCunn and a postdoc that actually fixes many of the mode distribution collapse issues with the Jepa embedding models I just mentioned.

I'm waiting on the startup or research group that gives us an unsexy world model. Instead of giving us 1080p video of supermodels camping, gives us a slideshow of something a 6 year old child would draw. That would be a more convincing demonstrator of an effective world model.

blueblisters•1mo ago

Dreamer4 (https://danijar.com/project/dreamer4/) is a promising direction (by a frontier lab)

jstanley•1mo ago

> Video generation models by definition are either predicting in noise or pixel space

I don't see that this follows "by definition" at all.

Just because your output is pixel values doesn't mean your internal world model is in pixel space.

LarsDu88•1mo ago

You need to train a decoder either end to end or conditioned on latents.

In either case the impressiveness of that decoder can be far removed from the effectiveness of your world model or involve no world model at all

jstanley•1mo ago

Making convincing videos of the world without having a world model would be like writing convincing essays about computing without understanding computing.

LarsDu88•1mo ago

There's a few things to consider here: - there are many aspects to the video that are not convincing, indicating these videogen models do not grok the world the same way a typical human does - A 6 year old child is almost certainly incapable of recreating pixel level fidelity video footage, yet understands the world extremely well... far beyond what current robotics is capable of.

The two facts above should be indicative that predicting noise (as with DDPM diffusion models), or predicting pixel level (or even VAE latent "pixel") information is probably not the optimal path to world understanding. Probably not even a good path to good world models.

zkmon•1mo ago

Please AI - lions have their tail attached to their back, not front. The lion's tail in the video of Girl with a lion is misplaced.

alex1138•1mo ago

I guess this might be a chance to plug the fact that Matrix came up with their own Metaverse thing (for lack of a better word) called Third Room, it represented the rooms you joined as spaces/worlds, they built some limited functionality demos before the funding dried up

jyunth•1mo ago

Interesting. I imagine quite a few issues would seem to stem out of the inherent nature of generative AI, we even see several in these demos themselves. One particularly stood out to me, the one where the man is submerged, and for a good while bubbles come out quite consistently out of his mask, and then suddenly one of the bubbles turn into a jellyfish. At a specific frame, the AI thought it looked more like a jellyfish than a bubble and now the world has a jellyfish to deal with now.

It'll surely take a looot of video data, even more than what humans can possibly produce to build a normalized, euclidean, physics adherent world model. Data could be synthetically generated, checked thoroughly and fed to the training process, but at the end of the day it seems.... Wasteful. As if we're looking at a local optima point.

actionfromafar•1mo ago

The videos are like waking up from a dream, monstrous inexplicable details.

mvkel•1mo ago

Given the near-impossibility of predicting something as "simple" as a stock market due to its recursive nature, I'm not sure I see how it would be possible to simulate an infinitely more complicated "world"

Slint: Cross Platform UI Library

AI and Education: Generative AI and the Future of Critical Thinking

Maple Mono: Smooth your coding flow

Moltbook isn't real but it can still hurt you

Take Back the Em Dash–and Your Voice

Show HN: 289x speedup over MLP using Spectral Graphs

Teaching Mathematics

3D Printed Microfluidic Multiplexing [video]

Abstractions Are in the Eye of the Beholder

Show HN: Routed Attention – 75-99% savings by routing between O(N) and O(N²)

We didn't ask for this internet – Ezra Klein show [video]

The Real AI Talent War Is for Plumbers and Electricians

Show HN: MimiClaw, OpenClaw(Clawdbot)on $5 Chips

I Maintain My Blog in the Age of Agents

The Fall of the Nerds

I'm 15 and built a free tool for reading Greek/Latin texts. Would love feedback

How close is AI to taking my job?

You are the reason I am not reviewing this PR

Show HN: FamilyMemories.video – Turn static old photos into 5s AI videos

How Meta Made Linux a Planet-Scale Load Balancer

A Turing Test for AI Coding

How to Identify and Eliminate Unused AWS Resources

A2CDVI – HDMI output from from the Apple IIc's digital video output connector

CLI for Common Playwright Actions

Would you use an e-commerce platform that shares transaction fees with users?

Show HN: SafeClaw – a way to manage multiple Claude Code instances in containers

The Future of the Global Open-Source AI Ecosystem: From DeepSeek to AI+

The Evolution of the Interface

Azure: Virtual network routing appliance overview

Seedance2 – multi-shot AI video generation

Slint: Cross Platform UI Library

AI and Education: Generative AI and the Future of Critical Thinking

Maple Mono: Smooth your coding flow

Moltbook isn't real but it can still hurt you

Take Back the Em Dash–and Your Voice

Show HN: 289x speedup over MLP using Spectral Graphs

Teaching Mathematics

3D Printed Microfluidic Multiplexing [video]

Abstractions Are in the Eye of the Beholder

Show HN: Routed Attention – 75-99% savings by routing between O(N) and O(N²)

We didn't ask for this internet – Ezra Klein show [video]

The Real AI Talent War Is for Plumbers and Electricians

Show HN: MimiClaw, OpenClaw(Clawdbot)on $5 Chips

I Maintain My Blog in the Age of Agents

The Fall of the Nerds

I'm 15 and built a free tool for reading Greek/Latin texts. Would love feedback

How close is AI to taking my job?

You are the reason I am not reviewing this PR

Show HN: FamilyMemories.video – Turn static old photos into 5s AI videos

How Meta Made Linux a Planet-Scale Load Balancer

A Turing Test for AI Coding

How to Identify and Eliminate Unused AWS Resources

A2CDVI – HDMI output from from the Apple IIc's digital video output connector

CLI for Common Playwright Actions

Would you use an e-commerce platform that shares transaction fees with users?

Show HN: SafeClaw – a way to manage multiple Claude Code instances in containers

The Future of the Global Open-Source AI Ecosystem: From DeepSeek to AI+

The Evolution of the Interface

Azure: Virtual network routing appliance overview

Seedance2 – multi-shot AI video generation

The dawn of a world simulator

Comments