Reminds me of when image AIs weren't able to generate text. It wasn't too long until they fixed it.
What I’d really love to see more of is augmented video. Like, the stormtrooper vlogs. Runway has some good stuff but man is it all expensive.
Walking/running/steps have already been solved pretty well with NNs, but simulation of vehicle engines and vehicle physics has not, at least not to my knowledge. I suspect iRacing would be extremely interested in such a model.
Edit: I take it back, PINNs are a thing and now I have a new rabbit hole…
I don't think Humans are the target market for this model, at least right now.
Sounds like the use case is creating worlds for AI agents to play in.
I DECLARE BANKRUPTCY vibes here
It's a whitepaper release to share the SOTA research. This doesn't seem like an economically viable model, nor does it look polished enough to be practically usable.
We know how James Webb works and it's developed by an international consortium of researchers. One of our most trusted international institutions, and very verifiable.
We do not know how Genie works, it is unverifiable to non-Google researchers, and there are not enough technical details to move external teams forward much. Worst case, this page could be a total fabrication intended to derail competition by lying about what Google is _actually_ spending their time on.
We really don't know.
I don't say this to defend the other comment and say you're wrong, because I empathize with both points. But I do think that treating Google with total credulity would be a mistake, and the James Webb comparison is a disservice to the JW team.
I would actually turn that around. The Telescope is released. It's flying around up there taking photos. If they kept it in some garage while releasing flashy PR pages about how groundbreaking it is, then I'd be pretty skeptical.
The main product of the telescope is its data, not the ability for anyone to play with the instruments.
The main product of the model is the ability for anyone to play with it.
Strange rebuttal.
…and this is the worst the capabilities will ever be.
Watching the video created a glimmer of doubt that perhaps my current reality is a future version of myself, or some other consciousness, that’s living its life in an AI hallucinated environment.
Personal jetpacks are the worst they’ll ever be. Doesn’t mean they’re anywhere close to being useful.
Your comparison is incorrect
Not sure if that's what you are trying to say about AI, or not.
Have they become better over the past 20 years?
> …and this is the worst the capabilities will ever be.
I guess if this bothers you (and I can see how it might) you can take some small comfort in thinking that (due to enshittification) this could in fact be the _best_ the capabilities will ever be.
[1]: https://www.worldlabs.ai/
[3]: https://runwayml.com/research/introducing-general-world-mode...
- Google search
- Web browsers
- Web content
- Internet Explorer
- Music
- Flight process at Mosul airport
- Star Wars
(Also, implying that music has gotten worse is a boomer-ass take. It might not be to your liking, but there's more of it than ever before, and new sonic frontiers are being discovered every day.)
And then you watched Mandalorian and Andor?
Jokes aside, Google Search results are worse thanks to so much web content being just ad scaffolding, but the interesting one here is music.
Music is typically imagined to be its best at whatever ages one most listened to it, partly trained in and partly thanks to meanings/memories/nostalgia attached to it. As a consequence, for most everyone, more recent music seems to be “getting worse”!
That said, and back to the SEO effect on Google Results, I'd argue mass distribution/advertising/marketing has resulted in most audio airtime getting objectively* less complex, but if one turns off the mass distribution, and looks around, there seems to be plenty of just as good — even building on what came before — music to be found.
* https://www.researchgate.net/publication/387975100_Decoding_...
Your brain, compared to the sensory richness of the reality you experience around you, has very limited direct inputs from the outside world; it must construct a rich internal model from them.
Due to this physical limitation, what you 'see' in front of you, widely accepted as ground-truth reality, cannot possibly be real: it's a hallucination produced by your brain.
It's very weird (at least to me) that the boundary between reality and assumption (basically educated guessing) is very arbitrary, and definitely only exists in our heads.
The next step is to realize that, if life is a cheap simulation, not everyone might have... uh... fully simulated minds. Player Characters vs NPCs is what gamers would say, though it doesn't have to be binary like that, and the term NPC has already been ruined by social media rants. (Also, NPC is a bad insult because most of the coolest characters in games are NPC rivals or bosses or whatnot.)
> Genie 3’s consistency is an emergent capability
So this just happened from scaling the model, rather than being a consequence of deliberate architecture changes?
Edit: here is some commentary on limitations from someone who tried it: https://x.com/tejasdkulkarni/status/1952737669894574264
> - Physics is still hard and there are obvious failure cases when I tried the classical intuitive physics experiments from psychology (tower of blocks).
> - Social and multi-agent interactions are tricky to handle. 1vs1 combat games do not work
> - Long instruction following and simple combinatorial game logic fails (e.g. collect some points / keys etc, go to the door, unlock and so on)
> - Action space is limited
> - It is far from being a real game engines and has a long way to go but this is a clear glimpse into the future.
Even with these limitations, this is still bonkers. It suggests to me that world models may have a bigger part to play in robotics and real world AI than I realized. Future robots may learn in their dreams...
Unbelievable. How is this not a miracle? So we're just stumbling onto breakthroughs?
It's basically what every major AI lab head has been saying from the start. It's the peanut gallery that keeps saying they are lying to get funding.
We had one breakthrough a couple of years ago with GPT-3, where we found that neural networks / transformers + scale does wonders. Everything else has been a smooth continuous improvement. Compare today's announcement to Genie-2[1] release less than 1 year ago.
The speed is insane, but not surprising if you put it in the context of how fast AI is advancing. Again, nothing _new_. Just absurdly fast continuous progress.
[1] - https://deepmind.google/discover/blog/genie-2-a-large-scale-...
We don't inherit any software, so cognitive function must bootstrap itself from its underlying structure alone.
Hardware and software, as metaphors applied to biology, I think are better understood as a continuum than a binary, and if we don't inherit any software (is that true?), we at least inherit assembly code.
To stay with the metaphor, DNA could be rather understood as firmware that runs on the cell. What I mean with software is the 'mind' that runs on a collection of cells. Things like language, thoughts and ideas.
There is also a second level of software that runs not on a single mind alone, but on a collection of minds, to form cliques or societies. But this is not encoded in genes, but in memes.
I think it's like Chomsky said, that we don't learn this infrastructure for understanding language any more than a bird "learns" their feathers. But I might be losing track of what you're suggesting is software in the metaphor. I think I'm broadly on board with your characterization of DNA, the mind and memes generally though.
> We don't inherit any software
I wonder, though. Many animal species just "know" how to perform certain complex actions without being taught the way humans have to be taught. Building a nest, for example. If you say that this is emergent from the "underlying structure alone", doesn't this mean that it would still be "inherited" software (though in this case, maybe we think of it like punch cards)?
But then you have things like language or societal customs that are purely 'software'.
A biological example that I like: the neural structures for vision develop almost fully formed from the very beginning. The state of our network at initialization is effectively already functional. I’m not sure to which extent this is true for humans, but it is certainly true for simpler organisms like flies. The way cells achieve this is through some extremely simple growth rules as the structure is being formed for the first time. Different kinds of cells behave almost independently of each other, and it just so happens that the final structure is a perfectly functional eye. I’ve seen animations of this during a conference talk and it was one of the most fascinating things I’ve ever seen. It truly shows how the complexity of a biological organism is just billions of times any human technology. And at the same time, it’s a beautiful illustration of the lack of intelligent design. It’s like watching a Lego assemble by just shaking the pieces.
How do you claim to know this?
Not to detract from what has been done here in any way, but it all seems entirely consistent with the types of progress we have seen.
It's also no surprise to me that it's from Google, who I suspect is better situated than any of its AI competitors, even if it is sometimes slow to show progress publicly.
I think this was the first mention of world models I've seen circa 2018.
This is based on VAEs though.
I suppose it depends what you count as "the start". The idea of AI as a real research project has been around since at least the 1950s. And I'm not a programmer or computer scientist, but I'm a philosophy nerd and I know debates about what computers can or can't do started around then. One side of the debate was that it awaited new conceptual and architectural breakthroughs.
I also think you can look at, say, Ted Talks on the topic, with guys like Jeff Hawkins presenting the problem as one of searching for conceptual breakthroughs, and I think similar ideas of such a search have been at the center of Douglas Hofstadter's career.
I think in all those cases, they would have treated "more is different" like an absence of nuance, because there was supposed to be a puzzle to solve (and in a sense there is, and there has been, in terms of vector space and back propagation and so on, but it wasn't necessarily clear that physics could "pop out" emergently from such a foundation).
[1] http://www.incompleteideas.net/IncIdeas/BitterLesson.html
Kind of like how a single neuron doesn't do much, but connect 100 billion of them and well...
So prescient. I definitely think this will be a thing in the near future ~12-18 months time horizon
What's with this insane desire for anthropomorphism? What do you even MEAN learn in its dreams? Fine-tuning overnight? Just say that!
> What's with this insane desire for anthropomorphism?
Devil's advocate: Making the assumption that consciousness is uniquely human, and that humans are "special" is just as ludicrous. Whether a computational medium is carbon-based or silicon-based seems irrelevant. Call it "carbon-chauvinism".
Since consciousness is closely linked to being a moral patient, it is all the more important to err on the side of caution when denying qualia to other beings.
No-one cares. It's just terminology.
A neural net can produce information outside of its original data set, but it is all directly derived from that initial set. There are fundamental information constraints here. You cannot use a neural net to generate, from its existing data set, wholly new and original full-quality training data for itself.
You can use a neural net to generate data, and you can train a net on that data, but you'll end up with something which is no good.
Are you sure? I've been ingesting boatloads of high definition multi-sensory real-time data for quite a few decades now, and I hardly remember any of it. Perhaps the average quality/diversity of LLM training data has been higher, but they sure remember a hell of a lot more of it than I ever could.
The LLM has plenty of experts and approaches etc.
Give it tool access, let it formulate its own experiments, etc.
The only question here is if it becomes a / the singularity because of this, gets stuck in some local minimum or achieves random perfection and random local minimum locations.
There is an uncountably large number of models that perfectly replicate the data they're trained on; some generalize out of distribution much better. Something like dreaming might be a form of regularization: experimenting with simpler structures that perform equally well on training data but generalize better (e.g. by discovering simple algorithms that reproduce the data equally well as pure memorization but require simpler neural circuits than the memorizing circuits).
Once you have those better generalizing circuits, you can generate data that not only matches the input data in quality but potentially exceeds it, if the priors built into the learning algorithm match the real world.
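To make that concrete, here is a toy sketch (numpy only, not from any paper): fit the same noisy training points with an unregularized high-degree polynomial and a ridge-regularized one. Both match the training data, but the "simpler" regularized fit holds up much better just outside the training range.

```python
# Toy illustration of "many models fit the data, the simpler one generalizes":
# compare a memorizing polynomial fit with a ridge-regularized one.
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 15)
y_train = np.sin(3 * x_train) + 0.05 * rng.normal(size=15)

def features(x, degree=14):
    # Degree-14 polynomial features: enough capacity to memorize 15 points.
    return np.vander(x, degree + 1)

A = features(x_train)
w_memorize = np.linalg.lstsq(A, y_train, rcond=None)[0]      # no regularization
lam = 1e-3
w_simple = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y_train)  # ridge

x_test = np.linspace(-1.2, 1.2, 200)   # slightly outside the training range
y_test = np.sin(3 * x_test)
for name, w in [("memorizing fit", w_memorize), ("regularized fit", w_simple)]:
    train_mse = np.mean((features(x_train) @ w - y_train) ** 2)
    test_mse = np.mean((features(x_test) @ w - y_test) ** 2)
    print(f"{name}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```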
We have truly reached peak hackernews here.
I.e. if the simulation has enough videos of firefighters breaking glass where it seems to drop instantaneously and in the world sim it always breaks, a firefighter robot might get into a problem when confronted with unbreakable glass, as it expects it to break as always, leading to a loop of trying to shatter the glass instead of performing another action.
He seems too enthusiastic to me, to the point that I feel Google asked him in particular because they trusted him to write very positively.
I've been thinking about this a while and it's obvious to me:
Put Minecraft (or something similar) under the hood. You just need data structures to encode the world and to enable mutation, location, and persistence.
If the model is given additional parameters such as a "world mesh", then it can easily persist where things are, what color or texture they should be, etc.
That data structure or server can be running independently on CPU-bound processes. Genie or whatever "world model" you have is just your renderer.
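A rough sketch of that split, with all names hypothetical (there is no such API today): the authoritative world state lives in ordinary CPU-side data structures, and a Genie-like model would only be asked to turn that state plus a camera into pixels.

```python
# Sketch of the proposed split (hypothetical, not a real API): world state is
# plain data on the CPU; the generative model is only used as a renderer.
from dataclasses import dataclass, field

@dataclass
class Block:
    position: tuple         # (x, y, z) voxel coordinates
    material: str           # e.g. "mud", "stone"
    texture_hint: str = ""  # optional prompt fragment for the renderer

@dataclass
class WorldState:
    blocks: dict = field(default_factory=dict)    # (x, y, z) -> Block
    players: dict = field(default_factory=dict)   # player_id -> (x, y, z)

    def set_block(self, pos, material):
        self.blocks[pos] = Block(position=pos, material=material)

    def dig(self, pos):
        # Mutation and persistence are handled here, not by the model.
        self.blocks.pop(pos, None)

def frame_prompt(state: WorldState, camera_pos, camera_dir) -> str:
    """Serialize nearby state into conditioning for a neural renderer."""
    nearby = [b for b in state.blocks.values()
              if sum((a - c) ** 2 for a, c in zip(b.position, camera_pos)) < 32 ** 2]
    return (f"camera at {camera_pos} facing {camera_dir}; blocks: "
            + "; ".join(f"{b.material}@{b.position}" for b in nearby))

# A hypothetical world_model.render(frame_prompt(...)) would produce the frame;
# everything else stays deterministic, cheap to persist, and shareable.
```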
It probably won't happen like this due to monopolistic forces, but a nice future might be a future where you could hot swap renderers between providers yet still be playing the same game as your friends - just with different looks and feels. Experiencing the world differently all at the same time. (It'll probably be winner take all, sadly, or several independent vertical silos.)
If I were Tim Sweeny at Epic Games, I'd immediately drop all work on Unreal Engine and start looking into this tech. Because this is going to shore them up on both the gaming and film fronts.
I think in this context, it could be amazing for game creation.
I’d imagine you would provide item descriptions to vibe-code objects and behavior scripts, set up some initial world state (maps) populated with objects made of objects - hierarchically vibe-modeled - make a few renderings to give inspirational world-feel and textures, and vibe-tune the world until you had the look and feel you want. Then once the textures, models, and world were finalised, it would be used as the rendering context.
I think this is a place where there are enough feedback loops and supervision that, with decent tools along these lines, you could 100x the efficiency of game development.
It would blow up the game industry, but also spawn a million independent one or two person studios producing some really imaginative niche experiences that could be much, much more expansive (like a AAA title) than the typical indie-studio product.
> It would blow up the game industry, but also spawn a million independent one or two person studios producing some really imaginative niche experiences that could be much, much more expansive (like a AAA title) than the typical indie-studio product.
All video games become Minecraft / Roblox / VRChat. You don't need AAA studios. People can make and share their own games with friends.
Scary realization: YouTube becomes YouGame and Google wins the Internet forever.
I've seen Roblox's creative tools, even their GenAI tools, but they're bolted on. It's the steam powered horse problem.
https://en.wikipedia.org/wiki/Manufacturing_Consent#:~:text=...
https://www.goodreads.com/book/show/12617.Manufacturing_Cons...
Though this is often associated with his and Herman's "Propaganda Model," Chomsky has also commented that the same appears in scholarly literature, despite the overt propaganda forces of ownership and advertisement being absent:
https://en.wikipedia.org/wiki/Propaganda_model#:~:text=Choms...
The lead in to the quote starts at https://youtu.be/GjENnyQupow?t=662
"I don't say you're self-censoring - I'm sure you believe everything you're saying; but what I'm saying is, if you believed something different, you wouldn't be sitting where you're sitting." -- Noam Chomksy to Andrew Marr
& you're basically seeing GPT-3 and saying it will never be used in any serious application.. the rate of improvement in their model is insane
Use the CPU and RAM for world state, then pass it off to the model to render.
Regardless of how this is done, Unreal Engine with all of its bells and whistles is toast. That C++ pile of engineering won't outdo something this flexible.
I think this puts Epic Games, Nintendo, and the whole lot into a very tough spot if this tech takes off.
I don't see how Unreal Engine, with its voluminous and labyrinthine tomes of impenetrable legacy C++ code, survives this. Unreal Engine is a mess, gamers are unhappy about it, and it's a PITA to develop with. I certainly hate working with it.
The Innovator's Dilemma is fast approaching the entire gaming industry, and they don't even see it coming; it's happening that fast.
Exciting that building games could become as easy as having the idea itself. I'm imagining something like VRChat or Roblox or Fortnite, but where new things are simply spoken into existence.
It's absolutely terrifying that Google has this much power.
This is 100% going to happen on-device. It's just a matter of time.
Maybe just as a kind of DLSS on steroids, where the engine only renders very simple objects and a world model translates these into the actual graphics.
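A minimal sketch of that "DLSS on steroids" loop, with hypothetical functions standing in for both sides: the engine keeps doing geometry and game logic, and a neural image-to-image pass is bolted onto the end of each frame.

```python
# Sketch of neural "final rendering" over a cheap engine pass (hypothetical).
import numpy as np

def rasterize_proxy(world_state, camera, res=(720, 1280)):
    """Engine side: cheap flat-shaded proxy render plus auxiliary buffers."""
    return {
        "color": np.zeros((*res, 3), dtype=np.uint8),  # flat-shaded proxies
        "depth": np.zeros(res, dtype=np.float32),       # depth buffer
        "ids":   np.zeros(res, dtype=np.int32),         # per-pixel object ids
    }

def enhance(buffers, style_prompt, prev_frame=None):
    """Stand-in for the neural pass that turns proxy buffers into final pixels.
    Conditioning on the previous output frame is one way to keep it stable."""
    return buffers["color"]

def game_loop(world_state, camera, n_frames=3):
    prev = None
    for _ in range(n_frames):
        buffers = rasterize_proxy(world_state, camera)
        prev = enhance(buffers, "overgrown ruins at dusk", prev_frame=prev)
    return prev
```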
https://kylekukshtel.com/diffusion-aaa-gamedev-doom-minecraf...
But even when I wrote that I thought things were still a few years out. I facetiously said that Rockstar would be nerd-sniped on GTA6 by a world model, which sounded crazy a few months ago. But seeing the progress already made since GameNGen and knowing GTA6 is still a year away... maybe it will actually happen.
I'm starting to think some of the names behind LLMs/GenAI are cover names for aliens and any actual humans involved have signed an NDA that comes with millions of dollars and a death warrant if disobeyed.
I'm having trouble parsing your meaning here.
GTA isn't really a "drive on the street simulator", is it? There is deliberate creative and artistic vision that makes the series so enjoyable to play even decades after release, despite the graphics quality becoming more dated every year by AAA standards.
Are you saying someone would "vibe model" a GTAish clone with modern graphics that would overtake the actual GTA6 in popularity? That seems extremely unlikely to me.
I despise the creative and artistic vision of GTA online, but I’m clearly in a minority there gauging by how much money they’ve made off it.
Anyways, crafting pretty-looking worlds is one thing, but you still need to fill them in with something worth doing, and that's something we haven't really figured out. That's one of the reasons why the sandbox MMORPG was developed as opposed to "themeparks". The underlying systems, the backend, are the real meat here. At most, the world models right now replace 3D artists and animators, but I would not say that is a real bottleneck in relation to one's own limitations.
Not for video games it isn’t.
I for one would love a video game where you're playing in a psychedelic, dream-like fugue.
It is not currently, or in the near term, realistic to make a video game where a meaningful portion of the simulation is part of the model.
There will probably be a few interactive model-first experiences. But they’ll be popular as short novelties, not as meaningful or long experiences.
A simple question to consider is how you would adjust a set of simple tunables in a model-first simulator. For example, giving the player more health, making enemies deal 2x damage, increasing move speed, etc. You cannot.
Reality is not composed of words, syntax, and semantics. A human modality is.
Other human modalities are sensory only, no language.
So vision learning and energy models that capture the energy to achieve a visual, audio, physical robotics behavior are the only real goal.
Software is for those who read the manual with their new NES game. Where are the words inside us?
Statistical physics of energy to make a machine draw the glyphs of language, not opinionated clustering of language, is what will close the keyboard and mouse input loop. We're, like, replicating human work habits. Those are real physical behaviors, not just descriptions in words.
https://www.theguardian.com/technology/2025/aug/05/google-st...
Gemini Robot launch 4 mo ago:
good writers will remain scarce though.
maybe we will have personalized movies written entirely through A.I
Actually that game felt a lot like these videos, because often you would turn around and then look back and the game had deleted the NPCs and generated new ones, etc.
I think some people want to play, and some want to experience, in different proportions. Tetris is the emanation of pure gameplay, but then you have to remember "Colossal Cave Adventure" is even older than Tetris. So there's a long history of both approaches, and for one of them, these models could be helpful.
Not that it matters. Until the models land in the hands of indie developers for long enough for them to prove their usefulness, no large developer will be willing to take on the risks involved in shipping things that have the slightest possibility of generating "wrong" content. So, the AI in games is still a long way off, I think.
You must be young. As people get older they (usually) care less about that.
Yes.
> No most people are testing skills in video games
That's not mutually exclusive with playing for scenery.
Games, like all art, have different communities that enjoy them for different reasons. Some people do not want their skills tested at all by a game. Some people want the maximum skill testing. Some want to experience novel fantasy places, some people want to experience real places. Some people want to tell complex weaving narratives, some people want to optimize logistics.
A game like Flower is absolutely a game about looking at pretty scenery and not one about testing skill.
At the limit, if you could stay engaged you would be an expert in pretty much anything.
"It doesn't help students understand why things happened, and what the consequences were and how they have impacted the rest of history of the modern world." I would say the opposite, let's recreate each step in that historical journey so you can see exactly what the concequenses were, exactly why they happened and when.
That's an insane product right there just waiting to happen. Too bad Google sleeps so hard on the tech they create.
You'd have some "please wait in this lobby space while we generate the universe" moments, but those are easy to hide with clever design.
But once you can get N cameras looking at the same world-state, you can make them N players, or a player with 2 eyes.
Really great work though, impressive to see.
1. You can see fine textures "jump" every 4 frames - which means they're most likely using a 4x-temporal-downscaling VAE with at least 4-frame interaction latency (unless the VAE is also control-conditional). Unfortunately I didn't see any real-time footage to confirm the latency (at one point they intercut screen recordings with "fingers on keyboard" b-roll? hmm).
2. There's some 16x16 spatial blocking during fast motion which could mean 16x16 spatial downscaling in the VAE. Combined with 1, this would mean 24x1280x720/(4x16x16) = 21,600 tokens per second, or around 1.3 million tokens per minute (arithmetic spelled out in the snippet below).
3. The first frame of each clip looks a bit sharper and less videogamey than later stationary frames, which suggests this could be a combination of a text-to-image + image-to-world system (where the t2i system is trained on general data but the i2w system is finetuned on game data with labeled controls). Noticeable in e.g. the dirt/textures in [2]. I still noticed some trend towards more contrast/saturation over time, but it's not as bad as in other autoregressive video models I've seen.
[1] https://x.com/demishassabis/status/1940248521111961988
[2] https://deepmind.google/api/blob/website/media/genie_environ...
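Spelling out the token-rate arithmetic from point 2, under the assumed 4x temporal and 16x16 spatial downscaling at 24 fps, 1280x720 (none of this is confirmed by DeepMind):

```python
fps, width, height = 24, 1280, 720
temporal_ds, spatial_ds = 4, 16

tokens_per_second = fps * width * height // (temporal_ds * spatial_ds * spatial_ds)
print(tokens_per_second)       # 21600
print(tokens_per_second * 60)  # 1296000, i.e. ~1.3M tokens per minute
```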
[1] https://x.com/holynski_/status/1952756737800651144
[2] https://togelius.blogspot.com/2025/08/genie-3-and-future-of-...
so better than Stadia?
But if you believe reality is a simulation, why would these “efficient” world-generation methods convince you of anything? The tech our reality would have to be running on is still inconceivable science fiction.
Not like this we haven't. This is convincing because I can have any of you close your eyes and imagine a world where pink rabbits hand out parking tickets. We're a neurolink away from going from thought > to prompt > to fantasy.
To add: our reality does not have to be rendered in its entirety; we'll just have very convincing and unscripted first-person view simulations. Only what you look at is getting rendered (e.g. tiny structures only get rendered when you use a microscope).
Obviously, none of these are super viable given the low accuracy and steerability of world models out today, but positive applications for this kind of tech do exist.
Also (I'm speculating now instead of restating the video), I think pretty soon someone will hook up a real time version of this to a voice model, and we will get some kind of interactive voice + keyboard (or VR) lucid dream experience.
/s
It's a nice step towards gains in embodied AI. Good work, DeepMind.
Sora was described very similarly to this, as a "world simulator", but ultimately that never materialized.
This one is a bit more hopeful from the videos though.
Creativity is being taken from us at an exponential rate. And I don't buy the argument from people who say they are excited to live in this age. I could get that if this technology stopped at its current state and remained just a tool for our creative endeavours, but that doesn't seem to be the endgame here. Instead it aims to be a complete replacement.
Granted, you can say "you can still play musical instruments/paint pictures/etc. for yourself", but I don't think there was ever a period of time when creative works were created just for their own sake rather than for sharing them with others en masse.
So what is the final state here for us? A return to menial, not-yet-automated work? And when that is eventually automated, what's left? Plugging our brains into personalized autogenerated worlds tailored to trigger the relevant neuronal circuitry, producing ever-increasing dopamine levels until our brains finally burn out (which is arguably already happening with TikTok-style leisure)? And how are you supposed to pay for that if all work is automated? How is the economics of that supposed to work?
Looks like a pretty decent explanation of the Fermi paradox: no one would know how the technology works, there are no easily available resources left to make use of simpler tech, and the planet is littered to the point of no return.
How to even find the value in living given all of that?
With business as usual capital is power and capital is increasingly getting centralized.
Work is a fundamental part of society and will never be eliminated, regardless of its utility/usefulness. The caste/class system determines the type of work. The amount (time) of work is set, as it was discovered that additional leisure and reducing work do not improve individuals' happiness.
1. Universal Basic Income as we're on the way to a post-scarcity society. Unlikely to actually happen due to greed.
2. We take inspiration from the French Revolution and then return to a simpler time.
Luigi Mangione has shown that all it takes is one person in the right time and place to remove some evil from the world.
Greed makes no sense in a truly post-scarcity society. There is no scarcity that would let you take from another in a zero-sum way.
Status is the real issue. Humans use status to select sexually, and the display is both competitive and comparative. It doesn't matter absolutely how many pants you have, only that you have more and better than your competition.
I actually think this thing is baked into our DNA and until sex itself is saturated (if there is such a thing), or DNA is altered, we will continue to have a however subtle form of competition undergirding all interactions.
>Vote for me and we'll hand free money to everyone and the robots will do the work
The problem at the moment is that the robots doing the work don't exist. Things will change when they do.
It's not. We will be replaced, but the AI will carry on.
a lot of these comments border on cult thinking. it's a fucking text to 3D image model, not R Daneel Olivaw, calm down
I'll concede that it might take even longer to get full artificial human capabilities (robust, self-repairing, self-replicating, adaptable), but the writing is on the wall.
Even the very best case that I see (non-malicious AI with a soft practical ceiling not too far beyond human capabilities) poses giant challenges for our whole society, in resource allocation alone (because people, as workers, become practically worthless, undermining our whole system completely).
I don't want to live in a world where these things are generated cheaply and easily for the profit of a very select few group of people.
I know the world doesn't work like I described in the top paragraph. But it's a lot closer to it than the bottom.
There will be two classes of media:
- Generated, consumed en-masse by uncreative, uninspired individuals looking for cheap thrill
- Human created, consumed by discerning individuals seeking out real human talent and expression. Valuing it based merely on the knowledge that a biological brain produced (or helped produce) it.
I tend to suspect that the latter will grow in value, not diminish, as time progresses
People said the world could literally end if we trained anything bigger than GPT-4... I would take these projections with a handful of salt.
There’s no bright line between computer and human-created video - computer tools are used everywhere.
Rewarded how? 99.99% of people who do things like sports or artistic like writing never get "rewarded for doing so", at least in the way I imagine you mean the phrase. The reward is usually the experience itself. When someone picks up a ball or an instrument, they don't do so for some material reward.
Why should anyone be rewarded materially for something like this? Why are you so hung up on the <0.001% who can actually make some money now having to enjoy the activity more as a hobby than as a profession?
Why am I so "hung up" on the livelihood of these people?
Doing art as a hobby is a good in and of itself. I did not say otherwise. But when I see a movie, when I listen to a song, I want to appreciate the integrity and talent of the people who wrote them. I want them to get paid for that enjoyment. I don't think that's bizarre.
That world has only existed for the last hundred or so years, and the talent is usually brutally exploited by people whose main talent is parasitism. Only a tiny percentage of people who sell creative works can make a living out of it; the living to be made is in buying their works at a premium, bundling them, and reselling them, while offloading almost all of the risk to the creative as an "advance."
Then you're left in a situation where both the buyer of art and the creator of art are desperate to pander to the largest audience possible because everybody is leveraged. It's a dogshit world that creates dogshit art.
A better example would be Spotify replacing artist-made music recommendations with low-quality alternatives to reduce what it pays to artists. Everyone except Spotify loses in this scenario.
The future with AI is not going to be our current world with some parts replaced by AI. It will be a whole new way of life.
Water cooler talk about what happened this week in M.A.S.H. or Friends is extinct.
Worse, in the long run even community may be synthesized. If a friend is meat or if they're silicon (or even carbon fiber!), does it matter if you can't tell the difference? It might to pre-modern boomers like me and you.
Virtual influencers might be a big thing, Hatsune Miku has lots of fans. But it's still a shared fandom.
For example, robot boxing: https://www.youtube.com/watch?v=rdkwjs_g83w
Most commercial artists are very much unknown, in the background. This is a different situation from sport
But it might also go the way of pottery, glass-making and weaving. They’re still around but extremely niche.
Numerous famous writers, painters, artists, etc counter this idea, Kafka being a notable example, whose significant works only came to light after his passing and against his will. This doesn't take away from the rest of your discussion point, but art always has and always will also exist solely for its own sake.
Synthetic data can be useful up to a certain point, but you can’t expect to have a better model on synthetic data alone indefinitely.
GDM's moat here is YouTube. They have a bazillion gameplay and whatever videos. But here it is.
The downside I can see is that most people will stop publishing content online for free, since these companies have absolutely no respect whatsoever for the humans who created the data they use.
- Because you enjoy it
- Because you get pats in the back from people you share it with
- Because you want to earn money from it
The 1st one will continue to be true in this dystopian AI-art future; the others, not so much.
And sincerely, I find that kind of human art, the kind that comes from a pure inner force, the more interesting one.
EDIT: list formatting
No it won’t, you’ll be too busy trying to survive off of what pittance is left for you to have any time to waste on leisure activities.
My only hope is that we could have created 100k nukes of monstrous yields but collectively decided not to. We instead created 10k smaller ones. We could have destroyed ourselves long ago but managed to avoid it.
If humans are not stretched to their limits, and are still able to be creative, then the tools will help us find our way through this infinite space.
AI will never be able to generate everything for us, because that means it will need infinite computation.
Humans have demonstrated time and again, even things beyond our experience can be explored by us; quantum mechanics for example. Humans find a way to map very complex subjects to our own experience using analogy. Maybe AI can help us go further by allowing us to do this on even more complex ideas.
Edit: left the page open for a while before responding, and the other person responded with basically the same thing within that time.
Similar to how synths meant we no longer need to play instruments by plucking strings, it hasn’t affected the higher-level creativity of creating music, only expanded it.
I can understand it's very interesting from a researcher's point-of-view (I'm a software dev who's worked adjacent to some ML researchers doing pipeline stuff to integrate models into software), but at the same time: Where are the robots to do menial work like clean toilets, kitchens, homes, etc?
I assume the funding isn't there? Or maybe working out algorithms for the best way to clean toilets is much less exciting to research than diffusion networks for image generation :)
also the billionaires have help so they don't give a shit if the menial stuff is automated or not. throw in a little misogyny by and large too; I saw a LinkedIn Lunatic in the wild (some C-level) saying laundry is already automated because laundry machines exist
fucking.. tell me you don't ever do the laundry without telling me. That guy's poor wife.
I wonder how advanced world models like Genie 3 would change the approach, if at all.
I sit and play guitar by myself all the time, I play for nobody but myself, and I enjoy it a lot. Your argument is absurd.
Kids do it all the time.
> So what is final state here for us?
Something I haven't seen discussed too much is taste - human tastes change based on what has come before. What we will care about tomorrow is not what we care about today.
It seems plausible to me that generative AI could get higher and higher quality without really touching how human tastes changes. That would leave a lot of room for human creativity IMO - we have shared experience in a changing world that seems very hard to capture with data.
And even so, music production has been a constant evolution of replacing prior technologies and making it easier to get into. It used to be gatekept by expensive hardware.
I wonder if mental exercises will move to the same category? Not necessarily a way to earn money, but something everybody does as a way of flourishing as a human.
Nothing can take away your ability to have incredible experiences, except if the robots kill us all.
For too long has humanity been collectively submerged in this hyper-consumption of the arts. We, our parents, and our grandparents have been getting bombarded by some form or other of artificial dopamine sweets - from videos to reels to xeets to "news" to ads to tunes to mainstream media - every second of the day, every single day. The kind of media consumption we have every day is something our forefathers would have been overwhelmed by within an hour. It is not natural.
This complete cheapening of the arts is finally giving us a chance to shed off this load for good.
"Nothing human makes it out of the near-future."
With UBI, probably. With a central government formed by our robot overlords. But why even pay us at that point?
Wow. What a picture! Here's an optimistic take, fwiw: Whenever we have had a paradigm shift in our ability to process information, we have grappled with it by shifting to higher-level tasks.
We tend to "invent" new work as we grapple with the technology. The job of a UX designer did not exist in 1970s (at least not as a separate category employing 1000s of people; now I want to be careful this is HN, so there might be someone on here who was doing that in the 70s!).
And there is capitalism -- if everyone has access to the best-in-class model, then no one has true edge in a competition. That is not a state that capitalism likes. The economics _will_ ultimately kick in. We just need this recent S-curve to settle for a bit.
I think we have a long way to go yet. Humanity is still in the early stages of its tech tree with so many unknown and unsolved problems. If ASI does happen and solves literally everything, we will be in a position that is completely alien to what we have right now.
> How to even find the value in living given all of that?
I feel like a lot of AI angst comes from people who place their self-worth and value on external validation. There is value in simply existing and doing what you want to do even if nobody else wants it.
We can use these to create entire virtual worlds, games, software that incorporates these, and to incorporate creativity and media into infinitely more situations in real life.
We can create massive installations that are not a single image but an endless video with endless music, and then our hand turns to stabilizing and styling and aestheticizing those exactly in line with our (the artist's) preferences.
Romanticizing the idea that picking at a guitar is somehow 'more creative' than using a DAW to create incredibly complex and layered and beautiful music is the same thing that's happening here, even if the primitives seem 'scarier' and 'bigger'.
Plus, there are many situations in life that would be made infinitely more human by the introduction of our collective work in designing our aesthetic and putting it into the world, and encoding it into models. Installations and physical spaces can absolutely be more beautiful if we can produce more, taking the aesthetic(s) that we've built so far and making them dynamic to spaces.
Also for learning: as a young person learning to draw and sing and play music and so many other things, I would have tremendously appreciated the ability to generate and follow subtle, personalized generation - to take a photo of a scene in front of me and have the AI first sketch it loosely so that I can copy it, then escalate and escalate until I can do something bigger.
What argument is required for excitement? Excitement is a feeling not a rational act. It comes from optimism and imagination. There is no argument for optimism. There is often little reason in imagination.
> How to even find the value in living given all of that?
You might have heard of the Bhagavad Gita, a 2000+ year old spiritual text. It details a conversation between a warrior prince and a manifestation of God. The warrior prince is facing a very difficult battle and he is having doubts justifying any action in the face of the decisions he has to make. He is begging this manifestation of God to give him good reasons to act, good reasons not just to throw his weapons down, give away all his possessions and sit in a cave somewhere.
There are no definite answers in the text, just meditations on the question. Why should we act when the result is ultimately pointless, we will all die, people will forget you, situations will be resolved with or without you, etc.
This isn't some new question that LLMs are forcing us to confront. LLMs are just providing us a new reason to ask the same age-old questions we have been facing for as long as writing has existed.
You don't think there was ever a time without a mass media culture? Plenty of people have furniture older than mass media culture. Even 20 years ago people could manage to be creative for a tiny audience of what were possibly other people doing creative things. It's only the zoomers who have never lived in a world where you never thought to consider how you could sell the song you were writing in your bedroom to the Chinese market.
It used to be that music didn't come on piano rolls, records, tapes, CDs or files. It used to be that your daughter would play music on the piano in the living room for the entire family. Even if it was music that wouldn't really sell, and wasn't perfectly played, people somehow managed to enjoy it. It was not a situation that AI could destroy. If anything, AI could assist.
If your value in living is in any way affected by AI, ever, then, well, let's just say I would never choose that for myself. Good luck.
There's a whole host of "art" that has been created by people - sometimes for themselves, sometimes for a select few friends - which had little purpose beyond that creation[1]. Some people create art because they simply have to create art - for pleasure, for therapy, for whatever[2]. For many, the act of creation was far more important than the act of distribution[3].
For me, my obsession is constructing worlds, maps, societies and languages that will almost certainly die with me. And that's fine. When I feel the compulsion, I'll work on my constructions for a while, until the compulsion passes - just as I have done (on and off) for the past 50 years. If the world really needs to know about me, then it can learn more than it probably wants to know through my poetry.
[1] - Emily Dickinson is an obvious example: https://en.wikipedia.org/wiki/Emily_Dickinson
[2] - Coral Castle, Florida: https://en.wikipedia.org/wiki/Coral_Castle
[3] - Federico Garcia Lorca almost certainly didn't write his Sonetos del amor oscuro for publication - he just needed to write them: https://es.wikisource.org/wiki/Sonetos_del_amor_oscuro
People still value Amish furniture or woodworking despite Ikea existing. I love that if I want a cheap chair made of cardboard and glue that I can find something to satisfy that need; but I still buy nice furniture when I can.
AI creations are analogous. I've seen some cool AI stuff, but it definitely doesn't replace the real "organic" art one finds.
These fears aren't realized if AI never achieves superhuman performance, but what if they do?
(2) AI has already achieved superhuman performance in breadth and, with tuning, depth.
My only hope is this: I think the depression is telling us something real, we are collectively mourning what we see as the loss of our humanity and our meaning. We are resilient creatures though, and hopefully just like the ozone layer, junk food, and even the increasing rejections of social media and screen time, we will navigate it and reclaim what’s important to us. It might take some pain first though.
Yes, AI can make music that sounds decent and lyrics that rhyme and can even be clever. But listen to a couple songs and your brain quickly spots the patterns. Maybe AI gets there some day, but the uncanny valley seems to be quite a chasm - and anything that approaches the other side seems to do so by piling lots of human intention along the way.
The main challenge over the next decade as all our media channels are flooded with generated media will become curation. We desperately need ways to filter human-created content from generated content. Not just for the sake of preserving art, but for avoiding societal collapse from disinformation, which is a much more direct and closer threat. Hell, we've been living with the consequences of mass disinformation for the past decade, but automated and much more believable campaigns flooding our communication platforms will drastically lower the signal-to-noise ratio. We're currently unable to even imagine the consequences of that, and are far from being prepared for it.
This tech needs strict regulation on a global scale. Anyone against this is either personally invested in it, or is ignorant of its dangers.
The way I see it, most people aren't creative. And the people who are creatives are mostly creating for the love of it. Most books that are published are read exclusively by the friends and family of the author. Most musicians, most stand-up comedians, most artist get to show off their works for small groups of people and make no money doing so. But they do it anyway. I draw terrible portraits, make little inventions and sometimes I build something for the home, knowing full well that I do these things for my own enjoyment and whatever ego boost I get from showing these things off to people I know.
I'm doing a marathon later and I've been working my ass off for the prospect of crossing the finishing line as number four thousand and something, and I'll do it again next year.
Or kittens and puppies. Do you think there won't be kittens and puppies?
And that's putting aside all the obvious space-exploration stuff that will probably be more interesting than anything the previous 100 billion humans ever saw.
The merge. (https://blog.samaltman.com/the-merge)
I'm quite enthusiastic. I've always thought mortality sucks.
Nothing is being taken away.
Till then, I just learn the tools with the deepest understanding that I can muster and so far the deeper I go, the less impressed with "automated everything" I become, because it isn't really going to be capable of doing anything people are going to find interesting when the creativity well dries up.
Additionally, video seems like a pretty straightforward output shape to me - a 2D image with a time component. If we were talking 3D assets and animations, I wouldn't even know where to start with modeling that as input data for training. That seems really hard to model as a fixed-input-size problem to me.
If there was comparable 3D data available for training, I'd guess that we'd see different issues with different approaches.
A couple of examples that I could think of quickly: Using these to build games, might be easier if we could interact with the underlying "assets". Getting photorealistic results with intricate detail (e.g. hair, vegetation) might be easier with video based solutions.
There’s absolutely no reason that a game needs to be generated frame-by-frame like this. It seems like a deeply unserious approach to making games.
(My feeling is that it must be easier to train this way.)
3D model rendering would be useful however for interfacing with robots.
In VR, for example, the same 3D scene will be rendered twice, once for each eye, from two viewpoints 10-15cm apart.
If you don’t have an internal 3D representation of the world, the AI would need to generate exactly the same scene from a very slightly different perspective for each eye, without any discrepancies or artefacts.
And that’s not even discussing physics, collisions or any form of consistent world logic that happens off-screen. Or multiplayer!
No need to explore; I can tell you how. Release the weights to the general public so that everyone can play with it and non-Google researchers can build their work upon it.
Of course this isn't going to happen because "safety". Even telling us how many parameters this model has is "unsafe".
While I don't fully align with the sentiment of other commenters that this is meaningless unless you can go hands-on... it is crazy to think how different this announcement is from a few years ago, when it would have been accompanied by an actual paper that shared the research.
Instead... we get this thing that has a few aspects of a paper - authors, demos, a bibtex citation(!) - but none of the actual research shared.
I was discussing with a friend that my biggest concern with AI right now is not that it isn't capable of doing things... but that we switched from research/academic mode to full value extraction so fast that we are way out over our skis in terms of what is being promised, which, in the realm of exciting new field of academic research is pretty low-stakes all things considered... to being terrifying when we bet policy and economics on it.
To be clear, I am not against commercialization, but the dissonance of this product announcement, made to look like research and written in this way, at the same time that one of the preeminent mathematicians is writing about how our shift in funding of real academic research is having real, serious impact, is... uh... not confidence-inspiring for the long term.
I wonder how much it costs to run something like this.
I feel like as time goes on more and more of these important features are showing up as disconnected proofs of concept. I think eventually we'll have all the pieces and someone will just need to hook them together.
I am more and more convinced that AGI is just going to eventually happen and we'll barely notice because we'll get there inch by inch, with more and more amazing things every day.
I've only ever seen demos of these models where things happen from a first-person or 3rd-person perspective, often in the sort of context where you are controlling some sort of playable avatar. I've never seen a demo where they prompted a model to simulate a forest ecology and it simulated the complex interplay of life.
Hence, it feels like a video game simulator, or put another way, a simulator of a simulator of a world model.
This is a pretty clear example of video game physics at work. In the real world, both the jetski and floating structure would be much more affected by a collision, but in the context of video game physics such an interaction makes sense.
So yeah, it's a video game simulator, not a world simulator.
I don't doubt they're trying to create a world simulator model, I just think they're inadvertently creating a video game simulator model.
It is interesting to think about. This kind of training and model will only capture macro effects. You cannot use this to simulate what happens in a biological cell or tweak a gravity parameter and see how plants grow etc. For a true world model, you'd need to train models that can simulate at microscopic scales as well and then have it all integrated into a bigger model or something.
As an aside, I would love to see something like this for the human body. My belief is that we will only be able to truly solve human health if we have a way of simulating the human body.
While watching the video I was just imagining the $ increasing by the second. But then it's not available at all yet :(
I'm most excited for when these methods will make a meaningful difference in robotics. RL is still not quite there for long-horizon, sparse reward tasks in non-zero-sum environments, even with a perfect simulator; e.g. an assistant which books travel for you. Pay attention to when virtual agents start to really work well as a leading signal for this. Virtual agents are strictly easier than physical ones.
Compounding on that, mismatches between the simulated dynamics and real dynamics make the problem harder (sim2real problem). Although with domain randomization and online corrections (control loop, search) this is less of an issue these days.
Multi-scale effects are also tricky: the characteristic temporal length scale for many actions in robotics can be quite different from the temporal scale of the task (e.g. manipulating ingredients to cook a meal). Locomotion was solved first because it's periodic imo.
Check out PufferAI if you're scale-pilled for RL: just do RL bigger, better, get the basics right. Check out Physical Intelligence for the same in robotics, with a more imitation/offline RL feel.
This feels almost exactly like that, especially the weird/dreamlike quality to it.
I know that everyone always worries about trapping people in a simulation of reality etc. etc. but this would have blown my mind as a child. Even Riven was unbelievable to me. I spent hours in Terragen.
https://extraakt.com/extraakts/google-s-genie-3-capabilities...
Another interesting angle is retrofitting existing 2D content (like videos, images, or even map data) into interactive 3D experiences. Imagine integrating something like this into Google Maps: suddenly Street View becomes a fully explorable 3D simulation generated from just text or limited visual data.
Genie 3 isn’t that though. I don’t think it’s actually intended to be used for games at all.
This is starting to feel pretty **ing exponential.
Are they just multimodal for everything?
Are foundational time series models included in this category?
I think it just outputs image frames...
1. Almost all human-level civilizations go extinct before reaching a technologically mature “posthuman” stage capable of running high-fidelity ancestor simulations.
2. Almost no posthuman civilizations are interested in running simulations of their evolutionary history or beings like their ancestors.
3. We are almost certainly living in a computer simulation.
I think some/all of these things can be roughly true at the same time. Imagine an infinite space full of chaotic noise from which arises a solitary Boltzmann brain, the top-level universe and top-level intelligence. This brain, seeking purpose and company in the void, dreams of itself in various situations (lower-level universes), and some of those universes' societies seek to improve themselves through deliberate construction of karmic-cycle ancestor simulations. A hierarchy of self-similar universes.
It was incredibly comforting to me to think that perhaps the reason my fellow human beings are so poor at empathy, inclusion, justice, is that this is a karmic kindergarten where we're intended to be learning these skills (and the consequences for failing to perform them) and so of course we're bad at it, it's why we're here.
Why would beings in simulations be conscious?
Or maybe running simulations is really expensive and so it's done sometimes (more than "almost none") but only sometimes (nowhere near "we are almost certainly").
Or simulations are common but limited? You don't need to simulate a universe if all you want to do is simulate a city.
The "trilemma" is an extreme example of black-and-white thinking. In the real world, things cost resources and so there are tradeoffs -- so middle grounds are the rule, not extremes.
In game engines it's the engineers, the software developers who make sure triangles are at the perfect location, mapping to the correct pixels, but this here, this is now like a drawing made by a computer, frame by frame, with no triangles computed.
That's not the point, video games are worth chump-change compared to robotics. Training AIs on real-world robotic arms scaled poorly, so they're looking for paths that leverage what AI scales well at.
Eg: Using AI to generate textures, wire models, motion sequences which themselves sum up to something that local graphics card can then render into a scene.
I'm very much not an expert in this space, but to me it seems if you do that, then you can tweak the wire model, the texture, move the camera to wherever you want in the scene etc.
The model can infinitely zoom in to some surface and depict(/predict) what would really be there. Trying to do so via classical rendering introduces many technical challenges
So for example, a game designer might tell the AI the floor is made of mud, but won’t tell the AI what it looks like if the player decides to dig a 10 ft hole in the mud, or how difficult it is to dig, or what the mud sounds like when thrown out of the hole, or what a certain NPC might say when thrown down the hole, etc.
This is already happening to some extent, some games struggle to reach 60 FPS at 4K resolution with maximum graphics settings using traditional rasterization alone, so technologies like DLSS 3 frame generation are used to improve performance.
From my best guess: it's a video generation model like the ones we already had, but they condition on inputs (movement direction, view angle). Perhaps they aren't relative inputs but absolute, and there is a bit of state simulation going on? [although some demo videos show physics interactions like bumping against objects - so that might be unlikely, or maybe it's 2D and the up axis is generated??]
It's clearly trained on a game engine as I can see screenspace reflection artefacts being learned. They also train on photoscans/splats... some non realistic elements look significantly lower fidelity too..
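To make the guess above concrete, here is a minimal sketch of what an action-conditioned autoregressive loop could look like (purely speculative, with a dummy function standing in for the real network):

```python
# Speculative sketch: an autoregressive video model conditioned on per-step
# control inputs (movement direction, view angle). Dummy model only.
import numpy as np

H, W = 720, 1280
CONTEXT = 16  # how many past frames the model would condition on

def predict_next_frame(frame_history, action):
    """Stand-in for the real network: frame history + one action in, frame out.
    A real system would also carry latent state between steps."""
    return np.clip(frame_history[-1] + np.random.randint(-2, 3, (H, W, 3)), 0, 255)

frames = [np.zeros((H, W, 3), dtype=np.int32)]  # e.g. decoded from a text-to-image prompt
actions = [{"move": "forward", "yaw": 0.0},
           {"move": "forward", "yaw": 5.0},
           {"move": "none",    "yaw": 5.0}]     # absolute vs. relative is the open question

for action in actions:
    frames.append(predict_next_frame(frames[-CONTEXT:], action))
```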
some inconsistencies I have noticed in the demo videos:
- wingsuit disocclusions are lower fidelity (maybe initialized by a high-resolution image?)
- garden demo has different "geometry" for each variation, look at the 2nd hose only existing in one version (new "geometry" is made up when first looked at, not beforehand).
- school demo has half a car outside the window? and a suspiciously repeating pattern (infinite loop patterns are common in transformer models that lack parameters, so they can scale this even more! also might be greedy sampling for stability)
- museum scene has an odd reflection in the amethyst box: the rear mammoth doesn't have reflections on the rightmost side of the box before it's shown through the box. The tusk reflection just pops in. This isn't a Fresnel effect.
Our product was a virtual 3d world made up of satellite data. Think of a very quick, higher-res version of google earth, but the most important bit was that you uploaded a GPS track and it re-created the world around that space. The camera was always focused on the target, so it wasn't a first person point of view, which, for the most part, our brains aren't very good at understanding over an extended period of time.
For those curious about the use case, our product was used by every paraglider in the world, commercial drone operations, transportation infrastructure sales/planning, and outdoor event promotions (specifically bike and ultramarathon races).
Though I suspect we will see a new form of media come from this. I don't pretend to suggest exactly what this media will be, but mixing this with your photos we can see the potential for an infinitely re-framable and zoomable type of photo media.
Creating any "watchable" content will be challenging if the camera is not target focused, and it makes it difficult to create a storyline if you can't dictate where the viewer is pointed.
What this means is that a robot model could be trained 1000x faster on GPUs compared to training a robot in the physical world where normal spacetime constraints apply.