We accidentally solved robotics by watching 1M hours of YouTube

https://ksagar.bearblog.dev/vjepa/

209•alexcos•7mo ago

Comments

okdood64•7mo ago

Does YouTube allow massive scraping like this in their ToS?

dangoodmanUT•7mo ago

What ToS

bobmcnamara•7mo ago

https://www.youtube.com/static?template=terms ?

mouse_•7mo ago

Probably not.

Who cares at this point? No one is stopping ML sets from being primarily pirated. The current power is effectively dismantling copyright for AI related work.

perching_aix•7mo ago

> The current power is effectively dismantling copyright for AI related work.

Out of the loop apparently, could you elaborate? By "the current power" I take you mean the current US administration?

bgwalter•7mo ago

Trump fired the head of the copyright office:

https://www.heise.de/en/news/After-criticism-of-AI-training-...

The "Big Beautiful Bill" contains a clause that prohibits state "AI" legislation.

Trump has a "Crypto and AI czar" who is very active in promoting "AI" on his YouTube propaganda outlet. The same czar also promoted, pre-election of course, accelerated peace with Russia and then stopped talking about the subject altogether.

perching_aix•7mo ago

Oh wow okay, genuinely missed these. Thanks.

snickerdoodle12•7mo ago

> Who cares at this point

Anyone who has a shred of integrity. I'm not a fan of overreaching copyright laws, but they've been strictly enforced for years now. Decades, even. They've ruined many lives, like how they killed Aaron Swartz.

But now, suddenly, violating copyright is totally okay and carries no consequences whatsoever because the billionaires decided that's how they can get richer now?

If you want to even try to pretend you don't live in a plutocracy and that the rule of law matters at all these developments should concern you.

shadowgovt•7mo ago

Aaron Swartz died of suicide, not copyright.

His death was a tragedy but it wasn't done to him.

marcus_holmes•7mo ago

There's an English phrase "hounded to death", meaning that someone was pursued and hassled until they died. It doesn't specify the cause of death, but I think the assumption would be suicide, since you can't actually die of fatigue.

I think that's what was done to Aaron Swartz.

shadowgovt•7mo ago

Many people have dealt with the law, with copyright infringement, even with gross amounts of it, and had the book thrown at them, and survived the experience.

Swartz was ill. It is a tragedy he did not survive the experience, and indeed, trial is very stressful. But he was no more hounded than any defendant who comes under federal scrutiny and has to defend themselves in a court of law via the trial system. Kevin Mitnick spent a year in prison (first incarceration) and survived it. Swartz was offered six months and committed suicide.

I don't know how much of the system we should change to protect the Aaron Swartzes of the world; that's the mother of all Chesterton's Fences.

shiroiuma•7mo ago

Maybe someone should throw you in prison for a year on some BS made-up charges to see how well you survive it. We can use it as a data point for your argument.

marcus_holmes•7mo ago

Many people get (for example) pneumonia and recover. Some people get pneumonia and die. The people who died of pneumonia died because of pneumonia. The fact that other people survived it doesn't mean that they didn't die of it.

Saying that we should not work on cures for pneumonia because it's a Chesterton Fence is obviously, blatantly, illogical. Saying that we should not change the system so that government officials working for moneyed interests can't hound someone to death is similarly illogical.

shadowgovt•7mo ago

Pneumonia doesn't have any societal benefit. The process by which we decide if the law was broken and punishment necessary has obvious benefit. If you mean we should seek a cure for dangerous suicidal depression, I agree. But you surely are not suggesting that, for example, had Swartz been accused of embezzlement, the state should drop the charges purely because he's a suicide risk; how would that be just to the people who were stolen from?

And it's a point of semantics, but no; we generally don't say people who died by suicide died by the things going on in their life when they ended it. Everybody has stressors. The suicidal also have mental illness. Mr. Swartz had self-documented his past suicidal ideation.

marcus_holmes•7mo ago

We evolved with pneumonia for some reason. It could easily be a Chesterton Fence. We don't treat this as one because we don't want people to die of it.

I agree that a system of laws has benefit to society. However the system we've worked out for making such laws is clearly being warped and twisted to serve one small section of society at the expense of everyone else.

A clear case being the comment that started this conversation - Swartz was hounded to death for doing the exact same thing that AI companies are doing and they're facing zero punishment. AI executives are not being dragged from their offices by burly policemen and thrown into cells, yet they have done the exact same thing that Swartz did to merit that behaviour. It's not unreasonable to question the societal benefit of this system.

And we totally should say that people died of depression, or financial stress, or legal persecution, or whatever. Most people have suicidal ideation at some point in their lives, that's not unusual. Being hassled to the point where you go through with it is definitely violence. Classing this as "mental illness" and therefore a personality defect is a form of victim blaming.

shadowgovt•7mo ago

> We evolved with pneumonia for some reason. It could easily be a Chesterton Fence.

It's not, and I don't think you're seriously arguing this point so I'm going to ignore it.

It is, I think, a reasonable observation that had Swartz formed an LLC to pursue advanced analysis of academic papers for, I don't know, trends in the language used in research and slurped a bunch of JSTOR for that purpose, the trial would have taken longer and involved more lawyers. That's probably an observation that should give us pause. Or not, because nobody argued that's what he did or that was his intent, including him. So I also think the premise of comparison to the current circumstances is flawed; I don't think the CFAA can be applied in a context where people have access rights and go through Google's front door to scan videos for the purpose of training a machine learning algorithm. It might be a TOS violation. It's not hiding a server in a closet with unauthorized physical access, which is what Swartz was accused of.

Intent matters, and, sadly, we never got to the trial where intent could have been proven out.

> Being hassled to the point where you go through with it is definitely violence.

The government does have the monopoly on violence. But I think what happened to Swartz is a far cry from that, as he never got to sentencing, much less trial. There was some light compulsion (requirement to appear in court), of course. But everyone who's ever wanted to contest a parking ticket has to experience that. Sadly, this train of thought goes into a station of "Swartz should have been under professional care if his condition was this much a danger to him," and I don't know how the government should change its behavior if he wasn't. Prosecutors are not prognosticators of the mental health of defendants, and I've never read anywhere that Swartz wanted to be committed for mental illness.

Our system is much harder for defendants grappling with mental illness; I'll acknowledge and argue for change regarding that. I don't know that such change would conclude with "Swartz should never have been accused of committing a crime that a lot of evidence suggests he committed," however.

marcus_holmes•7mo ago

All good points, thanks for the constructive reply.

Your point that Swartz would have had a different result had he formed an LLC, and hired a bunch of lawyers, is definitely the key point here. A legal system that only works for the rich and powerful is not something we should defend, support, or put up with.

His purpose in copying research papers and making them available for free is massively more in the public interest than anything the AI companies are doing. They are, after all, seeking to make a profit at the end of this. And they knowingly and deliberately broke copyright law because it was "too hard" to make any kind of licensing deal with the publishers. You can argue about fair use and transformative purposes (as their lawyers have done), but you can also argue from Swartz's point of view that this information was (to a large extent) publicly funded and therefore belonged to the public, and trying to get the journals to acknowledge that is "too hard". And had he been able to afford lawyers, that's a possible line they could have taken. But he didn't get the chance. As you say, we never got to the trial so we will never know.

It's definitely not a stretch to say that his crime and the AI companies' crimes (which they admit to - they admit to downloading source texts from pirate sites) are comparable, even equivalent. Yet their treatment is not.

My understanding of his treatment is that it was a lot more than "light compulsion" and that he underwent a sustained campaign of enforcement activity and litigation at the hands of a specific prosecutor. But given that the AI companies have had nothing - no criminal charges - just a civil case brought by the authors they admit to ripping off, then I don't think I need to push this point. They are clearly being treated differently to him, despite the similar actions.

shadowgovt•7mo ago

We haven't gotten to the part of the trial for Anthropic yet where we determine whether they actually broke the law when they downloaded from pirate sites. Copyright has multiple exceptions. And on the topic at hand here (training on YouTube videos to understand space and relationships in it), I don't think even Google would want to make the case that it's a violation of copyright.

That's the thing about copyright; it's a whole category of law more based in utility than morality. One of the reasons AI is such a fight right now is that nobody was opposing it as an academic project when it was generating, for example, tools that could go from an image to describing the image, or from an image to recognizing the likely artistic style and helping somebody find the original artist. But with just a few tweaks those tools became devices for generating novel images, and now people are upset. Intent matters.

And again, you are drawing equivalence between harvesting data from openly accessible sources online and hiding a server in a closet with unauthorized physical access to a network. Swartz's prosecution wasn't accusing him of copyright violation; it was accusing him of compromising a network. A far more serious charge; if the researchers in the story here had collected those YouTube videos by wiretapping the fiber optics between two of Google's data centers I suspect they would have concerns.

snickerdoodle12•7mo ago

Crimes generally don't kill the criminal. It's the reaction by authorities that kills (perceived) criminals.

shadowgovt•7mo ago

This is true. In general, the harm done by crime is directed outwards from the perpetrator, not inwards to the perpetrator. In fact, the behaviors that only cause self-harm that we criminalize are relatively few.

snickerdoodle12•7mo ago

So however you want to twist it, he was killed by the government.

shadowgovt•7mo ago

Sorry, I don't see how you arrive there from the fact-pattern. He wasn't a criminal because he never had a trial. He killed himself before he even had a hearing on whether the prosecution's evidence was admissible, much less his opportunity to either prove his innocence or argue the acts he undertook shouldn't by rights be a crime at all.

What should the government (executive or judicial) have done differently to balance the needs of the accused vs. the needs of the enforcement and adjudication of the law here?

snickerdoodle12•7mo ago

The government killed him by threatening insane punishment for something that is practically harmless, and relevant to the original point, is done without a second thought now by the bigcorps to feed their AIs

shadowgovt•7mo ago

Prosecutors do that all the time. Basically nobody dies of it. I'd humbly propose there were unfortunate mitigating circumstances in Mr Swartz's situation that made it unusual. When a person with AIDS dies, did the AIDS kill them or the pneumonia a regular body would have fought off? When a person with deep mental illness commits suicide, did the circumstances of their life kill them or did they succumb to a deep mental illness?

Perhaps we could craft a way to hold people with mental health issues to the same standards we are all held to while simultaneously being more sensitive to their needs. But in general, his story is an unfortunate tragedy of a sick person who took their own life under a stress that doesn't kill most other people, and we adjust the way we prosecute crime at our own peril. It is, as I said elsewhere, the mother of all Chesterton's Fences. Which is not to say it cannot or should not be improved! Only that it be done with great care.

And to be completely clear: Swartz ripped content via back-dooring a secured network physically, in a closet, and (it is alleged) planned to dump that content in public. We'll never really know since he (or his illness) denied himself his day in court, and that's a tragedy; he may have successfully defended himself, or could have been a living example of persevering anyway like Mitnick instead of a martyr. Companies using their authorized accounts to scrape Google are likely at most guilty of a TOS violation and Google may choose to cut their accounts, but it's very hard to make a case that the Google API saying, over and over again, "Yes you may view that video" constitutes either unauthorized access or exceeding the bounds of access under 9-48.000.

It's hard to comment on whether Swartz violated the CFAA. Since he wasn't tried, we'll never really know. He exited life before justice could happen one way or the other.

mouse_•7mo ago

> If you want to even try to pretend you don't live in a plutocracy and that the rule of law matters at all

Can't even pretend anymore, this season jumped the shark

jagged-chisel•7mo ago

> … like how they killed Aaron Swartz.

I can’t imagine why you’d let the FBI off the hook

MaxPock•7mo ago

They don't, and neither do I allow my site (whose content I found on Gemini) to be scraped

klysm•7mo ago

I don't think they can legally prevent it

perching_aix•7mo ago

My "lawyer" (gpt4o) claims that since YouTube is merely a non-exclusive licensee of the user content uploaded to their service, even if they have such restrictions in their ToS (they do), they likely would not hold up in court, citing [0]. Something about that non-exclusivity meaning they cannot constrain the copyright further on their own terms. Which I guess makes sense?

And since scraping of publicly available data is not illegal (in the US, according to the aforementioned "lawyer"), it seems like it's okay?

Not legal advice.

[0] https://www.skadden.com/insights/publications/2024/05/distri...

nerdsniper•7mo ago

Per HiQ vs. LinkedIn, it doesn't matter what their ToS says if the scraper didn't have to agree to the ToS to scrape the data. YouTube will serve videos to someone who isn't logged in. So if you've never agreed to YouTube's ToS, you can scrape the videos. If YT forced everyone to log in before they could watch a video, then anyone who wants to scrape videos would have had to agree to the ToS at some point.

olyjohn•7mo ago

It won't serve me videos if I'm not logged in. It tells me to sign in to prove I'm not a bot. How do these people get around this?

nerdsniper•7mo ago

It does for me, in USA on AT&T Fiber using Safari in private browsing mode. Chrome in incognito as well. And phone on mobile YouTube (though I didn't test with uninstalling/reinstalling to reset IDFA and IDFV, so it's not really a valid test)

rzzzt•7mo ago

Friendly unit conversion man at your service: 114 years.

isoprophlex•7mo ago

How much is that in football fields?

forks•7mo ago

If you accept 30 years as the average lifespan of an nfl stadium, 3.8
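
For anyone who wants to check the two figures above (1M hours as roughly 114 years, and roughly 3.8 stadium lifespans), here is a minimal sketch of the arithmetic. The 365.25-day year is an assumption on my part, and the 30-year stadium lifespan is simply the figure proposed in the comment; the function names are made up for illustration.

```python
# Assumption: one year = 365.25 days (average Julian year).
HOURS_PER_YEAR = 24 * 365.25


def hours_to_years(hours: float) -> float:
    """Convert a duration in hours to years of continuous playback."""
    return hours / HOURS_PER_YEAR


def years_to_stadium_lifespans(years: float, lifespan_years: float = 30.0) -> float:
    """Express a span of years in stadium lifespans (30 years per the comment)."""
    return years / lifespan_years


years = hours_to_years(1_000_000)
stadiums = years_to_stadium_lifespans(years)
print(f"{years:.0f} years, {stadiums:.1f} stadium lifespans")
```

Using a 365-day year instead shifts the result by less than a tenth of a year, so both rounded figures hold either way.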

washadjeffmad•7mo ago

Good catch. Approximately 9,192,631 Turkish decibels.

rzzzt•7mo ago

Fun fact: the International Bureau of Weights and Measures in Paris is the owner of a perfect 0 dB noise floor enclosed in a perfect titanium sphere (with some sheep's wool filling to avoid reflections). There is a small door on the side over which microphone capsules can be inserted for calibration.

(/joke)

klaff•7mo ago

Too bad the joke doesn't work if you understand decibels.

ReptileMan•7mo ago

So a half zoom meeting... or 1/3 Teams one.

perching_aix•7mo ago

I genuinely wish there was a cost estimation feature built into them. It doesn't even have to be remotely close to the true cost; if it's anything like the meetings I attend, there will be enough people and it will go on for long enough to make up for it.

ReptileMan•7mo ago

I worked as a consultant. And started billing at normal hourly rates for meetings. You will be surprised how fast the company's desire for my participation in them decreased.

hobs•7mo ago

Why would you do anything but that? You want to just chat with me forever the rate is the rate.

contingencies•7mo ago

This is interesting for generalized problems ("make me a sandwich") but not useful for most real world functions ("perform x within y space at z cost/speed"). I think the number of people on the humanoid bandwagon trying to implement generalized applications is staggering right now. The physics tells you they will never be as fast as purpose-built devices, nor as small, nor as cheap. That's not to say there's zero value there, but really we're - uh - grasping at straws...

foobarian•7mo ago

I wonder if a generalized machine would have an advantage from scale, and then putting all the specialized stuff into software. We have seen this play out before.

ahmedbaracat•7mo ago

Well, there’s a middle ground, kinda. Using more specialized hardware (ex: cobots) but deploying state-of-the-art Physical AI (ML/Computer Vision) on them. We’re building one such startup at ko-br (https://ko-br.com/) :))

contingencies•7mo ago

Quite a few startups in your space. Many deployed with customers. Good luck finding a USP!

jjangkke•7mo ago

Very good point! This area faces a similar misalignment of goals in that it tries to be a generic fit-all solution that is rampant with today's LLMs.

We made a sandwich but it cost you 10x more than it would a human and slower might slowly become faster and more efficient but by the time you get really good at it, its simply not transferable unless the model is genuinely able to make the leap across into other domains that humans naturally do.

I'm afraid this is where the barrier of general intelligence and human intelligence lies and with enough of these geospatial motor skill databases, we might get something that mimics humans very well but still run into problems at the edge, and this last mile problem really is a hindrance to so many domains where we come close but never complete.

I wonder if this will change with some sort of computing changes as well as how we interface with digital systems (without mouse or keyboard), then this might be able to close that 'last mile gap'.

esjeon•7mo ago

Note that the username here is a Korean derogatory term for Chinese people.

jcrawfordor•7mo ago

It's an interesting comment, it has the same "compliment the OP, elaborate, raise a further question" format I've seen used by apparently LLM-generated spam accounts on HN. But, the second paragraph is so incoherently structured that I have a hard time thinking an LLM produced it.

jes5199•7mo ago

analogy: a CPU is more expensive, more complicated, more energy demanding than custom made circuitry, in most cases.

xyzzy123•7mo ago

As the vendor you can sell it with the promise that awesomeness is coming "just around the corner" with the next software update.

You can also seek investment without committing to an actual concrete business model.

dotancohen•7mo ago

The value is in the generalisation.

For a single example, in any factory watch how humans are added as ad-hoc machines wherever a problem occurs. Machine N outputting faster than machine N+1 can accept? Have a human stack, and destack, the product between them. No matter the size, shape, or within reason the weight of the product. But most importantly: the process can begin within seconds of the problem occurring. No need for a programmer, developer, or maintenance worker to get involved. Just a clear order from the shift manager.

A general purpose robot with physical interfaces similar to a human would be very valuable for such environments. If it had the software to be as easy to instruct as a human.

contingencies•7mo ago

Your assumption set: conventional factory space, idle humans, traditional management, ad-hoc process with skilled managers. This is similar to the "job shop" mentality in (dying) manufacturing. You additionally assume general purpose magic hardware that can usefully do anything.

Reality: Most value is in shrinking things, excluding humans, automating management, carefully designed process, and specialist hardware that does a subset of things very well. Relying on human(oid)s is a sure-fire way to suck.

dotancohen•7mo ago

Correct, I'm talking about the 98% of factories in the world today and in the near future. Obviously the far future will see changes in manufacturing, just as manufacturing has seen changes every decade since we've been manufacturing things at scale.

imranq•7mo ago

This was a bit hard to read. It would be good to have a narrative structure and a clearer explanation of concepts.

signal-intel•7mo ago

Very intentional. Their response would be: “if you need narrative structure and clear explanation of concepts, yngmi”.

YeGoblynQueenne•7mo ago

And the answer to that would be: WNGTI.

https://www.youtube.com/watch?v=4xmckWVPRaI

Capitalia tantum.

Aurornis•7mo ago

> This was a bit hard to read.

This writing style is prominent on Twitter and niche Discords. It's funny how much I've come to be able to cut right through it, but if you haven't seen much of it it's really hard to parse. That's by design, too. The vibe of this writing style is to project an air of confidence so strong that the author doesn't care if you get it or not. It's a sort of humblebrag where the writing is supposed to flex the author's understanding of the subject while also not caring if you get it or not.

As others have already covered, there's also some heavy stretching of the truth and rewriting of history going on in this post. That's also common of the extreme bravado in this style of semi-impenetrable writing: The vagueness and ambiguities allow the author to make grandiose claims but then wiggle out of them later if someone is astute enough to catch on.

For example: The blog post is written as “We…” but is the author part of the team? Or is he using “we” meaning society in general?

Pyxl101•7mo ago

What's the point in writing something while "not caring" if the reader understands or not? Seems like a false confidence or false bravado to me; it reads like an attempt to project an impression, and not really an attempt to communicate.

dotancohen•7mo ago

This style of writing is very effective at convincing people in their impressionable years of a narrative or viewpoint, often one that is hard to defend with more traditional writing styles.

I hope I'm wrong, but this looks like an effort to normalize such writing style. As this happens, intelligent discourse and rhetoric become harder.

Aurornis•7mo ago

Basically: If you understand the topic well, you’re not the target audience.

This is a type of information arbitrage where someone samples something intellectual without fully understanding it, then writes about it for a less technical audience. Their goal is to appear to be the expert on the topic, which translates into clout, social media follows, and eventually they hope job opportunities.

The primary goal of the writing isn’t to get you to understand the topic clearly, because that would diminish the sense that the author is more knowledgeable than you. The goal is to sound guru-like while making the topic feel impenetrably complex for you, while appearing playfully casual for the author.

dclowd9901•7mo ago

I guess "bullshitting as a career" isn't going away any time soon.

dclowd9901•7mo ago

It would also be good if the perspective of the article would stay put. This "we" and "they" thing was at best confusing and at worst possibly a way to get more clicks or pretend the author had something to do with the work.

richard___•7mo ago

Solved??? Where?

chihuahua•7mo ago

Yeah, wake me up when they have a robot that can wash, peel, cut fruit and vegetables; unwrap, cut, cook meat; measure salt and spices; whip cream; knead and shape dough; and clean up the resulting mess from all of these. Then they will have "solved" part of robotics.

YeGoblynQueenne•7mo ago

>> Yeah, wake me up when they have a robot that can wash, peel, cut fruit and vegetables; unwrap, cut, cook meat; measure salt and spices; whip cream; knead and shape dough; and clean up the resulting mess from all of these.

Someone's getting peckish :P

pr337h4m•7mo ago

IMO, VideoMimic is a better proof-of-concept

https://www.videomimic.net/

https://www.videomimic.net/page1.html

Keyframe•7mo ago

Looks like it was trained on Shaolin Drunken Fist videos. Does it look drunk because of the videos or because there's a discrepancy between videos and it not accounting for gravity and physics in general?

jdmichal•7mo ago

My guess would be lack of actuators. For instance, this robot looks like it has an ankle that can only go up and down, but not roll like a human's. Also, I wonder if there's a center of gravity issue, as it almost always appears to be leaning backwards to even out.

I think it's still pretty impressive in its recoveries, even though there's an unnaturally large number of them necessary. About 8 seconds into the video on the homepage, it almost misses and ends up slipping off the second step. I've eaten shit at missing a couple inch curb, though I don't think "graceful" has ever been used as a descriptor for me. So the fact that it just recovers and keeps going without issue is impressive to me.

namibj•7mo ago

> So the fact that it just recovers and keeps going without issue is impressive to me.

I'm pretty sure that's just a matter of reaction speed and it maintaining a constant focus/vigilance on its movement that you'd usually not reserve outside of some sports and situations pre-identified as deserving the attention due to danger, like concentrating on balance and not getting into a position that overstresses your joints when you know it's icy.

throwaway198846•7mo ago

I wonder how much language this model understands. If we pan across text, will it fill in a sensible next word? How good will it be?

ErrorNoBrain•7mo ago

Someone watched 'Devs' ?

If you haven't - highly recommended.

andruby•7mo ago

Do you have a link or a less generic search term?

VladVladikoff•7mo ago

It’s a TV show made by Alex Garland https://m.imdb.com/title/tt8134186/ It’s pretty good sci fi IMHO

root_axis•7mo ago

Not sure why people love this show. Really terrible writing.

dmix•7mo ago

Love Alex Garland but the characters ruin the show.

hahaxdxd123•7mo ago

Extremely oversold article.

> the core insight: predict in representation space, not pixels

We've been doing this since 2014? Not only that, others have been doing it at a similar scale. e.g. Nvidia's world foundation models (although those are generative).

> zero-shot generalization (aka the money shot)

This is easily beaten by flow-matching imitation learning models like what Pi has.

> accidentally solved robotics

They're doing 65% success on very simple tasks.

The research is good. This article however misses a lot of other work in the literature. I would recommend you don't read it as an authoritative source.

accidentallfact•7mo ago

https://news.ycombinator.com/item?id=44073183

sailingparrot•7mo ago

This article contains so many falsehoods and history rewrites that it's pretty painful to read.

rozab•7mo ago

I just wrote a reply to a comment talking about the AI tells this writing has, but it got flagged so my comment disappeared when I hit post. I'll rephrase out of spite:

My first thought upon reading this was that an LLM had been instructed to add a pithy meme joke to each paragraph. They don't make sense in context, and while some terminally online people do speak in memes, those people aren't quoting doge in 2025.

There's also a sense of incoherence in the whole piece. For instance, this section:

"- after: 22 million videos + 1 million images (now we're talking)

they basically hoovered up everything: something-something v2, kinetics, howto100m, and a billion youtube videos"

Was it a billion vids or 22m? It turns out the latter sentence is just rephrasing the list of sources in a cool casual way, and the last one is called YT-Temporal-1B. That's a billion frames of video, not a billion videos.

billstar•7mo ago

Also, the author of the blog "Ksagar Atharva" doesn't appear anywhere in the list of authors on the linked FB research paper with Yann LeCun as a co-author. Unless the blog author is using a heavily modified pseudonym.

The research is very real but the blog post appears to be very fake.

xdfgh1112•7mo ago

It's someone explaining the research as a blog essay right? Which is very commonly done. We=humanity

Kiro•7mo ago

Exactly. It's very obvious what "we" is referring to here.

SV_BubbleTime•7mo ago

> some terminally online people do speak in memes, those people aren't quoting doge in 2025.

You may be surprised to find out how incorrect this is.

I can think of two popular conservative sites likely to quote Doge people off hand that do this. I read all news in order not to be an insufferable ideologue. So again, off the top of my head, NotTheBee (I think affiliated to BabylonBee (conservative The Onion)) and Twitchy. Among YouTubers, I think Asmongold, and I’m sure others like Steven Crowder who himself is in a famous meme.

That said… yea, you are probably right.

tomrod•7mo ago

Aren't those sites primarily Russian bots tho?

mlinhares•7mo ago

Isn’t that just a synonym for conservative?

mid-kid•7mo ago

Not conservative but I used to love the meme before it was co-opted by musk, so I will occasionally use it as a "haha now you feel OLD" without thinking of its modern connotations.

pjerem•7mo ago

Also I think it’s somehow important to not let fascism steal our cultural heritage, even if it’s just a meme.

In my country, far righters are displaying the country’s flag everywhere. Now you can’t display a French flag without being thought of as a far right person. That’s honestly insufferable.

I know it’s less important with doge but still : before being a crypto it was just a picture of an overly innocent and enthusiastic dog. And even when it became a little crypto, it was totally assumed that it was a meme coin and wasn’t meant for speculation, the idea was that 1DOGE = 1DOGE only and people gifted them to other people who made nice contributions on the internet.

Musk broke all of this when he started to use it to do gigantic pumps and dumps using his own visibility on Twitter.

We don’t have to let fascism steal all the popular symbols / memes, because they will steal them anyway.

foxglacier•7mo ago

Let's see you try to recover the swastika from fascism ;)

bazhova•7mo ago

They are referring to the original doge meme of the dog, not the government initiative today. I guess "quote" isn't really the right word, more like "doing"

YeGoblynQueenne•7mo ago

A reoccurring mistake in this thread. I blame Elon Musk and his boomer humor.

HeartStrings•7mo ago

Yeah, obviously LLM written. They tried to be unique by removing capitals.

Thorrez•7mo ago

>those people aren't quoting doge in 2025

Could you explain what this means? Is this article quoting doge?

debugnik•7mo ago

There was a clear attempt at the doge meme format, yes:

> very scientific. much engineering.

Emphasis on attempt because you're supposed to use words with grammatically incorrect modifiers, and the first one doesn't. (Even the second one doesn't seem entirely incorrect to me? I'm not a native speaker though.) "many scientific, so engineering" for example would have worked.

I assume they, or most likely their LLM, tried too hard to follow the most popular sequence (very, much, wow) and failed at it.

shubb•7mo ago

"Much engineering was required" Archaic but still used a bit in articles or to give a certain vibe.

jojobas•7mo ago

You'd think it would be easy to write "very engineering, much scientific". LLMs work in mysterious ways.

wincy•7mo ago

I don’t know, 400k people are listening to the White House streaming lo-fi hip hop on X right now with cutesy videos of Trump on one side and his executive orders streaming on the other at 4am. I think there’s plenty of people quoting doge in 2025.

If you’re in the US, you likely work with them and they have learned to studiously avoid talking about politics except in vagaries to avoid conflict.

bazhova•7mo ago

they are referring to doge the dog meme, not the government initiative. The meme is much older and wouldn't be considered "cool" to use by the same people who write in the style of the article. Which indicates it was written by an LLM, because usually only things like ChatGPT throw in such cringe, out of date memes in an otherwise obnoxiously 2025 article

roveo•7mo ago

I'm using eigenrobot's (X user) prompt for ChatGPT and the style is very recognizable. Everything lowercase, tone, zoomer abbreviations, esoteric style of jokes.

bjornarv•7mo ago

yup

YeGoblynQueenne•7mo ago

>> They don't make sense in context, and while some terminally online people do speak in memes, those people aren't quoting doge in 2025.

Cringely, they are. Nobody who isn't desperate to appear cool would write in that terminally grating register, including when using an LLM to do the writing.

daft_pink•7mo ago

My mom said I was throwing away my life watching YouTube all day and clearly I just haven’t been watching YouTube enough. 1 million YouTube videos here I come!

october8140•7mo ago

I was unable to make it through the article (now we're talking).

dimatura•7mo ago

"why didn't we think of this sooner?", asks the article. Not sure who the "we" is supposed to be, but the robotics community has definitely thought of this before. https://robo-affordances.github.io/ from 2023 is one pretty relevant example that comes to mind, but I have recollections of similar ideas going back to at least 2016 or so (many of which are cited in the V-JEPA2 paper). If you think data-driven approaches are a good idea for manipulation, then the idea of trying to use Youtube as a source of data (an extremely popular data source in computer vision for the past decade) isn't exactly a huge leap. Of course, the "how" is the hard part, for all sorts of reasons. And the "how" is what makes this paper (and prior research in the area) interesting.

a_t48•7mo ago

I definitely saw somebody at Actuate last year talking about supplementing training videos for VLA with Youtube, but I think they actually found that "any" video of the real world helped give a better physics "understanding" to the model.

canyp•7mo ago

I don't know. I'm not the expert, but if you've ever tried to do a backflip or anything where your toes are above your head, then you'll know that spatial awareness goes well beyond vision. Or if you throw a frisbee for the dog to catch, they don't actually look at it while running; they look, predict position, then move in. Veni, vidi, vici. So any model that "learns physics" just through vision seems flawed from the start. What's your thought there?

dchftcs•7mo ago

Pure vision will never be enough because it does not contain information about the physical feedback like pressure and touch, or the strength required to perform a task.

For example, so that you don't crush a human when doing massage (but still need to press hard), or apply the right amount of force (and finesse?) to skin a fish fillet without cutting the skin itself.

Practically in the near term, it's hard to sample from failure examples with videos on Youtube, such as when food spills out of the pot accidentally. Studying simple tasks through the happy path makes it hard to get the robot to figure out how to do something until it succeeds, which can appear even in relatively simple jobs like shuffling garbage.

With that said, I suppose a robot can be made to practice in real life after learning something from vision.

namibj•7mo ago

If the robot already knows "how to" the happy path, the training difficulty falls severely at least if it can continue after a recovery.

dchftcs•7mo ago

The tasks you do to recover from the failure are often different from the happy path. For example, the happy path of dumping garbage is carrying a garbage bag to a collection bin. The non-happy path is that the bin is overflowing and you have to put the bag on the ground, or if the bag leaks and you need to move to a new bag, or if the bag breaks entirely and you have to pick up the trash again.

But yeah, I think a better way to put it is that sampling the happy path would indeed make the failure case easier, but sampling just happy paths is far from sufficient for completing even some of the simplest human tasks with failure.

rocqua•7mo ago

On humans, you can generally see the force they apply by looking at strain.

dchftcs•7mo ago

The error margins will be huge, and for small enough force (like the skinning part or handling fine mechanical stuff) there's basically almost zero signal.

carlosdp•7mo ago

> Pure vision will never be enough because it does not contain information about the physical feedback like pressure and touch, or the strength required to perform a task.

I'm not sure that's necessarily true for a lot of tasks.

A good way to measure this in your head is this:

"If you were given remote control of two robot arms, and just one camera to look through, how many different tasks do you think you could complete successfully?"

When you start thinking about it, you realize there are a lot of things you could do with just the arms and one camera, because you as a human have really good intuition about the world.

It therefore follows that robots should be able to learn with just RGB images too! Counterexamples would be things like grabbing an egg without crushing, perhaps. Though I suspect that could also be done with just vision.

jpc0•7mo ago

I think you vastly underestimate how difficult the task you are proposing would be without depth or pressure indication, even for a super intelligence like humans.

Simple concept, pick up a glass and pour its content into a vertical hole the approximate size of your mouth. Think of all the failure modes that can be triggered in the trivial example you do multiple times a day, to do the same from a single camera feed with no other indicators would take you hours to master and you already are a super intelligent being.

stavros•7mo ago

If I have to pour water into my mouth, you can bet it's going all over my shirt. That's not how we drink.

jpc0•7mo ago

Except this is the absolutely most common thing humans do, and my argument is not that it will spill water all over but rather that it will shatter numerous glasses, knock them over etc all before it has picked up the glass.

The same process will be repeated many times trying to move the glass to its “face” and then when either variable changes, plastic vs glass, size, shape, location and all bets are off purely because there just plainly isn't enough information.

jrimbault•7mo ago

A routine gesture I've done everyday for almost all my life: getting a glass out of the shelves and into my left hand. It seems like a no brainer, I open the cabinet with my left hand, take the glass with my right hand, throw the glass from my right hand to the left hand while closing the cabinet with my shoulder. Put the glass under the faucet with left hand, open the faucet with the right hand.

I have done this 3 seconds gesture, and variations of it, my whole life basically, and never noticed I was throwing the glass from one hand to the other without any visual feedback.

gregmac•7mo ago

And you're used to the weight of the glass, which you instantly recognize when you pick it up. If it was a different weight than you were expecting, you'd probably slow down and be more deliberate.

If you were to just do the exact same robotic "throw" action with a glass of unexpected weight you'd maybe not throw hard enough and miss, or throw too hard and possibly break it.

var_cw•7mo ago

The point is how much non-vision sensors vs pure vision, helps humans to be humans. Don't you think this point was proven by LLMs already that generalizability doesn't come from multi-modality but by scaling a single modality itself? And jepa is for sure designed to do a better job at that than an LLM. So no doubt about raw scaling + RL boost will kick-in highly predictable & specific robotic movements.

datameta•7mo ago

> generalizability doesn't come from multi-modality but by scaling a single modality itself

Could you expand on what you mean by this?

godelski•7mo ago

  > LLMs already that generalizability

This is not a proven statement. In fact, it's pretty clear that they don't. They have some generalization but that's not enough for what you're inferring. The best way to show this is to carefully talk to an LLM about anything you have a lot of domain expertise in. Be careful to not give it answers (information leakage can sneak in subtly) and specifically look for those small subtle details (that's why it needs to be a topic you have expertise in). "The smell" will be right but the information won't.

Also, LLMs these days aren't trained on just language

moefh•7mo ago

> It therefore follows that robots should be able to learn with just RGB images too!

I don't see how that follows. Humans have trained by experimenting with actually manipulating things, not just by vision. It's not clear at all that someone who had gained intuition about the world exclusively by looking at it would have any success with mechanical arms.

amelius•7mo ago

You'd use a two-step approach.

1. First create a model that can evaluate how well a task is going; the YT approach can be used here.

2. Then build a real-world robot, and train it by letting it do tasks, and use the first model to supervise it; here the robot can learn to rely on extra senses such as touch/pressure.
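
A minimal sketch of the two steps above, under toy assumptions. `progress_score` stands in for the video-trained evaluator from step 1 (here it is just a hand-written distance to a goal, not a learned model), and `train_policy` is a hypothetical hill-climbing loop for step 2 that only ever sees that score. All names and numbers are illustrative.

```python
import random

def progress_score(state, goal):
    """Stand-in for step 1: a model that rates how well the task is going.
    Assumption: a real version would be learned from video; this one is
    just the negative distance to the goal."""
    return -abs(goal - state)

def rollout(gain, goal, steps=20):
    """Toy 1-D 'robot': each step moves by gain * (goal - state)."""
    state = 0.0
    for _ in range(steps):
        state += gain * (goal - state)
    return state

def train_policy(goal, iterations=200, seed=0):
    """Step 2: tune a single policy parameter using only the evaluator."""
    rng = random.Random(seed)
    gain = 0.0
    best = progress_score(rollout(gain, goal), goal)
    for _ in range(iterations):
        candidate = gain + rng.uniform(-0.1, 0.1)
        score = progress_score(rollout(candidate, goal), goal)
        if score > best:  # keep only changes the evaluator rates higher
            gain, best = candidate, score
    return gain, best

gain, best = train_policy(goal=1.0)
```

The robot side never sees the goal directly, only the evaluator's score, which is the part where extra senses such as touch could be added without retraining the evaluator.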

godelski•7mo ago

You're agreeing with the parent btw. You've introduced a lot more than just vision. You introduced interventional experimentation. That's a lot more than just observation

amelius•7mo ago

What I describe is an unsupervised system.

What you say ("interventional") sounds like it's human-supervised.

But maybe I'm interpreting it in the wrong way, so please correct me if so.

godelski•7mo ago

By "intervention" I mean interacting with the environment. Propose a hypothesis, test, modify, test. You can frame RL this way though RL usually generates hypotheses that are far too naïve.

This looks like a good brief overview (I only skimmed it but wanted to give you more than "lol, google it") http://smithamilli.com/blog/causal-ladder/

amelius•7mo ago

Yes, you need to let the robot play (interact with the environment) to learn the vision-versus-touch correlations, but you can do so in an unsupervised way (as long as you choose the environment wisely).

jaisio•7mo ago

> When you start thinking about it, you realize there are a lot of things you could do with just the arms and one camera, because you as a human have really good intuition about the world.

And where does this intuition come from? It was built by also feeling other sensations in addition to vision. You learned how gravity pulls things down when you were a kid. How hot/cold feels, how hard/soft feels, how things smell. Your mental model of the world is substantially informed by non-visual clues.

> It therefore follows that robots should be able to learn with just RGB images too!

That does not follow at all! It's not how you learned either.

Neither have you learned to think by consuming the entirety of all text produced on the internet. LLMs therefore don't think, they are just pretty good at faking the appearance of thinking.

suddenlybananas•7mo ago

Humans have innate knowledge that helps them interact with the world and can learn from physical interaction for the rest. RGB images aren't enough.

whatever1•7mo ago

Video games have shown that we can control pretty darn well characters in virtual worlds where we have not experienced their physics. We just look at a 2D monitor and using a joystick/keyboard we manage to figure it out.

suddenlybananas•7mo ago

Yeah but we already have a conception of what physics should be prior to that that helps us enormously. It's not like game designers are coming up with stuff that intentionally breaks our naïve physics.

godelski•7mo ago

I mean they do but we often have generalized (to some degree) world models. So when they do things like change gravity, flip things upside down, or even more egregious changes we can adapt. Because we have contractual counterfactual models. But yeah, they could change things so much that you'd really have to relearn and that could be very very difficult if not impossible (I wonder if anyone has created a playable game with a physics that's impossible for humans to learn, at least without "pen and paper". I think you could do this by putting the game in higher dimensions.)

deadfoxygrandpa•7mo ago

a game has very limited physics. like the buttons you press are pre-tuned to perform certain actions and you arent dealing with continuous nearly infinite possibilities with large ranges of motion, pressure, speed etc. like think about how difficult the game QWOP is because you mostly just have visual feedback

whatever1•7mo ago

I beg to disagree. I got introduced to brand new (to me) physics of flying airplanes by MS flight simulator. None of the rules I knew in real life applied (gravity matters only sometimes, height can be traded for speed etc). Yet I learned how to fly.

And when I took real classes in a real Cessna, this experience was transferable (aka the flying model I had in my brain was very similar to the one I experienced with my full body in the cockpit).

abenga•7mo ago

Humans did not accumulate that intuition just using images. In the example you gave, you subconsciously augment the image information with a lifetime of interacting with the world using all the other senses.

amelius•7mo ago

Yes, without extra information, manipulating everyday objects is probably as intuitive to robots as manipulating quantum scale molecules is for humans.

deadfoxygrandpa•7mo ago

counterpoint: think about all the tasks you could do with your hands and arms while your eyes are closed. i think its really a lot of stuff considering blind people can do the vast majority of things sighted people can do, and i suspect anything you could do with your eyes closed would be extremely difficult to do with a camera feed as the literal only sensory input

corimaith•7mo ago

>"If you were given remote control of two robot arms, and just one camera to look through, how many different tasks do you think you could complete successfully?"

There are an infinite number of scenes that can be matched to one 2d picture. And what is a scene really? The last time I checked, RGB was not a good way of input in Computer Vision and rather relied on increasing levels of gradients via CNNs to build a compositional scene. None of that is particularly translatable to how a LM works with text.

godelski•7mo ago

  > because you as a human have really good intuition about the world.

This is the line that causes your logic to fail.

You introduced knowledge not obtained through observation. In fact, the knowledge you introduced is the whole chimichanga! It is an easy mistake to make, so don't feel embarrassed.

The claim is that one can learn a world model[0] through vision. The parent countered by saying "vision is not enough." Then you countered by saying "vision is enough if you already have a world model."

[0] I'll be more precise here. You can learn *A* world model, but it isn't the one we really care about and "a world" doesn't require being a self consistent world. We could say the same thing about "a physics", but let's be real, when we say "physics" we know which one is being discussed...

godelski•7mo ago

  > Pure vision will never be enough because it does not contain information

Say it louder for those in the back!

But actually there's more to this that makes the problem even harder! Lack of sensors is just the beginning. There's well known results in physics that:

  You cannot create causal models through observation alone.

This is a real pain point for these vision world models and most people I talk to (including a lot at the recent CVPR) just brush this off as "we just care if it works." Guess what?! Everyone that is pointing this out also cares that it works! We need to stop these thought terminating cliches. We're fucking scientists.

Okay, so why isn't observation enough? It's because you can't differentiate alternative but valid hypotheses. You often have to intervene! We're all familiar with this part. You control variables and modify one or a limited set at a time. Experimental physics is no easy task, even for things that sound rather mundane. This is in fact why children and animals play (okay, I'm conjecturing here).

We need to mention chaos here, because it's the easiest way to understand this. There's many famous problems that fall into this category like the double pendulum, 3 Body Problem, or just fucking gas molecules moving around. Let's take the last one. Suppose you are observing some gas molecules moving inside a box. You measure their positions at t0 and at T. Can you predict their trajectories between those time points? Surprisingly, the answer is no. You can only do this statistically. There's probably paths but not deterministic (this same logic is what leads to multiverse theory btw). But now suppose I was watching the molecules too, but I was continuously recording between t0 and T. Can I predict the trajectories? Well, I don't need to, I just write it down.

Now I hear you, you're saying "Godelski, you observed!" But the problem with this set of problems is that if you don't observe the initial state you can't predict moving forwards and if you don't have very precise observation intervals you are hit with the same problem. If you turn around while I start a double pendulum you can have as much time as you want when you turn back around, you won't be able to model its trajectories.
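
A small sketch of that sensitivity to the initial state. A double pendulum needs an ODE integrator, so this swaps in the logistic map x -> 4x(1 - x), a much simpler textbook chaotic system; it is an illustration of the same effect, not a model of the pendulum. The starting value and the size of the measurement error are arbitrary assumptions.

```python
def logistic_trajectory(x0, steps, r=4.0):
    """Iterate x -> r * x * (1 - x), which is chaotic for r = 4."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

a = logistic_trajectory(0.2, 60)
b = logistic_trajectory(0.2 + 1e-10, 60)  # same start, tiny measurement error

gaps = [abs(x - y) for x, y in zip(a, b)]
early_gap = max(gaps[:10])  # the two runs still agree closely at first
late_gap = max(gaps[30:])   # later the gap is as large as the values themselves
```

Each step can stretch the error by at most a factor of 4, so the early gap stays small, but the growth compounds, and within a few dozen steps the two runs have nothing to do with each other.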

But it gets worse still. There are confounding variables. There is coupling. Difficult to differentiate hypotheses via causal ordering. And so so much more. If you ever wonder why physicists do so much math it's because doing that is a fuck ton easier than doing the whole set of testing and then reverse engineering the equations from those observations. But in physics we care about counterfactual statements. In F=ma we can propose new masses and new accelerations and rederive the results. That's what it is all about. Your brain does an amazing job at this too! You need counterfactual modeling to operate in real world environments. You have to be able to ask and answer "what happens if that kid runs into the street?"

I highly suggest people read The Relativity of Wrong [0]. It's a short essay by Isaac Asimov that can serve as a decent intro, though far from complete. I'm suggesting it because I don't want people to confuse "need counterfactual model" with "need the right answer." If you don't get into metaphysics, these results will be baffling.[1] It is also needed to answer any confusion you might have around the aforementioned distinction.

Tldr:

  if you could do it from observation alone, physics would have been solved a thousand years ago

There's a lot of complexity and depth that is easy to miss with the excitement, but it still matters.

I'm just touching the surface here too, and we're just talking about mechanics. No quantum needed, just information loss

[0] https://hermiene.net/essays-trans/relativity_of_wrong.html

[1] maybe this is why there are so few physicists working on the world modeling side of ML. At least, using that phrase...

m3kw9•7mo ago

Solving robotics is some claim.

dfedbeef•7mo ago

Spoiler: not solved

Joel_Mckay•7mo ago

Indeed, the robotics edge-case problem space complexity balloons far faster than most assume.

Physics informed training is a real methodology (simple introduction to the subject: https://www.youtube.com/@Eigensteve/videos ).

However, the slop article is 80% nonsense. =3

m3kw9•7mo ago

Just the hand, you have 50 things you just can’t do unless you have certain feel. Handling glass? Oh it’s greasy, now your rubber grip is screwed, now go wash it off, and dry it to start again.

Joel_Mckay•7mo ago

Indeed, there is a whole category of Adaptive Compliant Grippers for specific use-cases like handling delicate brittle objects.

Things that randomly change shape or appearance are also very difficult to interact with safely. The force sensing platform from Universal Robots is safer for users, but it has limitations like all platforms. =3

amelius•7mo ago

Betteridge not only applies to headlines with questions but it also works quite well with Twitter style headlines.

moneywaters•7mo ago

So video gen models basically can be extrapolated to control robotics ? How long until Veo3 robots take over?

a-dub•7mo ago

i thought all the cool data driven robotics stuff was like reinforcement learning from sensors that track moving effectors in the real world with online retraining that mimics the sensorimotor experimentation that is observed during the developmental phases of real neurobiological systems?

so you just kinda let it run for a while and it bumps and squirms around until it stands up or whatever.

seems also the future for real ai?

hbarka•7mo ago

Dr Fei-Fei Li talks about this as the LWM (Large World Model) during this interview: https://youtu.be/fQGu016AlVo and with https://www.worldlabs.ai/

column•7mo ago

> right now, you need to show the robot pictures of what you want it to do. want it to "clean the kitchen"? better have a photo of a clean kitchen handy.

What about using Flux Kontext (or Controlnets) to turn the messy kitchen into a clean kitchen?

seydor•7mo ago

Sure thing, let me just put the fridge in the washing machine.

weinzierl•7mo ago

I do not know and do not care much about robotics per se, but I wish LLM's were better with spatial reasoning. If the new insight helps with that - great!

I dabbled a bit in geolocation with LLM's recently. It is still surprising to me how good they are with finding the general area a picture was taken. Give it a photo of a random street corner on this earth and it likely will not only tell you the correct city or town but most often even the correct quarter.

On the other hand, if you ask it for a birds eye view of a green, a brown and a white house on the north side of a one-way street (running west to east) east of an intersection running north to south, it may or may not get it right. If you want it to add an arrow going in the direction of the one-way street, it certainly has no clue at all and the result is 50/50.

6510•7mo ago

Put tiny cams on robot arms and let it control them. They can be flimsy for safety. If it is sure something is happening say nothing, if it is 70-99% sure have it guess what is going on, if <70% have it ask what is going on.
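
A minimal sketch of that three-way rule. The function name and the reply strings are made up for illustration; the only things taken from the comment are the 70% and 99% cut-offs.

```python
def respond(confidence, guess):
    """Map the model's confidence in [0, 1] to one of three behaviours."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence > 0.99:
        return None                  # sure: say nothing
    if confidence >= 0.70:
        return f"I think: {guess}"   # 70-99% sure: guess out loud
    return "What is going on?"       # below 70%: ask
```

For example, `respond(0.8, "pouring water")` returns a guess, while `respond(0.3, "pouring water")` returns the question instead.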

trhway•7mo ago

>camera pose sensitivity

>the model is basically a diva about camera positioning. move the camera 10 degrees and suddenly it thinks left is right and up is down.

Reminds that several years ago Tesla had to finally start to explicitly extract 3D model from the net. Similarly i expect here that it would get pipelined - one model extracts/builds 3D, and the other is actually the "robot" working in that 3D. Each one can be alone trained much better and efficiently, with much better transfer and generalization, than the large monolithic model working from the 2D video. In pipeline approach, it is very easy to generate synthetic input 3D data better covering interesting scenario space for the "robot" model.

And, for example, you can't just, without significant training, feed the large monolithic model a lidar point space instead of videos. Whereas in a pipelined approach, you just switch the 3D generating pipeline input model.
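
A toy sketch of that swap, assuming the shared 3D representation is just a list of points. Both front ends and the "robot" stage are hypothetical stand-ins (the video one fakes depth as 1.0); the point is only that the downstream stage does not change when the input source does.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float, float]
Scene = List[Point]  # assumed shared 3D representation: a list of points

def scene_from_video(frames: List[List[Tuple[float, float]]]) -> Scene:
    """Hypothetical front end: lifts 2D detections to 3D with a fake depth of 1.0."""
    return [(x, y, 1.0) for frame in frames for (x, y) in frame]

def scene_from_lidar(points: List[Point]) -> Scene:
    """Hypothetical front end: lidar already yields 3D points."""
    return list(points)

def nearest_point(scene: Scene) -> Point:
    """Stand-in for the 'robot' model: it only ever sees the 3D scene."""
    return min(scene, key=lambda p: p[0] ** 2 + p[1] ** 2 + p[2] ** 2)

def run_pipeline(front_end: Callable, raw_input) -> Point:
    """Compose any front end with the unchanged downstream stage."""
    return nearest_point(front_end(raw_input))

from_video = run_pipeline(scene_from_video, [[(0.5, 0.5), (2.0, 0.0)]])
from_lidar = run_pipeline(scene_from_lidar, [(3.0, 0.0, 0.0), (0.0, 0.1, 0.2)])
```

Synthetic 3D scenes for training the downstream stage could be passed straight to `nearest_point` in the same way, without going through either front end.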

liendolucas•7mo ago

I didn't understand a single word about this post and what was supposed to be solved and had to stop reading.

Was this actually written by a human being? If so, the author(s) suffer from severe language communication problems. Doesn't seem to be grounded at least with reality and my personal experience with robotics. But here's my real world take:

Robotics is going to be partially solved when ROS/ROS2 becomes effectively exterminated and completely replaced by a sane robotics framework.

I seriously urge the authors to use ROS/ROS2. Show us, implementing your solution with ROS, pushing it to a repository and allow others to verify what you solved, maybe? Suffer a bit with the framework and then write a real post about real robotics hands-on, and not just wander on fancy incomprehensible stuff that probably no-one will ever do.

Then we can maybe start talking about robotics.

rage4774•7mo ago

I totally agree with you. On the other hand the theory behind it -to combine image recognition to predict the outcome based on specific physical impacts- does sound intriguing and like a somewhat newer idea.

But besides that, you‘re totally right. It’s too „loose“ since to realize that idea the process would have to be way different (and properly explained)

w4•7mo ago

It is readily understandable if you are fluent in the jargon surrounding state of the art LLMs and deep learning. It’s completely inscrutable if you aren’t. The article is also very high level and disconnected from specifics. You can skip to FAIR’s paper and code (linked at the article’s end) for specifics: https://github.com/facebookresearch/vjepa2

If I had to guess, it seems likely that there will be a serious cultural disconnect as 20-something deep learning researchers increasingly move into robotics, not unlike the cultural disconnect that happened in natural language processing in the 2010s and early 20s. Probably lots of interesting developments, and also lots of youngsters excitedly reinventing things that were solved decades ago.

godelski•7mo ago

  > if you are fluent in the jargon surrounding state of the art LLMs and deep learning

It is definitely not following that jargon. Maybe it follows the tech influencer blog post jargon but I can definitively say it doesn't follow jargon used in research. Which, they are summarizing a research paper. Consequently they misinterpret things and use weird phrases like "actionable physics," which is self referential. "A" physics model is necessarily actionable. It is required to be a counterfactual model. While I can understand the rephrasing to clarify to a more general audience that's a completely different thing than "being fluent in SOTA work." It's literally the opposite...

Also, it definitely doesn't help that they remove all capitalization except in nouns.

YeGoblynQueenne•7mo ago

It's not a scholarly article but a blog post but you're still right to be frustrated at the very bad writing. I do get the jargon, despite myself, so I can translate: the authors of the blog post claim that machine learning for autonomous robotics is "solved" thanks to an instance of V-JEPA 2 trained on all videos on youtube. It isn't, of course, and the authors themselves point out the severe limitations of the otherwise promising approach (championed by Yann LeCun) when they say, in a notably more subdued manner:

>> the model is basically a diva about camera positioning. move the camera 10 degrees and suddenly it thinks left is right and up is down.

>> in practice, this means you have to manually fiddle with camera positions until you find the sweet spot. very scientific. much engineering.

>> long-horizon drift

>> try to plan more than a few steps ahead and the model starts hallucinating.

That is to say, not quite ready for the real world, V-JEPA 2 is.

But for those who don't get the jargon there's a scholarly article linked at the end of the post that is rather more sober and down-to-earth:

A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.

https://arxiv.org/abs/2506.09985

In other words, some interesting results, some new SOTA, some incremental work. But lots of work for a big team of a couple dozen researchers so there's good stuff in there almost inevitably.

godelski•7mo ago

  > Doesn't seem to be grounded at least with reality and my personal experience with robotics.

It also doesn't match my personal experience with physics nor ML, and I have degrees in both.

You cannot develop accurate world models through observation alone, full stop.

You cannot verify accurate world models through benchmarks alone, full stop.

These have been pain points in physics for centuries and have been the major pain point even before the quantum revolution. I mean if it were possible, we'd have solved physics long ago. You can find plenty of people going back thousands of years boldly claiming "there is nothing new to be learned in physics," yet it was never true and still isn't true even if we exclude quantum and relativity.

Side note: really the paper is "fine" but I wish we didn't put so much hype in academic writing. Papers should be aimed at other academics and not be advertisements (use the paper to write advertisements like IFLS or Quanta Magazine, but don't degrade the already difficult researcher-to-researcher communication). So I'm saying the experiments are fine and the work represents progress but it is over sold and the conclusions do not necessarily follow

Btw, the paper makes these mistakes too. It makes a very bold assumption that counterfactual models (aka a "world model") are learned. This cannot be demonstrated through benchmarking, it must be proven through interpretability.

Unfortunately, the tail is long and heavy... you don't need black swan events to disrupt these models, and boy does this annoying fact make it easy to "hack" these types of models. And frankly, I don't think we want robots operating in the wild (public spaces, as opposed to controlled spaces like a manufacturing floor) if I can make one think an iPhone is an Apple with just a sticky note. Sure, you can solve that precise example, but it's not hard to come up with others. It's a cat and mouse game, but remember, Jerry always wins.

poulpy123•7mo ago

No you didn't, and I don't even need to click on the link to know it

teleforce•7mo ago

It seems that for robotics and automation to work properly, AI models, including LLMs, YOLO, RL and others, need help from their distant cousins, namely logic, optimization and constraint programming, which can be grouped under intelligent automation, or IA [1],[2],[3],[4].

[1] Logic, Optimization, and Constraint Programming: A Fruitful Collaboration - John Hooker - CMU (2023) [video]:

https://www.youtube.com/live/TknN8fCQvRk

[2] "We Really Don't Know How to Compute!" - Gerald Sussman - MIT (2011) [video]:

https://youtube.com/watch?v=HB5TrK7A4pI

[3] Google OR-Tools:

https://developers.google.com/optimization

[4] MiniZinc:

https://www.minizinc.org/
