Gemini Robotics On-Device brings AI to local robotic devices

https://deepmind.google/discover/blog/gemini-robotics-on-device-brings-ai-to-local-robotic-devices/

133•meetpateltech•8h ago

Comments

suninsight•7h ago

This will not end well.

sajithdilshan•6h ago

I wonder what kind of guardrails (like Three Laws of Robotics) there are to prevent the robots going crazy while executing the prompts

hn_throwaway_99•6h ago

A power cord?

sajithdilshan•6h ago

what if they are battery powered?

msgodel•6h ago

Usually I put master disconnect switches on my robots just to make working on them safe. I use cheap toggle switches, though; I'm too cheap for the big red spinny ones.

pixl97•6h ago

[Robot learns to superglue the switch open]

msgodel•6h ago

It's only going to do that if you RL it with episodes that include people shutting it down for safety. The RL I've done with my models has all been in simulations that don't even simulate the switch.

pixl97•4h ago

Which will likely work only for on-machine AI, but it seems to me any very complicated actions/interactions with the world may require external interactions with LLMs which know these kinds of actions. Or in the future the on-device models will be far larger and more expansive, containing this kind of knowledge.

For example, what if you need to train the model to keep unauthorized people from shutting it off?

msgodel•3h ago

Having a robot near people with no master off switch sounds like a dumb idea.

bigyabai•5h ago

That's what we use twelve gauge buckshot for, here in America.

ctoth•6h ago

The laws of robotics were literally designed to cause conflict and facilitate strife in a fictional setting--I certainly hope no real goddamn system is built like that.

> To ensure robots behave safely, Gemini Robotics uses a multi-layered approach. "With the full Gemini Robotics, you are connecting to a model that is reasoning about what is safe to do, period," says Parada. "And then you have it talk to a VLA that actually produces options, and then that VLA calls a low-level controller, which typically has safety critical components, like how much force you can move or how fast you can move this arm."
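To make the last layer in that quote concrete: a low-level controller can simply clamp whatever the upstream model asks for to fixed speed and force limits. This is only a minimal sketch of that idea, not Gemini Robotics code; the `Limits` values and the `clamp_command` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    # Hypothetical limits; real values depend on the arm and its safety case.
    max_speed_m_s: float = 0.25
    max_force_n: float = 20.0


def clamp(value: float, bound: float) -> float:
    """Clamp value into the symmetric interval [-bound, bound]."""
    return max(-bound, min(bound, value))


def clamp_command(
    speed_m_s: float, force_n: float, limits: Limits = Limits()
) -> tuple[float, float]:
    """Return the (speed, force) the controller would actually execute."""
    return (
        clamp(speed_m_s, limits.max_speed_m_s),
        clamp(force_n, limits.max_force_n),
    )


if __name__ == "__main__":
    # An over-eager request from the model gets cut down to the limits.
    print(clamp_command(1.5, -80.0))
```

The point of putting this below the model is that it holds no matter what the model outputs; requests already inside the limits pass through unchanged.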

conception•6h ago

Of course someone will. The terror nexus doesn’t build itself, yet, you know.

hlfshell•5h ago

The generally accepted term for the research around this in robotics is Constitutional AI (https://arxiv.org/abs/2212.08073), and it has been cited/experimented with in several robotics VLAs.

JumpCrisscross•2h ago

Is there any evidence we have the technical ability to put such ambiguous guardrails on LLMs?

asadm•4h ago

in practice, those laws are bs.

suyash•6h ago

What sort of hardware does the SDK run on? Can it run on a modern Raspberry Pi?

ethan_smith•6h ago

According to the blog post, it requires an NVIDIA Jetson Orin with at least 8GB RAM, and they've optimized for Jetson AGX Orin (64GB) and Orin NX (16GB) modules.

v9v•6h ago

Could you quote where in the blog post they claim that? CTRL+F "Jetson" gave no results in TFA.

moffkalast•4h ago

Yeah they didn't really mention anything, I was almost getting my hopes up that Google might be announcing a modernized Coral TPU for the transformer age, but I guess not. It's probably all just API calls to their TPUv6 data centers lmao.

martythemaniak•4h ago

You can think of these as essentially multi-modal LLMs, which is to say you can have very small/fast ones (SmolVLA - 0.5B params) that are good at specific tasks, and larger/slower more general ones (OpenVLA - a finetuned llama2 7B). So a rpi could be used for some very specific tasks, but even the more general ones could run on beefy consumer hardware.
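A back-of-the-envelope way to check the "could it fit on an rpi" question is to estimate the memory needed for the weights alone at different precisions. This is a rough sketch: the parameter counts are the round numbers from the comment above, and the estimate ignores activations, caches, and the vision encoder's working memory.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return params_billions * bits_per_param / 8


# Round parameter counts taken from the comment, not from the model cards.
MODELS = {"SmolVLA (~0.5B)": 0.5, "OpenVLA (~7B)": 7.0}

for name, size in MODELS.items():
    for bits in (16, 8, 4):
        gb = weight_memory_gb(size, bits)
        print(f"{name}: {bits}-bit weights ~ {gb:.2f} GB")
```

Compare the result against the RAM on the board you have in mind. Fitting in memory says nothing about whether inference is fast enough for control, which is a separate question.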

Toritori12•6h ago

Does anyone know how easy it is to join the "trusted tester program" and if they offer modules that you can easily plug in to run the SDK?

martythemaniak•5h ago

I've spent the last few months looking into VLAs and I'm convinced that they're gonna be a big deal, i.e. they very well might be the "chatgpt moment for robotics" that everyone's been anticipating. Multimodal LLMs already have a ton of built-in understanding of images and text, so VLAs are just regular MMLLMs that are fine-tuned to output a specific sequence of instructions that can be fed to a robot.

OpenVLA, which came out last year, is a Llama2 fine tune with extra image encoding that outputs a 7-tuple of integers. The integers are rotation and translation inputs for a robot arm. If you give a vision llama2 a picture of an apple and a bowl and say "put the apple in the bowl", it already understands apples, bowls, knows the end state should be apple in bowl etc. What's missing is a series of tuples that will correctly manipulate the arm to do that, and the way they did it is through a large number of short instruction videos.

The neat part is that although everyone is focusing on robot arms manipulating objects at the moment, there's no reason this method can't be applied to any task. Want a smart lawnmower? It already understands "lawn", "mow", "don't destroy toy in path" etc, it just needs a finetune on how to correctly operate a lawnmower. Sam Altman made some comments about having self-driving technology recently and I'm certain it's a ChatGPT-based VLA. After all, if you give chatgpt a picture of a street, it knows what's a car, pedestrian, etc. It doesn't know how to output the correct turn/go/stop commands, and it does need a great deal of diverse data, but there's no reason why it can't do it. https://www.reddit.com/r/SelfDrivingCars/comments/1le7iq4/sa...

Anyway, super exciting stuff. If I had time, I'd rig a snowblower with a remote control setup, record a bunch of runs and get a VLA to clean my driveway while I sleep.
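To illustrate the "7-tuple of integers" idea from the comment above: each integer can be read as a bin index that gets mapped back to a continuous arm command. The sketch below assumes 256 uniform bins per dimension and a dx/dy/dz, roll/pitch/yaw, gripper layout; the `RANGES` values and the `decode_action` helper are made up for illustration and are not OpenVLA's actual code or normalization.

```python
from typing import Sequence

N_BINS = 256  # assumption: each action dimension is discretized into 256 bins

# Hypothetical (low, high) range per dimension:
# dx, dy, dz (metres), droll, dpitch, dyaw (radians), gripper (0 = open, 1 = closed)
RANGES = [(-0.05, 0.05)] * 3 + [(-0.25, 0.25)] * 3 + [(0.0, 1.0)]


def decode_action(bins: Sequence[int]) -> list[float]:
    """Map 7 integer bin indices to continuous values at the bin centers."""
    if len(bins) != len(RANGES):
        raise ValueError(f"expected {len(RANGES)} values, got {len(bins)}")
    action = []
    for index, (low, high) in zip(bins, RANGES):
        if not 0 <= index < N_BINS:
            raise ValueError(f"bin index {index} out of range")
        width = (high - low) / N_BINS
        action.append(low + (index + 0.5) * width)
    return action


if __name__ == "__main__":
    # One model output step: seven integers in, seven floats out.
    print(decode_action([200, 128, 90, 128, 128, 140, 255]))
```

The fine-tune then only has to learn which integers to emit for a given image and instruction; turning them into motor commands is plain arithmetic like this.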

ckcheng•4h ago

VLA = Vision-language-action model: https://en.wikipedia.org/wiki/Vision-language-action_model

Not https://public.nrao.edu/telescopes/VLA/ :(

For completeness, MMLLM = Multimodal Large language model.

generalizations•3h ago

I will be surprised if VLAs stick around, based on your description. That sounds far too low-level. Better hand that off to the 'nervous system' / kernel of the robot - it's not like humans explicitly think about the rotation of their hip & ankle when they walk. Sounds like a bad abstraction.

Workaccount2•3h ago

I don't think transformers will be viable for self driving cars until they can both:

1) Properly recognize what they are seeing without having to lean so hard on their training data. Go photoshop a picture of a cat and give it a 5th leg coming out of its stomach. No LLM will be able to properly count the cat's legs (they will keep saying 4 legs no matter how many times you insist they recount).

2) Be extremely fast at outputting tokens. I don't know where the threshold is, but it's probably going to be a non-thinking model (at first) and will probably need something like Cerebras or a diffusion architecture to get there.

martythemaniak•2h ago

1. Well, based on Karpathy's talks on Tesla FSD, his solution is to actually make the training set reflect everything you'd see in reality. The tricky part is that if something occurs 0.0000001% of the time IRL and something else occurs 50% of the time, they both need to make up 5% of the training corpus. The thing with multimodal LLMs is that lidar/depth input can just be another input that gets encoded along with everything else, so for driving "there's a blob I don't quite recognize" is still a blob you have to drive around.

2. Figure has a dual-model architecture which makes a lot of sense: a 7B model that does higher-level planning and control and runs at 8Hz, and a tiny 0.08B model that runs at 200Hz and does the minute control outputs. https://www.figure.ai/news/helix
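The dual-rate idea in point 2 can be sketched as a slow planner that refreshes a target a few times per second while a fast controller acts on the latest target every tick. This is a toy, single-threaded simulation using the 8Hz and 200Hz figures from the comment; `slow_planner` and `fast_controller` are hypothetical stand-ins, not Figure's models, and a real system would likely run the two asynchronously.

```python
FAST_HZ = 200  # low-level controller rate quoted in the comment
SLOW_HZ = 8    # high-level planner rate quoted in the comment
TICKS_PER_PLAN = FAST_HZ // SLOW_HZ  # 25 fast ticks per planner update


def slow_planner(tick: int) -> float:
    """Stand-in for the large model: returns a new target position."""
    return float(tick // TICKS_PER_PLAN)


def fast_controller(position: float, target: float, gain: float = 0.2) -> float:
    """Stand-in for the small model: proportional step toward the latest target."""
    return position + gain * (target - position)


def run(seconds: float = 1.0) -> tuple[int, int, float]:
    """Simulate the loop; returns (fast ticks, planner calls, final position)."""
    position, target = 0.0, 0.0
    plans = 0
    total_ticks = int(seconds * FAST_HZ)
    for tick in range(total_ticks):
        if tick % TICKS_PER_PLAN == 0:  # planner fires at 8 Hz
            target = slow_planner(tick)
            plans += 1
        position = fast_controller(position, target)  # controller fires at 200 Hz
    return total_ticks, plans, position


if __name__ == "__main__":
    ticks, plans, position = run(1.0)
    print(f"{ticks} controller steps, {plans} planner calls, position {position:.3f}")
```

The controller never waits on the planner: it keeps acting on a slightly stale target, which is what lets the big model be slow without making the robot's motion slow.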

baron816•4h ago

I’m optimistic about humanoid robotics, but I’m curious about the reliability issue. Biological limbs and hands are quite miraculous when you consider that they are able to constantly interact with the world, which entails some natural wear and tear, but then constantly heal themselves.

marinmania•4h ago

It does either get very exciting or very spooky thinking of the possibilities in the near future.

I had always assumed that such a robot would be very specific (like a cleaning robot) but it does seem like by the time they are ready they will be very generalizable.

I know they would require quite a few sensors and motors, but compared to self-driving cars their liability would be less and they would use far less material.

fragmede•3h ago

The exciting part comes when two robots are able to do repairs on each other.

pryelluw•3h ago

2 bots 1 bolt ?

marinmania•2h ago

I think this is the spooky part. I feel dumb saying it, but is there a point where they are able to coordinate and build a factory to build chips/more of themselves? Or other things entirely?

bamboozled•37m ago

Of course there is

didip•4h ago

I think those problems can be solved with further research in material science, no? Combine that with very responsive but low torque servos, and I think this is a solvable problem.

michaelt•1h ago

It's a simple matter of the number of motors you have. [1]

Assume every motor has a 1% failure rate per year.

A boring wheeled roomba has 3 motors. That's a 3.0% failure rate per year, and 8.6% failures over 3 years.

Assume a humanoid robot has 43 motors. That gives you a 35% failure rate per year, and 73% over 3 years. That ain't good.

And not only is the humanoid robot less reliable, it's also 14.3x the price - because it's got 14.3x as many motors in it.

[1] And bearings and encoders and gearboxes and control boards and stuff... but they're largely proportional to the number of motors.
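The percentages above follow from treating each motor as failing independently at a constant 1% per year, so the chance that at least one fails is 1 - 0.99^(motors x years). A small sketch to reproduce them, under exactly those assumptions (the 1% rate is the comment's premise, not measured data):

```python
def any_failure_probability(motors: int, annual_rate: float, years: int = 1) -> float:
    """P(at least one motor fails), assuming independent, constant-rate failures."""
    return 1.0 - (1.0 - annual_rate) ** (motors * years)


for name, motors in (("wheeled roomba", 3), ("humanoid", 43)):
    one = any_failure_probability(motors, 0.01, years=1)
    three = any_failure_probability(motors, 0.01, years=3)
    print(f"{name}: {one:.1%} per year, {three:.1%} over 3 years")
```

Swapping in a different per-motor rate shows how sensitive the humanoid figure is to that single assumption.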

mewpmewp2•1h ago

Would it be possible to reduce the failure rates?

michaelt•38m ago

To an extent, yes.

For example, an industrial robot arm with 6 motors achieves much higher reliability than a consumer roomba with 3 motors. They do this with more metal parts, more precision machining, much more generous design tolerances, and suchlike. Which they can afford by charging 100x as much per unit.

ac29•35m ago

The 1%/year failure rate appears to just be made up. There are plenty of electric motors that don't have anywhere near that failure rate (at least during the expected service life; failure rates will probably hit 1%/year or higher eventually).

For example, do the motors in hard drives fail anywhere close to 1% a year in the first ~5 years? Backblaze data gives a total drive failure rate around 1% and I imagine most of those are not due to failure of motors.

UltraSane•2h ago

Consumable components could be automatically replaced by other robots.

zzzeek•3h ago

THANK YOU.

Please make robots. LLMs should be put to work for *manual* tasks, not art/creative/intellectual tasks. The goal is to improve humanity, not put us to work putting screws inside of iphones

(five years later)

what do you mean you are using a robot for your drummer

Workaccount2•3h ago

I continue to be impressed by how Google stealth releases fairly groundbreaking products, and then (usually) just kind of forgets about them.

Rather than advertising blitz and flashy press events, they just do blog posts that tech heads circulate, forget about, and then wonder 3-4 years later "whatever happened to that?"

This looks awesome. I look forward to someone else building a start-up on this and turning it into a great product.

fusionadvocate•28m ago

Because the whole purpose of these kinds of projects at Google is to keep regulators at bay. They don't need these products in the sense of making money from them. They will just burn some money and move on, exactly the way they have done hundreds of times. But what kind of company has such a free pass to burn money? The kind of company that is a monopoly. Monopolies are THAT profitable.

jagger27•2h ago

These are going to be war machines, make absolutely no mistake about it. On-device autonomy is the perfect foil to escape centralized authority and accountability. There’s no human behind the drone to charge for war crimes. It’s what they’ve always dreamed of.

Who’s going to stop them? Who’s going to say no? The military contracts are too big to say no to, and they might not have a choice.

The elimination of toil will mean the elimination of humans altogether. That’s where we’re headed. There will be no profitable life left for you, and you will be liquidated by “AI-Powered Automation for Every Decision”[0]. Every. Decision. It’s so transparent. The optimists in this thread are baffling.

0: https://www.palantir.com/

mateus1•2h ago

MIT spinoff Google-owned Boston Dynamics pledged not to militarize their robots. Which is very hard to believe given they're backed by DARPA, the DoD/Military investment arm.

jagger27•2h ago

Militarize is just bad marketing. Call them cleaning machines and put them to work on dirty things.

paxys•2h ago

Was owned by Google. Then Softbank. Now Hyundai.

JumpCrisscross•2h ago

> These are going to be war machines, make absolutely no mistake about it

Of course they will. Practically everything useful has a military application. I'm not sure why this is considered a hot take.

jagger27•1h ago

The difference between this machine and the ones that came before is that there won’t have to be a human in the loop to execute mass murder.

bamboozled•8m ago

How would these things be competitive with drones on the battlefield? They probably cost the equivalent of 1000 autonomous drones and take 100x the time and materials to make. Way more power would be required to make them work too.

Terminator is a good movie but in reality, a cheap autonomous drone would mess one of those up pretty good.

I've seen some of the footage from Ukraine, drones are deadly, efficient, they are terrifying on the battlefield.

polskibus•1h ago

What is the model architecture? I'm assuming it's far away from LLMs, but I'm curious about knowing more. Can anyone provide links that describe architectures for VLA?

KoolKat23•1h ago

Actually very close to one I'd say.

It's a "vision language action" (VLA) model "built on the foundations of Gemini 2.0".

As Gemini 2.0 has native language, audio and video support, I suspect it has been adapted to include native "action" data too, perhaps only on output fine-tuning rather than input/output at training stage (given its Gemini 2.0 foundation).

Natively multimodal LLMs are basically brains.

martythemaniak•47m ago

OpenVLA is basically a slightly modified, fine-tuned llama2. I found the launch/intro talk by lead author to be quite accessible: https://www.youtube.com/watch?v=-0s0v3q7mBk

moelf•1h ago

The MuJoCo link actually points to https://github.com/google-deepmind/aloha_sim
