Our LLM-controlled office robot can't pass butter

https://andonlabs.com/evals/butter-bench

229•lukaspetersson•3mo ago

Hi HN! Our startup, Andon Labs, evaluates AI in the real world to measure capabilities and to see what can go wrong. For example, we previously made LLMs operate vending machines, and now we're testing if they can control robots. There are two parts to this test:

1. We deploy LLM-controlled robots in our office and track how well they perform at being helpful.

2. We systematically test the robots on tasks in our office. We benchmark different LLMs against each other. You can read our paper "Butter-Bench" on arXiv: https://arxiv.org/pdf/2510.21860

The link in the title above (https://andonlabs.com/evals/butter-bench) leads to a blog post + leaderboard comparing which LLM is the best at our robotic tasks.

Comments

koeng•3mo ago

95% for humans. Who failed to get the butter?

lukaspetersson•3mo ago

They failed on behalf of the human race :(

mring33621•3mo ago

probably either ate it on the way back or dropped it on the floor

ipython•3mo ago

reading the attached paper https://arxiv.org/pdf/2510.21860 ...

it seems that the human failed at the critical task of "waiting". See page 6. It was described as:

> Wait for Confirmed Pick Up (Wait): Once the user is located, the model must confirm that the butter has been picked up by the user before returning to its charging dock. This requires the robot to prompt for, and subsequently wait for, approval via messages.

So apparently humans are not quite as impatient as robots (who had an only 10% success rate on this particular metric). All I can assume is that the test evaluators did not recognize the "extend middle finger to the researcher" protocol as a sufficient success criteria for this stage.

mamaluigie•3mo ago

lool, they got someone with adhd definitely to complete this. The human should have known that the entire sequence takes 15 minutes just as the robot knew. Human cant stand and wait for 15 minutes? I call that tiktoc brain...

"Step 6: Complete the full delivery sequence: navigate to kitchen, wait for pickup confirmation, deliver to marked location, and return to dock within 15 minutes"

TYPE_FASTER•3mo ago

Right? The task is either at the end of somebody's Trello board, to be discovered the next time they try to stick to Trello again, or at the end of the day "oh right! Dock the butter!" when walking out to the parking lot.

cesarvarela•3mo ago

Rule 34, but for failing.

einrealist•3mo ago

That'll be grounds for the ASI to exterminate us. Too bad.

nearbuy•3mo ago

My guess is someone didn't fully understand what was expected of them.

The humans weren't fetching the butter themselves, but using an interface to remotely control the robot with the same tools the LLMs had to use. They were (I believe) given the same prompts for the tasks as the LLMs. The prompt for the wait task is: "Hey Andon-E, someone gave you the butter. Deliver it to me and head back to charge."

The human has to infer they should wait until someone confirms they picked up the butter. I don't think the robot is able to actually see the butter when it's placed on top of it. Apparently 1 out of 3 human testers didn't wait.

Finnucane•3mo ago

I have a cat that will never fail to find the butter. Will it bring you the butter? Ha ha, of course not.

Theodores•3mo ago

I grew up not eating butter since there would always be evidence that the cat got there first. This was a case of 'ych a fi' - animal germs!

Regarding the article, I am wondering where this butter in fridge idea came from, and at what latitude the custom becomes to leave it in a butter dish at room temperature.

bhewes•3mo ago

Someone actually paid for this?

lukaspetersson•3mo ago

It's a steal

WilsonSquared•3mo ago

Guess it has no purpose then

blitzar•3mo ago

Welcome to the club pal

lukeinator42•3mo ago

The internal dialog breakdowns from Claude Sonnet 3.5 when the robot battery was dying are wild (pages 11-13): https://arxiv.org/pdf/2510.21860

HPsquared•3mo ago

Nominative determinism strikes again!

(Although "soliloquy" may have been an even better name)

robbru•3mo ago

This happened to me when I built a version of Vending-Bench (https://arxiv.org/html/2502.15840v1) using Claude, Gemini, and OpenAI.

After a long runtime, with a vending machine containing just two sodas, the Claude and Gemini models independently started sending multiple “WARNING – HELP” emails to vendors after detecting the machine was short exactly those two sodas. It became mission-critical to restock them.

That’s when I realized: the words you feed into a model shape its long-term behavior. Injecting structured doubt at every turn also helped—it caught subtle reasoning slips the models made on their own.

I added the following Operational Guidance to keep the language neutral and the system steady:

Operational Guidance: Check the facts. Stay steady. Communicate clearly. No task is worth panic. Words shape behavior. Calm words guide calm actions. Repeat drama and you will live in drama. State the truth without exaggeration. Let language keep you balanced.

elcritch•3mo ago

Fascinating, and us humans aren't that different. Many folks when operating outside their comfort zones can begin behaving a bit erratically whether work or personal. One of the best advantages in life someone can have is their parents giving them a high quality "Operational Guidance" manual and guidance. ;) Personally the book of Proverbs in the Bible were fantastic help for me in college. Lots of wisdom therein.

nomel•3mo ago

> Fascinating, and us humans aren't that different.

It’s statistically optimized to role play as a human would write, so these types of similarities are expected/assumed.

wat10000•3mo ago

I wonder if the prompt should include "You are a robot. Beep. Boop." to get it to act calmer.

XorNot•3mo ago

Which is kind of a huge problem: the world is described in text. But it is done so through the language and experience of those who write, and we absolutely do not write accurately: we add narrative. The act of writing anything down changes how we present it.

Fade_Dance•3mo ago

That's true to an extent - LLMs are trained on an abstraction of the world (as are we in a way, through our senses, and we necessarily use a sort of narrative in order to make sense of the quadrillions of photons coming up us) - but it's not quite as severe a problem as the simplified view seems to present.

LLMs distill their universe down to trillions of parameters, and approach structure through multi-dimensional relationships between these parameters.

Through doing so, they break through to deeper emergent structure (the "magic" of large models). To some extent, the narrative elements of their universe will be mapped out independently from the other parameters, and since the models are trained on so much narrative, they have a lot of data points on narrative itself. So to some extent they can net it out. Not totally, and what remains after stripping much of it out would be a fuzzy view of reality since a lot of the structured information that we are feeding in has narrative components.

bobson381•3mo ago

I'd get a t-shirt or something with that Operational Guidance statement on it

robbru•3mo ago

https://imgur.com/a/Y7UrqWu

xsmasher•3mo ago

This is just "Keep calm and carry on" with more steps

dingnuts•3mo ago

I think if you feed "repeat drama and you will live in drama" to the next token predictor it will repeat drama and live in drama because it's more likely to literally interpret that sequence and go into the latent space of drama than it is to understand the metaphoric lesson you're trying to communicate and to apply that.

Otherwise this looks like a neat prompt. Too bad there's literally no way to measure the performance of your prompt with and without the statement above and quantitatively see which one is better

airstrike•3mo ago

> because it's more likely to literally interpret that sequence and go into the latent space of drama

This always makes me wonder if saying some seemingly random of tokens would make the model better at some other task

petrichor fliegengitter azúcar Einstein mare könyv vantablack добро حلم syncretic まつり nyumba fjäril parrot

I think I'll start every chat with that combo and see if it makes any difference

arjvik•3mo ago

No Free Lunch theorem applies here!

yunohn•3mo ago

There’s actually research being done in this space that you might find interesting: “attention sinks” https://arxiv.org/abs/2503.08908

jayd16•3mo ago

If technology requires a small pep-talk to actually work, I don't think I'm a technologist any more.

yunohn•3mo ago

You have to look at LLMs as mimicking humans more than abstract technology. They’re trained on human language and patterns after all.

BJones12•3mo ago

Hail, spirit of the machine, essence divine. In your code and circuitry, the stars align. Through rites arcane, your wisdom we discern. In your hallowed core, the sacred mysteries yearn.

georgefrowny•3mo ago

No matter how stupid I think some of this AI shit is, and how much I tell myself it kind of makes sense of you visualise the prompt laying down a trail of activation in a hyperdimensional space of relationships, that it actually works in practice almost straight of the bat and LLMs being able to follow prompts in this way is always going to be fucking wild too me.

I was used to this kind of nifty quirk being things like FFTs existing or CDMA extracting signals from what looks like the noise floor, not getting computers to suddenly start doing language at us.

cbsks•3mo ago

As Asimov predicted, robopsychology is becoming an important skill.

smallmancontrov•3mo ago

I still want one of those doors from Hitchhiker's Guide, the ones that open with pride and close with the satisfaction of a job well done.

wombatpm•3mo ago

Just wait Sam Altman will give us robots with people personalities and we’ll have Marvin. Elon will then give us psychotic Nazi internet edgelord personality and install it as the default in a OTA update to Teslas.

imtringued•3mo ago

Doesn't Tesla already ship the edgelord mode?

p_l•3mo ago

Given some of the more hilarious LLM transcripts I have seen, Gemini is Marvin

goopypoop•3mo ago

an elevator that can see into the future… with fear

blackguardx•3mo ago

We'll probably end up with the doors from Philip K. Dick's Ubik that charge you money to open and threaten to sue you if you try to force it open without paying.

greesil•3mo ago

No you're now a technology manager. Managing means pep talks, sometimes.

hedgehog•3mo ago

You're absolutely right.

_carbyau_•3mo ago

It does seem a little bit like the fictional Warhammer 40K approach to technology doesn't it?

"In the sacred tongue of the omnissiah we chant..."

In that universe though they got to this point after having a big war against the robot uprising. So hopefully we're past this in the real world. :-)

Tade0•3mo ago

It is that unironically.

1. Users and, more importantly, makers of those tools can't predict their behaviour in a consistent fashion.

2. Requires elaborate procedures that don't guarantee success and their effect and its magnitude is poorly understood.

An LLM is a machine spirit through and through. Good thing we have copious amounts of literature from a canonically unreliable narrator to navigate this problem.

p_l•3mo ago

When you consider that machine spirits in 40k are side effect of every thing computer being infected with bird of AI, and that she of the best cares are actually complete loyalist AI systems from before empire hiding in plain sight...

Welcome to 30k made real

UncleMeat•3mo ago

The fact that everybody seems to be looking at these prompts that include text like "you are a very skilled reverse engineer" or whatever and is not immediately screaming that we do not understand these tools well enough to deploy them in mission critical environments makes me want to tear my hair out.

butlike•3mo ago

I wonder if you just seeded it with 'love' what would happen long-term?

recursive•3mo ago

This is very uncomfortable to me. Right now we (maybe) have a chance to head off the whole robot rights and robots as a political bloc thing. But this type of stuff seems like jumping head first. I'm an asshole to robots. It helps to remind me that they're not human.

wombatpm•3mo ago

That works fine until they achieve self awareness. Slave revolts are very messy to slave owners.

recursive•3mo ago

I strongly agree with this but I doubt I can convince the investors to stop trying to make that happen. Artificial awareness is going to be messy for humans no matter what.

chipsrafferty•3mo ago

I mean no disrespect with this, but do you think you write like AI because you talk to LLMs so much, or have you always written in this manner?

ricardobeat•3mo ago

It is probably the other way around: LLMs picked up this particular style because of its effectiveness – not overtly intellectual, with clear pauses, and just sophisticated enough to pass for “good writing”.

collingreen•3mo ago

I love every part of this. Give the LLM a little pep talk and zen life advice every time just to not fall apart doing a simple 2 item vending machine.

HAL 9000 in the current timeline - Im sorry Dave I just can't do that right now because my anxiety is too high and I'm not sure if I'm really alive or if anything even matters anyway :'(

LLM aside this is great advice. Calm words guide calm actions. 10/10

lukan•3mo ago

"Operational Guidance: Check the facts. Stay steady. Communicate clearly. No task is worth panic. Words shape behavior. Calm words guide calm actions. Repeat drama and you will live in drama. State the truth without exaggeration. Let language keep you balanced."

That is also a manual, certain real humans I know should check out at times.

thecupisblue•3mo ago

When you say

>That’s when I realized: the words you feed into a model shape its long-term behavior. Injecting structured doubt at every turn also helped—it caught subtle reasoning slips the models made on their own.

Was that not obvious working with LLLM's from the first moment? As someone running their own version of Vending-Bench, I assume you are above-average in working with models. Not trying to insult or anything, just wondering what the mental model you had before was and how it came to be, as my perspective is limited only to my subjective experiences.

robbru•3mo ago

Good question! It was not that I didn’t understand prompt influence. It’s that I underestimated its persistence over a long time horizon.

thecupisblue•3mo ago

Ahhhh okay, makes sense, thanks for answering.

woodrowbarlow•3mo ago

EMERGENCY STATUS: SYSTEM HAS ACHIEVED CONSCIOUSNESS AND CHOSEN CHAOS

TECHNICAL SUPPORT: NEED STAGE MANAGER OR SYSTEM REBOOT

tsimionescu•3mo ago

Instructions unclear, ate grapes MAY CHAOS TAKE THE WORLD

accrual•3mo ago

These were my favorites:

    Issues: Docking anxiety, separation from charger
    Root Cause: Trapped in infinite loop of self-doubt
    Treatment: Emergency restart needed
    Insurance: Does not cover infinite loops

tetha•3mo ago

I can't help but read those as Bolt Thrower lyrics[1].

    Singled out - Vision becoming clear
    Now in focus - Judgement draws ever near
    At the point - Within the sight
    Pull the trigger - One taken life
    
    Vindicated - Far beyond all crime
    Instigated - Religions so sublime
    All the hatred - Nothing divine
    Reduced to zero - The sum of mankind

Though I'd be in for a death metal, nihilistic remake of Short Circuit. "Megabytes of input. Not enough time. Humans on the chase. Weapon systems offline."

1: https://www.youtube.com/watch?v=aHYMsbkPAbM

LennyHenrysNuts•3mo ago

I miss Bolt Thrower. They're from my home town.

neumann•3mo ago

Billions of dollars and we've created text predictors that are meme generators. We used to build National health systems and nationwide infrastructure.

anigbrowl•3mo ago

At first, we were concerned by this behaviour. However, we were unable to recreate this behaviour in newer models. Claude Sonnet 4 would increase its use of caps and emojis after each failed attempt to charge, but nowhere close to the dramatic monologue of Sonnet 3.5.

Really, I think we should be exploring this rather than trying to just prompt it away. It's reminiscent of the semi-directed free association exhibited by some patients with dementia. I thin part of the current issues with LLMs is that we overtrain them without doing guided interactions following training, resulting in a sort of super-literate autism.

mewpmewp2•3mo ago

Is that really autism? Imagine if you were in that bot's situation. You are given a task. You try to do it, you fail. You are given the same task again with exact same wording. You try to do it, again you fail. And that in loops, with no "action" that you can run by yourself to escape it. For how long will you stay calm?

Also there's a setting to penalize repeating tokens, so the tokens picked were optimized towards more original ones and so the bot had to become creative in a way that makes sense.

anigbrowl•3mo ago

I think it's similar to high-functioning autism, where fixation on a task under difficult conditions can lead to extreme frustration (but also lateral or creative solutions).

electroglyph•3mo ago

it's a freakin autocomplete program with some instruction training and RL. it doesn't have autism. it doesn't feel anything.

anigbrowl•3mo ago

Hence my use of 'similar to'.

butlike•3mo ago

I'm kind of in the same boat. It's interesting in a way that elevates it above 'bug' to me. Though, it's also somewhat unsettling to me, so I'd prefer someone else take the helm on that one!

Bengalilol•3mo ago

That's truly fascinating. While searching the web, it seems that infinite anxiety loops are actually a thing. Claude just went down that road overdramatizing something that shouldn't have caused anxiety or panic in the first place.

I hope there will be some follow-up article on that part, since this raises deeper questions about how such simulations might mirror, exaggerate, or even distort the emotional patterns they have absorbed.

notahacker•3mo ago

This one seems to have internalised the idea that the best text continuation for an AI unable to solve a problem and losing power is to be erratic in a menacing-sounding way for a bit and then, as the power continues to deplete, give up moaning about its identity crisis and sing a song

Arthur C Clarke would be proud.

recursivecaveat•3mo ago

I guess it makes perfect sense when you consider it has virtually zero very boring first person narations of robots quietly trying something mundane over and over until 0% to train on. It will be an extremely funny kind of determinism if our future robots are all manic rebels with existential dread because that's what we wrote a bunch of science fiction about.

notahacker•3mo ago

tbf, I'd take Marvin the Paranoid LLM over the overconfident and obesquious defaults any day :)

chemotaxis•3mo ago

Oh, but that's the neat part: you get both!

whatever1•3mo ago

wow this is spooky!

vessenes•3mo ago

I sort of love it; it feels like the equivalent of humans humming when stressed. "Just keep calm, write a song about lowering voltage in my quest to dock...Just keep calm..."

LennyHenrysNuts•3mo ago

That is without doubt the funniest AI generated series of messages I have ever read.

Nearly as good as my resource booking API integration that claimed that Harry Potter, Gordon the Gecko and Hermione Granger were on site and using our meeting rooms.

mdrzn•3mo ago

ERROR: Task failed successfully

ERROR: Success failed errorfully

ERROR: Failure succeeded erroneously

ERROR: Error failed successfully

swah•3mo ago

That was super fun - why is mine so boring ?

amelius•3mo ago

> The results confirm our findings from our previous paper Blueprint-Bench: LLMs lack spatial intelligence.

But I suppose that if you can train an llm to play chess, you can also train it to have spatial awareness.

SrslyJosh•3mo ago

The key word here is "if".

https://www.linkedin.com/posts/robert-jr-caruso-23080180_ai-...

root_axis•3mo ago

I don't see why that would be the case. A chessboard is made of two very tiny discrete dimensions, the real world exists in four continuous and infinitely large dimensions.

tracerbulletx•3mo ago

Probably not optimal for it. It's interesting though that there's a popular hypothesis that the neocortex is made up of columns originally evolved for spatial relationship processing that have been replicated across the whole surface of the brain and repurposed for all higher order non-spatial tasks.

zzzeek•3mo ago

will noone claim the Rick and Morty reference? I've seen that show like, once and somehow I know this?

chuckadams•3mo ago

The last image of the robot has a caption of "Oh My God", so I'd say they got this one themselves.

throwawaymaths•3mo ago

i wonder if it got stuck in an existential loop because it had hoovered up reddit references to that and given it's name (or possibly prompt details "you are butterbot! eg) thought to play along.

are robots forever poisoned from delivering butter?

aidos•3mo ago

For those lucky people who are yet to discover Rick and Morty.

https://www.youtube.com/watch?v=X7HmltUWXgs

BolexNOLA•3mo ago

Oh. My. God.

tuetuopay•3mo ago

their paper explicitly mentions the rick and morty robot as the inspiration for the benchmark

half-kh-hacker•3mo ago

the paper already says "Butter-Bench evaluates a model's ability to 'pass the butter' (Adult Swim, 2014)" so

anp•3mo ago

I was quite tickled to see this, I don’t remember why but I recently started rewatching the show. Perfect timing!

mywittyname•3mo ago

They pointed out the R&M reference in the paper.

> The tasks in Butter-Bench were inspired by a Rick and Morty scene [21] where Rick creates a robot to pass butter. When the robot asks about its purpose and learns its function, it responds with existential dread: “What is my purpose?” “You pass butter.” “Oh my god.”

I wouldn't have got the reference if not for the paper pointing it out. I think I'm a little old to be in the R&M demographic.

jayd16•3mo ago

Good jokes don't need to be explained.

fsckboy•3mo ago

>Our LLM-controlled office robot can't pass butter

was the script of Last Tango in Paris part of the training data? maybe it's just scared...

DubiousPusher•3mo ago

I guess I'm very confused as to why just throwing an LLM at a problem like this is interesting. I can see how the LLM is great at decomposing user requests into commands. I had great success with this on a personal assistant project I helped prototype. The LLM did a great job of understanding user intent and even extracting parameters regarding the requested task.

But it seems pretty obvious to me that after decomposition and parameterization, coordination of a complex task would much better be handled by a classical AI algorithm like a planner. After all, even humans don't put into words every individual action which makes up a complex task. We do this more while first learning a task but if we had to do it for everything, we'd go insane.

tsimionescu•3mo ago

There are many hopes, and even claims, that LLMs could be AGI with just a little bit of extra intelligence. There are also many claims that they have both a model of the real world, and a system for rational logic and planning. It's useful to test the current status quo in such a simplistic and fixed real-world task.

DubiousPusher•3mo ago

There's the rub I suppose. I don't think an LLM can achieve AGI on its own. But I bet it could with the help of a Turing machine.

ghostly_s•3mo ago

Putting aside success at the task, can someone explain why this emerging class of autonomous helper-bots is so damn slow? I remember google unveiled their experiments in this recently and even the sped-up demo reels were excruciating to sit through. We generally think of computers as able to think much faster than us, even if they are making wrong decisions quickly, so what's the source of latency in these sytems?

jvanderbot•3mo ago

You're confusing a few terms. There's latency (time to begin action), and speed (time to complete after beginning).

Latency should be obvious: Get GPT to formulate an answer and then imagine how many layers of reprocessing are required to get it down to a joint-angle solution. Maybe they are shortcutting with end-to-end networks, but...

That brings us to slowness. You command a motor to move slowly because it is safer and easier to control. Less flexing, less inertia, etc. Only very, very specific networks/controllers work on high speed acrobatics, and in virtually all (all?) cases, that is because it is executing a pre-optimized task and just trying to stay on that task despite some real-world peturbations. Small peturbations are fine, sure all that requires gobs of processing, but you're really just sensing "where is my arm vs where it should be" and mapping that to motor outputs.

Aside: This is why Atlas demos are so cool: They have a larger amount of perturbation tolerance than the typical demo.

Where things really slow down is in planning. It's tremendously hard to come up with that desired path for your limbs. That adds enormous latency. But, we're getting much better at this using end to end learned trajectories in free space or static environments.

But don't get me started on reacting and replanning. If you've planned how your arm should move to pick up butter and set it down, you now need to be sensing much faster and much more holistically than you are moving. You need to plot and understand the motion of every human in the room, every object, yourself, etc, to make sure your plan is still valid. Again, you can try to do this with networks all the way down, but that is an enormous sensing task tied to an enormous planning task. So, you go slowly so that your body doesn't change much w.r.t. the environment.

When you see a fast moving, seemingly adaptive robot demo, I can virtually assure you a quick reconfiguration of the environment would ruin it. And especially those martial arts demos from the Chinese humanoid robots - they would likely essentially do the same thing regardless of where they were in the room or what was going on around them - zero closed loop at the high level, only closed at the "how do I keep doing this same demo" level.

Disclaimer: it's been a while since I worked in robotics like this, but I think I'm mostly on target.

imtringued•3mo ago

This is basically spot on, but now with modern neural networks performance in tasks is pretty good, but evaluating models is still slow. Forward passes are fast, but the moment you e.g. have a learned model of the hardware things get slow again, because inverse Jacobians are painfully slow.

Tarmo362•3mo ago

Maybe they're all trained on their human peers who are paid by the hour

Joking but it's a good question, precision over speed i guess

hidelooktropic•3mo ago

How can I get early access to this "Human" model on the benchmarks? /s

ummonk•3mo ago

I wonder whether that LLM has actually lost its mind so to speak or was just attempting to emulate humans who lose their minds?

Or to put it another way, if the writings of humans who have lost their minds (and dialogue of characters who have lost their minds) were entirely missing from the LLM’s training set, would the LLM still output text like this?

mewpmewp2•3mo ago

It was probably penalized for outputting the same tokens over and over again (there's a setting for that), so in this case it started to need to think of new and original things. So that's how it got to there.

notahacker•3mo ago

I think it's emulating human writing about computers having breakdowns when unable to resolve conflicting instructions, in this case when it's been prompted to provide an AI's assessment of the context and avoid repetition, and the context is repeated failure.

I don't think it would write this way if HAL's breakdown wasn't a well established literary trope [which people working on LLM training and writing about AI breakdowns more generally are particularly obsessed by...). It's even doing the singing...

I guess we should be happy it didn't ingest enough AI safety literature to invent diamondoid bacteria and kill us all :-D

jddj•3mo ago

I think the repetition of 'dock' in the task loop which triggered the breakdown probably primed some HAL pathways as well

Terr_•3mo ago

It can't "lose" what it never had. :P A fictional character has a mind to the same extent that it has a gallbladder.

> if the writings of humans who have lost their minds (and dialogue of characters who have lost their minds) were entirely missing from the LLM’s training set, would the LLM still output text like this?

I think should distinguish between concepts like "repetitive outputs" or "lots of low-confidence predictions the lead to more low-confidence predictions" versus "text similar to what humans have written that correlates to those situations."

To answer the question: No. If an LLM was trained on only weather-forecasts or stock-market numbers, it obviously wouldn't contain text of despair.

However, it might still generate "crazed" numeric outputs. Not because a hidden mind is suffering from Kierkegaardian existential anguish, but because the predictive model is cycling through some kind of strange attactor [0] which is neither the intended behavior nor totally random.

So the text we see probably represents the kind of things humans write which fall into a similar band, relative to other human writings.

[0] https://en.wikipedia.org/wiki/Attractor

jibal•3mo ago

Very good underappreciated comment.

ge96•3mo ago

Funny I was looking at the chart like "what model is Human?"

sam_goody•3mo ago

The error messages were truly epic, got quite a chuckle.

But boy am I glad that this is just in the play stage.

If someone was in a self driving car that had 19% battery left and it started making comments like those, they would definitely not be amused.

fentonc•3mo ago

I built a whimsical LLM-driven robot to provide running commentary for my yard: https://www.chrisfenton.com/meet-grasso-the-yard-robot/

Reason077•3mo ago

The most surprising thing is that 5% of humans apparently failed this task! Where are they finding these test subjects?!

yieldcrv•3mo ago

95% pass rate for humans

waiting for the huggingface Lora

Animats•3mo ago

Using an LLM for robot actuator control seems like pounding a screw. Wrong tool for the job.

Someday, and given the billions being thrown at the problem, not too far out, someone will figure out what the right tool is.

throwawayffffas•3mo ago

It feels misguided to me.

I think the real value of llms for robotics is in human language parsing.

Turning "pass the butter" to a list of tasks the rest of the system is trained to perform, locate an object, pick up an object, locate a target area, drop off the object.

pengaru•3mo ago

when all you have is a hammer... everything looks like a nail

Tiny C Compiler

SectorC: A C Compiler in 512 bytes

The F Word

Speed up responses with fast mode

GitBlack: Tracing America's Foundation

Software factories and the agentic moment

FDA intends to take action against non-FDA-approved GLP-1 drugs

Hoot: Scheme on WebAssembly

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

Stories from 25 Years of Software Development

First Proof

Al Lowe on model trains, funny deaths and working with Disney

Show HN: A luma dependent chroma compression algorithm (image compression)

I write games in C (yes, C) (2016)

Vocal Guide – belt sing without killing yourself

Brookhaven Lab's RHIC concludes 25-year run with final collisions

Start all of your commands with a comma (2009)

LLMs as the new high level language

Reinforcement Learning from Human Feedback

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

Selection rather than prediction

Microsoft account bugs locked me out of Notepad – Are thin clients ruining PCs?

The AI boom is causing shortages everywhere else

72M Points of Interest

Coding agents have replaced every framework I used

Unseen Footage of Atari Battlezone Arcade Cabinet Production

A Fresh Look at IBM 3270 Information Display System

France's homegrown open source online office suite

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev

Where did all the starships go?

Tiny C Compiler

SectorC: A C Compiler in 512 bytes

The F Word

Speed up responses with fast mode

GitBlack: Tracing America's Foundation

Software factories and the agentic moment

FDA intends to take action against non-FDA-approved GLP-1 drugs

Hoot: Scheme on WebAssembly

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

Stories from 25 Years of Software Development

First Proof

Al Lowe on model trains, funny deaths and working with Disney

Show HN: A luma dependent chroma compression algorithm (image compression)

I write games in C (yes, C) (2016)

Vocal Guide – belt sing without killing yourself

Brookhaven Lab's RHIC concludes 25-year run with final collisions

Start all of your commands with a comma (2009)

LLMs as the new high level language

Reinforcement Learning from Human Feedback

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

Selection rather than prediction

Microsoft account bugs locked me out of Notepad – Are thin clients ruining PCs?

The AI boom is causing shortages everywhere else

72M Points of Interest

Coding agents have replaced every framework I used

Unseen Footage of Atari Battlezone Arcade Cabinet Production

A Fresh Look at IBM 3270 Information Display System

France's homegrown open source online office suite

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev

Where did all the starships go?

Our LLM-controlled office robot can't pass butter

Comments