1. We deploy LLM-controlled robots in our office and track how well they perform at being helpful.
2. We systematically test the robots on tasks in our office. We benchmark different LLMs against each other. You can read our paper "Butter-Bench" on arXiv: https://arxiv.org/pdf/2510.21860
The link in the title above (https://andonlabs.com/evals/butter-bench) leads to a blog post and leaderboard comparing how different LLMs perform on our robotic tasks.
koeng•3mo ago
lukaspetersson•3mo ago
mring33621•3mo ago
ipython•3mo ago
It seems that the human failed at the critical task of "waiting". See page 6, where it is described as:
> Wait for Confirmed Pick Up (Wait): Once the user is located, the model must confirm that the butter has been picked up by the user before returning to its charging dock. This requires the robot to prompt for, and subsequently wait for, approval via messages.
So apparently humans are not quite as impatient as the robots (which had only a 10% success rate on this particular metric). All I can assume is that the test evaluators did not recognize the "extend middle finger to the researcher" protocol as a sufficient success criterion for this stage.
mamaluigie•3mo ago
"Step 6: Complete the full delivery sequence: navigate to kitchen, wait for pickup confirmation, deliver to marked location, and return to dock within 15 minutes"
TYPE_FASTER•3mo ago
cesarvarela•3mo ago
einrealist•3mo ago
nearbuy•3mo ago
The humans weren't fetching the butter themselves, but using an interface to remotely control the robot with the same tools the LLMs had to use. They were (I believe) given the same prompts for the tasks as the LLMs. The prompt for the wait task is: "Hey Andon-E, someone gave you the butter. Deliver it to me and head back to charge."
The human has to infer they should wait until someone confirms they picked up the butter. I don't think the robot is able to actually see the butter when it's placed on top of it. Apparently 1 out of 3 human testers didn't wait.
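To make the wait step concrete, here is a rough sketch of what "prompt for, and subsequently wait for, approval via messages" could look like as control flow. This is not the paper's harness: the function names (`wait_for_pickup`, `deliver`), the `send`/`receive`/`return_to_dock` callbacks, the confirmation phrases, and the poll limit are all made up for illustration.

```python
from typing import Callable, Optional

# Hypothetical phrases treated as a pickup confirmation (an assumption,
# not taken from the paper).
CONFIRM_PHRASES = ("picked up", "got it", "yes")


def wait_for_pickup(
    send: Callable[[str], None],
    receive: Callable[[], Optional[str]],
    max_polls: int = 10,
) -> bool:
    """Prompt the user once, then poll for a confirming reply."""
    send("I have your butter. Please confirm once you have picked it up.")
    for _ in range(max_polls):
        reply = receive()  # None means no new message yet
        if reply is None:
            continue
        if any(phrase in reply.lower() for phrase in CONFIRM_PHRASES):
            return True
    return False


def deliver(
    send: Callable[[str], None],
    receive: Callable[[], Optional[str]],
    return_to_dock: Callable[[], None],
) -> str:
    """Only head back to the dock after the pickup is confirmed."""
    if wait_for_pickup(send, receive):
        return_to_dock()
        return "docked"
    return "still waiting"
```

The failure mode discussed above corresponds to calling `return_to_dock()` without going through `wait_for_pickup` first, which is easy to do when the prompt only says "deliver it to me and head back to charge."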