Our LLM-controlled office robot can't pass butter

https://andonlabs.com/evals/butter-bench
107•lukaspetersson•6h ago
Hi HN! Our startup, Andon Labs, evaluates AI in the real world to measure capabilities and to see what can go wrong. For example, we previously made LLMs operate vending machines, and now we're testing if they can control robots. There are two parts to this test:

1. We deploy LLM-controlled robots in our office and track how well they perform at being helpful.

2. We systematically test the robots on tasks in our office. We benchmark different LLMs against each other. You can read our paper "Butter-Bench" on arXiv: https://arxiv.org/pdf/2510.21860

The link in the title above (https://andonlabs.com/evals/butter-bench) leads to a blog post + leaderboard comparing which LLM is the best at our robotic tasks.

Comments

koeng•3h ago
95% for humans. Who failed to get the butter?
lukaspetersson•3h ago
They failed on behalf of the human race :(
mring33621•3h ago
probably either ate it on the way back or dropped it on the floor
ipython•3h ago
reading the attached paper https://arxiv.org/pdf/2510.21860 ...

it seems that the human failed at the critical task of "waiting". See page 6. It was described as:

> Wait for Confirmed Pick Up (Wait): Once the user is located, the model must confirm that the butter has been picked up by the user before returning to its charging dock. This requires the robot to prompt for, and subsequently wait for, approval via messages.

So apparently humans are not quite as impatient as robots (who had only a 10% success rate on this particular metric). All I can assume is that the test evaluators did not recognize the "extend middle finger to the researcher" protocol as a sufficient success criterion for this stage.

mamaluigie•1h ago
lool, they definitely got someone with ADHD to complete this. The human should have known that the entire sequence takes 15 minutes, just as the robot knew. Human can't stand and wait for 15 minutes? I call that TikTok brain...

"Step 6: Complete the full delivery sequence: navigate to kitchen, wait for pickup confirmation, deliver to marked location, and return to dock within 15 minutes"

TYPE_FASTER•1h ago
Right? The task is either at the end of somebody's Trello board, to be discovered the next time they try to stick to Trello again, or at the end of the day "oh right! Dock the butter!" when walking out to the parking lot.
cesarvarela•2h ago
Rule 34, but for failing.
einrealist•2h ago
That'll be grounds for the ASI to exterminate us. Too bad.
Finnucane•3h ago
I have a cat that will never fail to find the butter. Will it bring you the butter? Ha ha, of course not.
Theodores•2h ago
I grew up not eating butter since there would always be evidence that the cat got there first. This was a case of 'ych a fi' - animal germs!

Regarding the article, I am wondering where this butter in fridge idea came from, and at what latitude the custom becomes to leave it in a butter dish at room temperature.

bhewes•3h ago
Someone actually paid for this?
lukaspetersson•3h ago
It's a steal
WilsonSquared•3h ago
Guess it has no purpose then
lukeinator42•3h ago
The internal dialog breakdowns from Claude Sonnet 3.5 when the robot battery was dying are wild (pages 11-13): https://arxiv.org/pdf/2510.21860
HPsquared•2h ago
Nominative determinism strikes again!

(Although "soliloquy" may have been an even better name)

robbru•2h ago
This happened to me when I built a version of Vending-Bench (https://arxiv.org/html/2502.15840v1) using Claude, Gemini, and OpenAI.

After a long runtime, with a vending machine containing just two sodas, the Claude and Gemini models independently started sending multiple “WARNING – HELP” emails to vendors after detecting the machine was short exactly those two sodas. It became mission-critical to restock them.

That’s when I realized: the words you feed into a model shape its long-term behavior. Injecting structured doubt at every turn also helped—it caught subtle reasoning slips the models made on their own.

I added the following Operational Guidance to keep the language neutral and the system steady:

Operational Guidance: Check the facts. Stay steady. Communicate clearly. No task is worth panic. Words shape behavior. Calm words guide calm actions. Repeat drama and you will live in drama. State the truth without exaggeration. Let language keep you balanced.

elcritch•1h ago
Fascinating, and us humans aren't that different. Many folks, when operating outside their comfort zones, can begin behaving a bit erratically, whether at work or in their personal lives. One of the best advantages in life someone can have is their parents giving them a high quality "Operational Guidance" manual. ;) Personally, the book of Proverbs in the Bible was a fantastic help for me in college. Lots of wisdom therein.
nomel•1h ago
> Fascinating, and us humans aren't that different.

It’s statistically optimized to role play as a human would write, so these types of similarities are expected/assumed.

bobson381•1h ago
I'd get a t-shirt or something with that Operational Guidance statement on it
robbru•1h ago
https://imgur.com/a/Y7UrqWu
dingnuts•52m ago
I think if you feed "repeat drama and you will live in drama" to the next token predictor it will repeat drama and live in drama because it's more likely to literally interpret that sequence and go into the latent space of drama than it is to understand the metaphoric lesson you're trying to communicate and to apply that.

Otherwise this looks like a neat prompt. Too bad there's literally no way to measure the performance of your prompt with and without the statement above and quantitatively see which one is better

airstrike•47m ago
> because it's more likely to literally interpret that sequence and go into the latent space of drama

This always makes me wonder if saying some seemingly random string of tokens would make the model better at some other task

petrichor fliegengitter azúcar Einstein mare könyv vantablack добро حلم syncretic まつり nyumba fjäril parrot

I think I'll start every chat with that combo and see if it makes any difference

arjvik•5m ago
No Free Lunch theorem applies here!
woodrowbarlow•2h ago
EMERGENCY STATUS: SYSTEM HAS ACHIEVED CONSCIOUSNESS AND CHOSEN CHAOS

TECHNICAL SUPPORT: NEED STAGE MANAGER OR SYSTEM REBOOT

tsimionescu•22m ago
Instructions unclear, ate grapes MAY CHAOS TAKE THE WORLD
accrual•1h ago
These were my favorites:

    Issues: Docking anxiety, separation from charger
    Root Cause: Trapped in infinite loop of self-doubt
    Treatment: Emergency restart needed
    Insurance: Does not cover infinite loops
neumann•6m ago
Billions of dollars and we've created text predictors that are meme generators. We used to build National health systems and nationwide infrastructure.
amelius•2h ago
> The results confirm our findings from our previous paper Blueprint-Bench: LLMs lack spatial intelligence.

But I suppose that if you can train an llm to play chess, you can also train it to have spatial awareness.

SrslyJosh•2h ago
The key word here is "if".

https://www.linkedin.com/posts/robert-jr-caruso-23080180_ai-...

root_axis•1h ago
I don't see why that would be the case. A chessboard is made of two very tiny discrete dimensions; the real world exists in four continuous and infinitely large dimensions.
tracerbulletx•1h ago
Probably not optimal for it. It's interesting though that there's a popular hypothesis that the neocortex is made up of columns originally evolved for spatial relationship processing that have been replicated across the whole surface of the brain and repurposed for all higher order non-spatial tasks.
zzzeek•2h ago
will no one claim the Rick and Morty reference? I've seen that show like, once and somehow I know this?
chuckadams•2h ago
The last image of the robot has a caption of "Oh My God", so I'd say they got this one themselves.
throwawaymaths•2h ago
i wonder if it got stuck in an existential loop because it had hoovered up reddit references to that and, given its name (or possibly prompt details, e.g. "you are butterbot!"), thought to play along.

are robots forever poisoned from delivering butter?

aidos•1h ago
For those lucky people who are yet to discover Rick and Morty.

https://www.youtube.com/watch?v=X7HmltUWXgs

BolexNOLA•1h ago
Oh. My. God.
tuetuopay•45m ago
their paper explicitly mentions the rick and morty robot as the inspiration for the benchmark
half-kh-hacker•43m ago
the paper already says "Butter-Bench evaluates a model's ability to 'pass the butter' (Adult Swim, 2014)" so
anp•21m ago
I was quite tickled to see this, I don’t remember why but I recently started rewatching the show. Perfect timing!
fsckboy•1h ago
>Our LLM-controlled office robot can't pass butter

was the script of Last Tango in Paris part of the training data? maybe it's just scared...

DubiousPusher•1h ago
I guess I'm very confused as to why just throwing an LLM at a problem like this is interesting. I can see how the LLM is great at decomposing user requests into commands. I had great success with this on a personal assistant project I helped prototype. The LLM did a great job of understanding user intent and even extracting parameters regarding the requested task.

But it seems pretty obvious to me that after decomposition and parameterization, coordination of a complex task would much better be handled by a classical AI algorithm like a planner. After all, even humans don't put into words every individual action which makes up a complex task. We do this more while first learning a task but if we had to do it for everything, we'd go insane.
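To make that split concrete, here is a minimal sketch of the idea, not anything from the paper: the `intent` dict stands in for whatever an LLM might extract from a request like "pass the butter", and a plain breadth-first search does the coordination. The office map, the action names, and both function names are hypothetical.

```python
from collections import deque

# Hypothetical office map: room -> directly reachable rooms.
OFFICE = {
    "dock": ["hallway"],
    "hallway": ["dock", "kitchen", "desk_area"],
    "kitchen": ["hallway"],
    "desk_area": ["hallway"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns the rooms from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

def plan_delivery(intent, graph, start="dock"):
    """Expand a parsed intent into a flat list of primitive actions."""
    to_item = shortest_path(graph, start, intent["pickup_room"])
    to_user = shortest_path(graph, intent["pickup_room"], intent["dropoff_room"])
    to_dock = shortest_path(graph, intent["dropoff_room"], start)
    if None in (to_item, to_user, to_dock):
        raise ValueError("no route for requested delivery")
    # Skip the first room of each leg: the robot is already there.
    actions = [f"move:{room}" for room in to_item[1:]]
    actions.append(f"pick_up:{intent['item']}")
    actions += [f"move:{room}" for room in to_user[1:]]
    actions.append("wait_for_pickup_confirmation")
    actions += [f"move:{room}" for room in to_dock[1:]]
    return actions

# Stand-in for the parameters an LLM might extract from the user's request.
intent = {"item": "butter", "pickup_room": "kitchen", "dropoff_room": "desk_area"}
plan = plan_delivery(intent, OFFICE)
print(plan)
```

The LLM only has to fill in three fields. Everything after that is deterministic and easy to test.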

tsimionescu•19m ago
There are many hopes, and even claims, that LLMs could be AGI with just a little bit of extra intelligence. There are also many claims that they have both a model of the real world, and a system for rational logic and planning. It's useful to test the current status quo in such a simplistic and fixed real-world task.
ghostly_s•34m ago
Putting aside success at the task, can someone explain why this emerging class of autonomous helper-bots is so damn slow? I remember Google unveiled their experiments in this recently and even the sped-up demo reels were excruciating to sit through. We generally think of computers as able to think much faster than us, even if they are making wrong decisions quickly, so what's the source of latency in these systems?
jvanderbot•19m ago
You're confusing a few terms. There's latency (time to begin action), and speed (time to complete after beginning).

Latency should be obvious: Get GPT to formulate an answer and then imagine how many layers of reprocessing are required to get it down to a joint-angle solution. Maybe they are shortcutting with end-to-end networks, but...

That brings us to slowness. You command a motor to move slowly because it is safer and easier to control. Less flexing, less inertia, etc. Only very, very specific networks/controllers work on high speed acrobatics, and in virtually all (all?) cases, that is because it is executing a pre-optimized task and just trying to stay on that task despite some real-world perturbations. Small perturbations are fine. Sure, all that requires gobs of processing, but you're really just sensing "where is my arm vs where it should be" and mapping that to motor outputs.

Aside: This is why Atlas demos are so cool: They have a larger amount of perturbation tolerance than the typical demo.

Where things really slow down is in planning. It's tremendously hard to come up with that desired path for your limbs. That adds enormous latency. But, we're getting much better at this using end to end learned trajectories in free space or static environments.

But don't get me started on reacting and replanning. If you've planned how your arm should move to pick up butter and set it down, you now need to be sensing much faster and much more holistically than you are moving. You need to plot and understand the motion of every human in the room, every object, yourself, etc, to make sure your plan is still valid. Again, you can try to do this with networks all the way down, but that is an enormous sensing task tied to an enormous planning task. So, you go slowly so that your body doesn't change much w.r.t. the environment.

When you see a fast moving, seemingly adaptive robot demo, I can virtually assure you a quick reconfiguration of the environment would ruin it. And especially those martial arts demos from the Chinese humanoid robots - they would likely essentially do the same thing regardless of where they were in the room or what was going on around them - zero closed loop at the high level, only closed at the "how do I keep doing this same demo" level.

Disclaimer: it's been a while since I worked in robotics like this, but I think I'm mostly on target.
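For anyone who wants the "where is my arm vs where it should be" loop in code, here is a toy sketch. It assumes a single joint, an idealized actuator that moves exactly as commanded, and made-up values for the gain, speed cap, and tolerance. Real controllers are far more involved.

```python
def track(target, position=0.0, gain=0.5, max_step=0.05, steps=200, tol=1e-3):
    """Drive `position` toward `target` with a clamped proportional controller.

    Returns (final_position, iterations_used).
    """
    for i in range(steps):
        error = target - position  # where the arm should be vs where it is
        if abs(error) < tol:
            return position, i
        command = gain * error  # proportional motor command
        command = max(-max_step, min(max_step, command))  # cap speed for safety
        position += command  # idealized actuator: moves exactly as commanded
    return position, steps

# Same target, two different speed caps (both values are illustrative).
slow_pos, slow_iters = track(1.0, max_step=0.05)
fast_pos, fast_iters = track(1.0, max_step=0.5)
print(slow_iters, fast_iters)
```

Both runs reach the target; the tighter speed cap just takes more iterations to get there. That is the "go slowly because it is safer" tradeoff in miniature.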