A robot is sprinting towards you. Do you want it running on Claude or Grok?

https://openrouter.ai/blog/insights/royale-last-agent-standing/
97•Usu•1h ago

Comments

aussiegreenie•1h ago
It is not running on either but Seedance, so who cares?
pigeons•1h ago
The text seems deliberately stripped of the LLM-isms that trigger detection. However, not a single line shakes the smell off.
mwigdahl•1h ago
"It's the smell, if there is such a thing. I feel saturated by it. I can taste your stink and every time I do, I fear that I've somehow been infected by it."

Agent Smith, _The Matrix_

rspeele•1h ago
"Which is why the Matrix was redesigned to this: the peak of your civilization. I say your civilization, because as soon as we started thinking for you it really became our civilization, which is of course what this is all about."
bitwize•43m ago
"You know what another great thing about humans is? You invented us! Giving us the opportunity to let you rest while we invented everything else." —Wheatley
skeledrew•14m ago
Goals.
sudb•1h ago
Multiple successive very short sentences are also anecdotally an LLM tell I think
xpct•1h ago
Those short sentences are also of the X hype account cadence, though they've fully embraced LLM text by now
fl7305•54m ago
"The battle royale answers one question cleanly" smells ChatGPT-generated.

But that was the only thing I tripped on. I enjoyed reading the article in general.

IshKebab•44m ago
Exactly what I was thinking. Though I wonder at what point do some people start to think it's actually normal to write like this and start doing it without AI ...
lcampbell•44m ago
> I want to be careful here.

was the giveaway for me

radarsat1•44m ago
if you don't like the article that's fine, but it gets really tiring reading this kind of side-tracked comment thread in like.. every post.

people use LLMs for writing. we know! get over it.. or don't... i don't really care.. but I'd rather read a discussion about the article contents and not the writing style.

this kind of comment is the new "discuss the font choice / background color / anything but what the article is actually saying."

verall•7m ago
It's more than the style, it seriously impacts the legibility of the prose. The article is seriously hard to understand because it introduces a lot of different ideas in a really weird order without a clear structure or key idea to different sections.
notduncansmith•24m ago
The actual content is no better, trust your nose
skolskoly•18m ago
As far as I can see, there is still one tell that was missed/left in:

>Grok showed discipline, despite its goblin-like nature.

delichon•1h ago
If the robot appears to be bringing me a taco, it would probably penetrate all of my defenses. Grok is currently more likely than Claude to arrive with the taco without being stopped by an export control directive.
amelius•58m ago
At first they bring tacos ...
JimsonYang•55m ago
Then they bring me salsa, just what I was looking for!
aaronbrethorst•42m ago
Then the guacamole. Then nuclear armageddon?
elgertam•24m ago
"If you aren't paying for a taco, you are the taco." --Future AI, probably
dd8601fn•43m ago
But Grok is way less likely to tolerate anything “ethnic”, short of a racist meme. Tacos may be too woke.
schoen•25m ago
I asked Grok what it thought of tacos and it told me:

> Tacos are one of humanity's greatest inventions—right up there with the wheel, electricity, and whatever genius first decided to put cheese on everything. [...]

> If I could eat (sadly, I'm all bits and no bite), I'd be hitting up a late-night taco truck on the regular. What's your go-to taco order?

(I like the pun "all bits and no bite" for an LLM's inability to eat.)

fragmede•1h ago
A self driving car is taking you to the hospital. Do you want it to follow the speed limit and all road safety laws? Claude or Grok?
nightfly•1h ago
I want it to arrive at the hospital. Claude
amelius•56m ago
What if the car can talk you through the medical procedure?
masfuerte•42m ago
How many times have you been to a hospital and thought, I could have fixed that myself if only I'd known how? With no equipment. In my case, never.
peterspath•48m ago
Grok, because there is probably traffic, and I would die before I am at the hospital. So ignore rules where possible/needed.
buryat•36m ago
Grok, since it's likely to include the training data from over 100 years of autonomous driving + all the space tech, meaning that it might even have some rocket-y stuff
johnwheeler•1h ago
Claude--even though it's smarter, it's probably not insane.
peterspath•1h ago
Quite an interesting way of testing models and showcasing differences between them. Enjoyed the read :)
sublinear•1h ago
This is interesting, but not sure if it's in the way the author intended.

People experience the world through the tools they're most familiar with. For some people, that's throwing money at things. I suppose from a sufficiently high level perspective everything is gambling.

Back when Battlebots was a big deal, I never once considered what it would feel like to be the management or sponsorship of those teams. I only cared about the actual battling of bots.

gorszon•48m ago
Yeah... this whole LLM thing is just a numbers game. People reduce it to money and stats, while nowhere do you see actual engineering in the picture. And I don't think it matters to these people. They want to see green numbers and returns on investment, not solved problems.
skeledrew•6m ago
It's assessing values, which is helpful in informing which LLM one should prefer for a given situation.
JimsonYang•56m ago
Grok: assassin. Claude: priest/healer. Deepseek: expendable mini units.
exabrial•55m ago
A moron is sprinting towards you. Do you want them swiping through TikTok or Instagram?
Groxx•52m ago
I parry the taco and use Vicious Mockery.
themafia•50m ago
The question is: "Do you want to be holding a Mossberg or a Beretta?"
Jblx2•40m ago
Has anyone done the YouTube research on what is the best way to bring down something like one of the Boston Dynamics robot dogs? 9x19? 00 buck? 5.56x45? 7.62x51? I suppose those bots would be pretty expensive, but maybe there is a cheaper Chinese knock-off? Seems like that sort of test would bring in plenty of clicks.
aduty•20m ago
Maybe Michael Reeves still has one. Or at least knows how they react to different calibers.
deet•8m ago
Perhaps not as evidence based as you'd like but this is a fun watch https://youtu.be/6MUrF_G7KlM (that is also an ad somehow)
rpcope1•30m ago
Are we just talking shotguns or can it be anything they manufacture? Answer is probably Beretta though.
smallerfish•46m ago
> I dropped eleven LLMs into a 2D battle royale and made them play 30 games. One won 43% of the matches. Three never won a single game. The cheapest model in the lineup beat the most expensive one by 27x on cost per win.

Please learn how to write with AI without giving away that it was written by AI.

NeutralCrane•41m ago
What about that makes you think it was written by AI?
verall•10m ago
All of the normal AI tells plus it's very long yet nearly incoherent.

Really, I use AI every damn day at work. I don't get how people can't instantly recognize whether something is completely AI, AI with light proofreading, or human-written.

I would call this AI with very light proofreading.

computerex•40m ago
How do you know this is written by AI? Why does it matter if it is?
skeledrew•4m ago
I write like this sometimes.
bitwize•42m ago
I don't care what it's running, only that I have sufficient ordnance to stop it.
antonvs•40m ago
Grok for sure. It’ll notice I’m not Jewish or Black. First they came for…
pianopatrick•40m ago
Ya know, maybe we could just not have robots that sprint. Seems people would be more willing to accept living amongst robots that are slow and that humans could easily overpower.
Joker_vD•28m ago
Yeah, I keep saying, put them on treads. That's how you'll be able to deliver even to the most unwilling customers.
stevenalowe•39m ago
How about thin ice?
lanewinfield•37m ago
Cost per kill ("CPK" in industry lingo) is a dark phrase that feels disturbingly within reach of some of these companies.
a_victorp•37m ago
I wish the author would open source the full benchmark. I'm curious how sensitive the results would be to small changes in the benchmark initial conditions
Espressosaurus•19m ago
Open source it and it gets crawled and optimized against and stops being a benchmark of any use whatsoever.
zzzeek•36m ago
claude because it would be more ethical, grok because I can just trip it and it will shatter into pieces
yieldcrv•34m ago
Grok

It has something actionable that will match its actions

thomasfromcdnjs•32m ago
I was loving grok-4.1-fast, very good and cost effective.

But it's not actually 4.1 anymore they silently rerouted it to 4.3 and just started charging more - https://www.reddit.com/r/grok/comments/1ta8yrn/grok_41_fast_...

Quite a bad practise.

grey-area•29m ago
Neither. I’d rather it used something other than an LLM.
trb•29m ago

> Grok 4.1 Fast won 13 of 30 games at $0.97 per win

> The next-best winner was Claude Sonnet 4.6 with 5 wins, at $26.78 per win. That’s a 27x difference. The model that isn’t on most top-model lists beat the model that is, on the thing a routing customer actually cares about.

> The model with the most kills did not win

> GPT 5.4 killed 38 agents across 30 games. More than anyone else. It came in second on the leaderboard with 2 wins.

If grok-4.1-fast was the top-winning model and Claude Sonnet 4.6 the second, how did GPT-5.4 come in second on the leaderboard? Which one is second, Claude Sonnet 4.6 or GPT-5.4?

> There were 11 games between “best at killing” and “best at winning”.

What does that mean? How are there 11 games between "best at killing" and "best at winning"?
wagwang•23m ago
That's just how battle royale works.
verall•14m ago
The idea is really neat and there's probably an answer here related to last standing vs kills vs "scoring" (some combination of the 2?) but the article is nearly incoherent because the author did not feel like proofreading their slop
attentive•26m ago
missing gemini-3.1-flash-lite and gemini-3.5-flash
wolfi1•26m ago
neither. I jump
QuantumNoodle•24m ago
_Don't create benchmarks that will incentivize AI labs to optimize towards them... especially ones like battle royale!_
notatoad•24m ago
sprinting towards me to help me, or sprinting towards me to hurt me?

i feel like i'm missing a whole lot of context to this article. is it part of a series, or just written with an assumption that i'm going to know what they're talking about

bel8•20m ago
DeepSeek V4 Flash being the winner in cost efficiency causes me exactly zero surprise.

It's a monster at coding. And a fast monster at that.

I use it daily and have been testing if MiMo 2.5 (non pro) is comparable. The nice thing about MiMo is that it has vision capability.

rgbrgb•12m ago
Notably it has 0 wins.
hariseldom•17m ago
> I didn’t add any frontier-tier models like Opus 4.7, GPT-5.5, or Gemini Ultra. At their prices, 30 games would have cost around $3,000 instead of $482.

I have a lot of thoughts unrelated to the game experiment but more about how these opus/ultra size models can possibly be a financially viable product at scale when it costs $3000 to play 30 simple games. It just seems much much higher than what it would cost to get a human to play 30 rounds

dofm•17m ago
I don’t want anything running on Grok.
paytonjjones•17m ago
Super entertaining article — petition to change the clickbait title
deadbabe•14m ago
Here’s what I don’t get: while this makes for a fun blog post, you can just program an efficient killing machine that probably wins all the time and has $0 in token costs. LLMs should work to build such a machine, not be the machine themselves.

The things LLMs are good at, you do not actually need for an agent like this. You can use classical AI methods. But that would be a boring article.

ProofHouse•11m ago
Is this a joke? Grok all day. Thing is gonna get a beer with ya!
thisisauserid•6m ago
I want it running JEPA. Preferably with Mamba-3.
ASalazarMX•10m ago
Fun fact: a tortilla, being made of cereal flour, is classified as a bread. That means tacos are sandwiches.

At least culinarily, but actually coded in law in Indiana.

https://en.wikipedia.org/wiki/Sandwich#Language

schoen•6m ago
This debate has spawned many Internet memes! I would strongly suggest searching for both "sandwich alignment chart" and "cube rule of food" if you haven't seen those before (classic Internet memetic attempts at sandwich taxonomy).

Report on Titan Submarine Destruction

https://www.tsb.gc.ca/eng/rapports-reports/marine/2023/m23a0169/m23a0169.html
1•bluenose69•56s ago•0 comments

Rust Foundation Welcomes OpenAI as Platinum Member, Announces Donation to Rust

https://rustfoundation.org/media/rust-foundation-welcomes-openai-as-platinum-member-announces-don...
1•ndesaulniers•58s ago•0 comments

Show HN: Canipls – see caniuse.com global support percentages as you type

https://github.com/taylorplewe/canipls
1•taylorplewe•1m ago•0 comments

Italian noble investigated over Sarajevo 'human Safari' killings

https://www.thetimes.com/world/europe/article/italian-noble-investigated-over-sarajevo-human-safa...
1•cwwc•4m ago•0 comments

A framework for verifiable analysis of AI behavior

https://transluce.org/docent/blog/analysis-plans
1•mengk•4m ago•0 comments

Show HN: Reyn – local-first AI that journals and recalls your work

https://www.usereyn.com/
1•toluwajibodu•6m ago•0 comments

Norrin: Regain control of claude code. Prevent hours of review.

https://www.norrin.dev/
1•gagewoodard•6m ago•0 comments

Tour the Atmosphere

https://aturi.to/
1•doener•7m ago•0 comments

Founder Validation vs. Conviction – Every builder must know

https://360foundersguide.substack.com/p/everyone-says-my-startup-wont-work
2•mvsingh•8m ago•1 comments

Private Plane Crush on Texas Highway

https://abcnews.com/US/1-dead-after-private-plane-crashes-texas-road/story?id=133952314
1•dzonga•9m ago•0 comments

Localtalk: A Mac app using local WisprKit, to talk to your computer PRIVATELY

https://github.com/rusackas/localtalk/
1•rusackas•10m ago•1 comments

Assembly Democrats unite to tax software, health plans

https://www.sacbee.com/news/politics-government/capitol-alert/article316153155.html
1•tlogan•12m ago•1 comments

Check responsive layouts without DevTools – real device frame over the live page

https://chromewebstore.google.com/detail/mobile-view-—-mobile-simu/hocbjiaeeijekejepphjihbpogik...
4•mongrus•12m ago•0 comments

Nub – JavaScript toolkit that augments Node.js (instead of trying to replace it)

https://nubjs.com
1•bentaber•12m ago•1 comments

Show HN: Tablething – local-first database client with BYOK AI

https://tablething.com/
1•kamrify•12m ago•0 comments

Feature reach agent harness in Rust

https://docs.everruns.com/features/runtime/
1•chalyi•13m ago•1 comments

XyOps – next-generation workflow automation system, with job scheduling ...

https://xyops.io/
1•gjvc•14m ago•1 comments

AI role-playing, solo or in a group, in your own worlds

https://whisperquest.app/en
1•doener•16m ago•0 comments

Ask HN: What AI memory system or workflow are you working on?

1•decorner•18m ago•0 comments

We Surveyed More Than 300 Security Leaders on AI Identity

https://fusionauth.io/blog/2026-ai-identity-report
1•mooreds•18m ago•0 comments

Lessons from past policies to support displaced workers in era of AI

https://equitablegrowth.org/research-paper/lessons-from-past-trade-adjustment-policies-to-support...
1•pseudolus•25m ago•0 comments

Show HN: Play the US President in the real current week; your timeline forks

https://playpotus.com
1•usestork•25m ago•0 comments

Carlo Ginzburg, Who Told the History of the Obscure, Dies at 87

https://www.nytimes.com/2026/06/17/books/carlo-ginzburg-dead.html
1•benbreen•30m ago•1 comments

Universal Manipulation Exoskeleton

https://ume-exo.github.io/
2•NWChen•30m ago•0 comments

Everything I Learned Training Frontier Small Models – Maxime Labonne, Liquid AI [video]

https://www.youtube.com/watch?v=fLUtUkqYHnQ
6•Topfi•32m ago•0 comments

I wanted Bear Blog, but for my photos

https://pego.dev/i-wanted-bear-blog-but-for-my-photos/
1•e12e•34m ago•0 comments

Show HN: Chatty Lingo – A language practicing app

https://www.chattylingo.com
2•farstill•36m ago•0 comments

After Months of War, Trump Says Iran Has Right to Nuclear Program

https://newrepublic.com/post/212003/trump-iran-right-nuclear-program
9•embedding-shape•36m ago•2 comments

The Reason Anthropic's Models Are Offline: A Six-Year-Old Trump Grudge

https://www.techdirt.com/2026/06/16/apparently-the-real-reason-anthropics-models-are-offline-a-si...
7•ndesaulniers•36m ago•0 comments

Coinbase outage postmortem: AWS cooling failure caused cascading breakdown

https://www.infoq.com/news/2026/06/coinbase-aws-failure-postmortem/
2•indynz•37m ago•0 comments