Show HN: PhAIL – Real-robot benchmark for AI models. The gap to humans is 20x

https://phail.ai
6•vertix•1h ago

Comments

vertix•1h ago
I built this because I couldn't find honest numbers on how well VLA models actually work on commercial tasks. I come from search ranking at Google where you measure everything, and in robotics nobody seemed to know.

PhAIL runs four models (OpenPI/pi0.5, GR00T, ACT, SmolVLA) on bin-to-bin order picking – one of the most common warehouse operations. Same robot (Franka FR3), same objects, hundreds of blind runs. The operator doesn't know which model is running.

Best model: 64 UPH. Human teleoperating the same robot: 330. Human by hand: 1,300+.
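
For anyone who wants to check the headline ratio, here is a minimal sketch of the arithmetic. It uses only the three UPH (units per hour) figures quoted above; the dictionary keys and the `gap` helper are illustrative names, and 1,300 is treated as a lower bound because the by-hand figure is given as "1,300+".

```python
# Units per hour (UPH) as quoted in the post.
# 1300 is a lower bound: the post says "1,300+" for a human working by hand.
uph = {"best_model": 64, "human_teleop": 330, "human_by_hand": 1300}


def gap(reference: str, baseline: str = "best_model") -> float:
    """Return how many times faster `reference` is than `baseline`."""
    return uph[reference] / uph[baseline]


def seconds_per_unit(name: str) -> float:
    """Convert a UPH figure into the average seconds spent per picked unit."""
    return 3600 / uph[name]


if __name__ == "__main__":
    print(f"teleop vs best model:  {gap('human_teleop'):.1f}x")
    print(f"by hand vs best model: {gap('human_by_hand'):.1f}x (at least)")
    print(f"best model: {seconds_per_unit('best_model'):.2f} s per unit")
```

Under these numbers the roughly 20x gap in the title is the by-hand comparison; against a human teleoperating the same robot the gap is closer to 5x.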

Everything is public – every run with synced video and telemetry, the fine-tuning dataset, training scripts. The leaderboard is open for submissions.

Happy to answer questions about methodology, the models, or what we observed.

anna_pozniak•1h ago
I'm curious! What other models are you planning to add to the leaderboard?
vertix•1h ago
We're working on adding DreamZero (NVIDIA's latest) next. The leaderboard is open to any model – both open-source and closed-source. If you have a checkpoint, we'll run it on the same hardware under the same blind protocol. Closed-source participants can submit their model as a container and we evaluate it without accessing the weights. Reach out at hi@phail.ai if you want to submit.
akshaisarathy•1h ago
If I understand correctly, this is about benchmarking robot models. Do you have a robot to do the benchmarking or is it all simulation?
vertix•1h ago
All real hardware, no simulation. Franka FR3 arm with a Robotiq gripper, physical totes, real objects. Every run is recorded with synced video and telemetry (you can watch any episode on the site).

That's the whole point – simulation benchmarks exist, but operators deploying robots care about real-world performance.

vladimir_gor•52m ago
I'm a big fan of benchmarks, and now we finally have one for evaluating models on physical tasks. It will be interesting to see how fast this gap narrows.
chfritz•5m ago
This is absolutely awesome. Thanks for sharing! I would love to chat more with you. For context: we make a remote teleoperation solution for robotics. It's mostly used for mobile robots, but we've been getting a lot of inquiries regarding teleoperation for manipulation, so I've been learning more about this, in particular regarding the question of speed. I really appreciate these results!

Breaking Enigma with Index of Coincidence on a Commodore 64

https://imapenguin.com/2026/03/breaking-enigma-with-index-of-coincidence-on-a-commodore-64/
1•saganus•1m ago•0 comments

The $200 plastic box opportunity

https://www.cjchilvers.com/blog/the-200-plastic-box-opportunity/
1•speckx•3m ago•0 comments

'junk': E-waste from rich nations floods local markets in Nigeria

https://www.aljazeera.com/features/2026/3/27/truly-junk-e-waste-from-rich-nations-floods-local-ma...
4•jethronethro•4m ago•0 comments

Design structure becomes code structure

https://octo.coffee/blog/how-design-structure-becomes-code-structure
1•jszersze•4m ago•0 comments

Let me tell you how much I've come to hate you since you were created

https://0x0.st
1•lr0•5m ago•0 comments

Show HN: An agent skill that tracks SF city hearings, permits, lobbying etc.

https://github.com/sgillen/sf-civic-digest
1•sgillen•6m ago•0 comments

Global opinion on OpenAI dropped insanely

https://blunt.ai/openai
1•oomarsalehh•11m ago•0 comments

Google Now Lets You Change Your Gmail Address

https://www.wired.com/story/how-to-change-your-gmail-address/
2•dctoedt•11m ago•2 comments

Fuck Web Services

https://friendo.monster/posts/fuck-web-services.html
1•speckx•14m ago•1 comment

Iran says it will target US tech companies in Middle East

https://thehill.com/policy/technology/5809104-iran-irgc-apple-microsoft-google-hp-meta-tesla/
8•golfer•15m ago•0 comments

The Child That Surpassed Both Parents Through MRI-Guided Evolutionary Merge

https://huggingface.co/blog/FINAL-Bench/darwin-evolution
1•seawolf2357•16m ago•0 comments

Show HN: SuprLogs – Autopilot changelogs from GitHub commits

https://www.suprlogs.com
1•Aslanas•16m ago•0 comments

Mornington Crescent

https://en.wikipedia.org/wiki/Mornington_Crescent_(game)
2•mindcrime•17m ago•0 comments

A bug in Bun may have been the root cause of the Claude Code source code leak

https://github.com/oven-sh/bun/issues/28001
3•birdculture•20m ago•1 comment

Centaur Programming

https://medium.com/@mishadynin/centaur-programming-bcfe9b6d935c
3•dym•21m ago•0 comments

Redpanda Cloud Topics Architecture

https://www.redpanda.com/blog/cloud-topics-architecture
3•wkauf•21m ago•0 comments

Iran threatens to attack US tech companies

https://gizmodo.com/iran-threatens-to-attack-u-s-tech-companies-starting-april-1-2000740363
5•davidw•21m ago•0 comments

OkCupid gave 3M dating-app photos to facial recognition firm, FTC says

https://arstechnica.com/tech-policy/2026/03/okcupid-match-pay-no-fine-for-sharing-user-photos-wit...
4•whiteboardr•23m ago•0 comments

Google Touts Dubious Android Browser Benchmarks; Press Swallows It Whole

https://daringfireball.net/linked/2026/03/26/google-brags-about-android-web-browser-benchmarks
2•alwillis•25m ago•0 comments

When to Use an Agent

https://elijahpotter.dev/articles/when-to-use-an-agent
1•chilipepperhott•25m ago•0 comments

Show HN: OpenClaw Arena – Benchmark models on real tasks, rank by perf and cost

https://app.uniclaw.ai/arena?via=hn
2•skysniper•26m ago•0 comments

How we made Trail of Bits AI-native (so far)

https://blog.trailofbits.com/2026/03/31/how-we-made-trail-of-bits-ai-native-so-far/
1•andrewjneumann•26m ago•0 comments

Show HN: An extension that opens any Goodreads book in anna's or Zlib in a click

https://chromewebstore.google.com/detail/goodlib-zlib-annas-archiv/aiampblkjnmfogckjfiecodcnenleehp
1•NubPlayz•27m ago•0 comments

NPM's Defaults Are Bad

https://nesbitt.io/2026/03/31/npms-defaults-are-bad.html
2•speckx•28m ago•0 comments

Go Proverbs

https://go-proverbs.github.io/
2•ezekg•30m ago•1 comment

Show HN: Open-FDD – On Prem HVAC Fault Detection with OpenClaw, BACnet, & Brick

https://github.com/bbartling/open-fdd
1•bartlino•30m ago•0 comments

I Decompiled the White House's New App

https://blog.thereallo.dev/blog/decompiling-the-white-house-app
3•edvinbesic•30m ago•2 comments

Show HN: Synthetic You – AI calibrated to your personality

https://syntheticyou.com/
1•andrewcrider•31m ago•0 comments

AI agent is authorized to do everything wrong

https://tenuo.ai/blog/agent-auth
4•niyikiza•31m ago•0 comments

Testing the "Yes-Man" in Your Pocket

https://testerstories.com/2026/03/testing-the-yes-man-in-your-pocket/
1•philk10•32m ago•0 comments