Show HN: PhAIL – Real-robot benchmark for AI models. The gap to humans is 20x

6•vertix•1h ago

Comments

vertix•1h ago

I built this because I couldn't find honest numbers on how well VLA models actually work on commercial tasks. I come from search ranking at Google where you measure everything, and in robotics nobody seemed to know.

PhAIL runs four models (OpenPI/pi0.5, GR00T, ACT, SmolVLA) on bin-to-bin order picking – one of the most common warehouse operations. Same robot (Franka FR3), same objects, hundreds of blind runs. The operator doesn't know which model is running.
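For readers wondering what "blind" can look like in practice, here is a minimal sketch of one way to build a blinded run schedule. It is an illustration only, not PhAIL's actual tooling: the function name `make_blind_schedule`, the opaque run IDs, and the runs-per-model count are all assumptions made for the example. The operator sees only the shuffled IDs, while the ID-to-model key stays hidden until scoring is done.

```python
import random
import secrets


def make_blind_schedule(models, runs_per_model, seed=None):
    """Return (operator_view, key) for a blinded evaluation session.

    Hypothetical helper, not taken from the PhAIL codebase.
    operator_view: shuffled list of opaque run IDs the operator works through.
    key: mapping from run ID to model name, kept away from the operator.
    """
    rng = random.Random(seed)
    # One entry per planned run, then shuffle so the order reveals nothing.
    runs = [model for model in models for _ in range(runs_per_model)]
    rng.shuffle(runs)

    operator_view = []
    key = {}
    for model in runs:
        run_id = secrets.token_hex(4)  # opaque ID with no model information
        while run_id in key:           # regenerate on the rare collision
            run_id = secrets.token_hex(4)
        key[run_id] = model
        operator_view.append(run_id)
    return operator_view, key


# Model names as listed in the post; 3 runs each is an arbitrary example value.
models = ["OpenPI/pi0.5", "GR00T", "ACT", "SmolVLA"]
operator_view, key = make_blind_schedule(models, runs_per_model=3, seed=0)
print(operator_view)
```

The key would only be joined back to the results after every run has been scored, so the operator's expectations about a given model cannot influence how a run is handled.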

Best model: 64 UPH. Human teleoperating the same robot: 330. Human by hand: 1,300+.
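To make the "20x" in the title concrete, the arithmetic on the three figures above can be sketched as follows. The numbers come straight from the post; the helper names `gap` and `seconds_per_unit` are made up for this example, and "1,300+" is treated as a lower bound of 1300.

```python
# Throughput figures quoted in the post, in units per hour (UPH).
BEST_MODEL_UPH = 64
TELEOP_UPH = 330
BY_HAND_UPH = 1300  # the post says "1,300+", so this is a lower bound


def gap(reference_uph, other_uph):
    """How many times faster the reference is than the other."""
    return reference_uph / other_uph


def seconds_per_unit(uph):
    """Average cycle time per picked unit, in seconds."""
    return 3600 / uph


print(f"by hand vs best model: {gap(BY_HAND_UPH, BEST_MODEL_UPH):.1f}x")
print(f"teleop vs best model:  {gap(TELEOP_UPH, BEST_MODEL_UPH):.1f}x")
print(f"by hand vs teleop:     {gap(BY_HAND_UPH, TELEOP_UPH):.1f}x")
print(f"best model cycle time: {seconds_per_unit(BEST_MODEL_UPH):.2f} s per unit")
```

So the headline gap is roughly 20x against a human working by hand and roughly 5x against a human teleoperating the same robot, with the best model averaging a bit under a minute per unit.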

Everything is public – every run with synced video and telemetry, the fine-tuning dataset, training scripts. The leaderboard is open for submissions.

Happy to answer questions about methodology, the models, or what we observed.

anna_pozniak•1h ago

I'm curious! What other models are you planning to add to the leaderboard?

vertix•1h ago

We're working on adding DreamZero (NVIDIA's latest) next. The leaderboard is open to any model – both open-source and closed-source. If you have a checkpoint, we'll run it on the same hardware under the same blind protocol. Closed-source participants can submit their model as a container and we evaluate it without accessing the weights. Reach out at hi@phail.ai if you want to submit.

akshaisarathy•1h ago

If I understand correctly, this is about benchmarking robot models. Do you have a robot to do the benchmarking or is it all simulation?

vertix•1h ago

All real hardware, no simulation. Franka FR3 arm with a Robotiq gripper, physical totes, real objects. Every run is recorded with synced video and telemetry (you can watch any episode on the site).

That's the whole point – simulation benchmarks exist, but operators deploying robots care about real-world performance.

vladimir_gor•52m ago

I'm a big fan of benchmarks, and now we finally have one for evaluating models on physical tasks. It will be interesting to see how fast this gap narrows.

chfritz•5m ago

This is absolutely awesome. Thanks for sharing! I would love to chat more with you. For context: we make a remote teleoperation solution for robotics. It's mostly used for mobile robots, but we've been getting a lot of inquiries regarding teleoperation for manipulation, so I've been learning more about this, in particular regarding the question of speed. I really appreciate these results!

