I built this because I couldn't find honest numbers on how well VLA models actually work on commercial tasks. I come from search ranking at Google where you measure everything, and in robotics nobody seemed to know.
PhAIL runs four models (OpenPI/pi0.5, GR00T, ACT, SmolVLA) on bin-to-bin order picking – one of the most common warehouse operations. Same robot (Franka FR3), same objects, hundreds of blind runs. The operator doesn't know which model is running.
Best model: 64 UPH (units per hour). Human teleoperating the same robot: 330 UPH. Human by hand: 1,300+ UPH.
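To make the gap concrete, here is a rough back-of-the-envelope sketch that converts the three throughput figures into seconds per unit and into a ratio against the best model. The names `UPH`, `seconds_per_unit`, and `gap_table` are made up for illustration, and "1,300+" is treated as exactly 1,300, which is an assumption and understates the by-hand figure.

```python
# Throughput figures quoted above, in units per hour (UPH).
# ASSUMPTION: "1,300+" is taken as exactly 1300, so the by-hand gap is a lower bound.
UPH = {
    "best model": 64,
    "human teleoperation": 330,
    "human by hand": 1300,
}


def seconds_per_unit(uph: float) -> float:
    """Average time spent per picked unit, in seconds."""
    return 3600.0 / uph


def gap_table(uph: dict, baseline: str = "best model") -> dict:
    """For each entry, return seconds per unit and throughput relative to the baseline."""
    base = uph[baseline]
    return {
        name: {
            "seconds_per_unit": seconds_per_unit(value),
            "ratio_vs_baseline": value / base,
        }
        for name, value in uph.items()
    }


if __name__ == "__main__":
    for name, row in gap_table(UPH).items():
        print(
            f"{name}: {row['seconds_per_unit']:.1f} s/unit, "
            f"{row['ratio_vs_baseline']:.1f}x the best model"
        )
```

Put differently: the best model needs close to a minute per unit, teleoperation about 11 seconds, and a person working by hand under 3 seconds, which is roughly a 5x and a 20x gap respectively.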
Everything is public – every run with synced video and telemetry, the fine-tuning dataset, training scripts. The leaderboard is open for submissions.
Happy to answer questions about methodology, the models, or what we observed.
anna_pozniak•1h ago
I'm curious! What other models are you planning to add to the leaderboard?
vertix•1h ago
We're working on adding DreamZero (NVIDIA's latest) next. The leaderboard is open to any model – both open-source and closed-source. If you have a checkpoint, we'll run it on the same hardware under the same blind protocol. Closed-source participants can submit their model as a container and we evaluate it without accessing the weights. Reach out at hi@phail.ai if you want to submit.
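For anyone wondering what a blind protocol can look like in practice, here is a minimal sketch of one way to build a blinded run schedule. This is not PhAIL's actual tooling; the function name `make_blind_schedule`, the opaque run IDs, and the sealed key are all assumptions made for illustration. The idea is only that the operator sees shuffled run IDs while the mapping to models is kept elsewhere until scoring.

```python
import random
import secrets


def make_blind_schedule(models, runs_per_model, seed=None):
    """Build a shuffled run order where the operator only sees opaque IDs.

    Hypothetical helper: returns (operator_view, sealed_key), where
    operator_view is the ordered list of run IDs shown to the operator and
    sealed_key maps each run ID back to its model for later unblinding.
    """
    rng = random.Random(seed)

    # Every model gets the same number of runs, then the order is shuffled
    # so position in the schedule carries no information about the model.
    assignments = [model for model in models for _ in range(runs_per_model)]
    rng.shuffle(assignments)

    operator_view = []
    sealed_key = {}
    for model in assignments:
        run_id = secrets.token_hex(4)
        while run_id in sealed_key:  # regenerate on the rare collision
            run_id = secrets.token_hex(4)
        operator_view.append(run_id)
        sealed_key[run_id] = model
    return operator_view, sealed_key


if __name__ == "__main__":
    models = ["OpenPI/pi0.5", "GR00T", "ACT", "SmolVLA"]
    operator_view, sealed_key = make_blind_schedule(models, runs_per_model=3, seed=0)
    print("Operator sees:", operator_view)
```

The seed only fixes the shuffle order so a schedule can be audited later; the run IDs themselves come from `secrets` and differ on every call.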
akshaisarathy•1h ago
If I understand correctly, this is about benchmarking robot models. Do you have a robot to do the benchmarking or is it all simulation?
vertix•1h ago
All real hardware, no simulation. Franka FR3 arm with a Robotiq gripper, physical totes, real objects. Every run is recorded with synced video and telemetry (you can watch any episode on the site).
That's the whole point – simulation benchmarks exist, but operators deploying robots care about real-world performance.
vladimir_gor•52m ago
I'm a big fan of benchmarks, and now we finally have one to evaluate models on physical tasks.
It will be interesting to see how fast this gap narrows.
chfritz•5m ago
This is absolutely awesome. Thanks for sharing! I would love to chat more with you. For context: we make a remote teleoperation solution for robotics. It's mostly used for mobile robots, but we've been getting a lot of inquiries regarding teleoperation for manipulation, so I've been learning more about this, in particular regarding the question of speed. I really appreciate these results!