PhAIL runs four models (OpenPI/pi0.5, GR00T, ACT, SmolVLA) on bin-to-bin order picking – one of the most common warehouse operations. Same robot (Franka FR3), same objects, hundreds of blind runs. The operator doesn't know which model is running.
Best model: 64 UPH (units picked per hour). A human teleoperating the same robot: 330. A human picking by hand: 1,300+.
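For anyone unfamiliar with the metric, a minimal sketch of how a UPH figure like those above is derived — picks completed divided by elapsed hours. The function name and the run numbers here are illustrative, not from the benchmark's actual tooling:

```python
def uph(units_picked: int, elapsed_seconds: float) -> float:
    """Units per hour for a single run: picks scaled to an hourly rate."""
    return units_picked * 3600.0 / elapsed_seconds

# Illustrative example: 16 successful picks in a 15-minute (900 s) run.
print(uph(16, 900))  # 64.0 UPH
```

In practice a leaderboard number would aggregate this over many runs, but the per-run arithmetic is just this ratio.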
Everything is public – every run with synced video and telemetry, the fine-tuning dataset, training scripts. The leaderboard is open for submissions.
Happy to answer questions about methodology, the models, or what we observed.
[1] Vision-Language-Action: https://en.wikipedia.org/wiki/Vision-language-action_model