Show HN: PhAIL – Real-robot benchmark for AI models

https://phail.ai
17•vertix•4h ago
I built this because I couldn't find honest numbers on how well VLA models [1] actually work on commercial tasks. I come from search ranking at Google where you measure everything, and in robotics nobody seemed to know.

PhAIL runs four models (OpenPI/pi0.5, GR00T, ACT, SmolVLA) on bin-to-bin order picking – one of the most common warehouse operations. Same robot (Franka FR3), same objects, hundreds of blind runs. The operator doesn't know which model is running.

Best model: 64 UPH (units per hour). Human teleoperating the same robot: 330. Human by hand: 1,300+.
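
To put those three figures on one scale, the gap can be worked out directly. This is a minimal sketch, not part of PhAIL: it assumes UPH means units per hour and treats "1,300+" as a lower bound of 1,300, and the names `THROUGHPUT_UPH`, `seconds_per_unit`, and `speedup_over` are illustrative.

```python
# Throughput figures quoted in the post, in units per hour (UPH).
# Assumption: "1,300+" is treated as a lower bound of 1,300.
THROUGHPUT_UPH = {
    "best_model": 64,
    "human_teleop": 330,
    "human_by_hand": 1300,
}


def seconds_per_unit(uph: float) -> float:
    """Average time to handle one unit at a given hourly throughput."""
    return 3600 / uph


def speedup_over(baseline: str, other: str) -> float:
    """How many times higher the throughput of `other` is than `baseline`."""
    return THROUGHPUT_UPH[other] / THROUGHPUT_UPH[baseline]


if __name__ == "__main__":
    for name, uph in THROUGHPUT_UPH.items():
        print(f"{name}: {uph} UPH, about {seconds_per_unit(uph):.1f} s per unit")
    print(f"teleop vs best model: {speedup_over('best_model', 'human_teleop'):.1f}x")
    print(f"by hand vs best model: {speedup_over('best_model', 'human_by_hand'):.1f}x")
```

Under those assumptions the best model averages roughly 56 s per unit, against about 11 s for teleoperation and under 3 s by hand, so the gaps are about 5x and 20x.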

Everything is public – every run with synced video and telemetry, the fine-tuning dataset, training scripts. The leaderboard is open for submissions.

Happy to answer questions about methodology, the models, or what we observed.

[1] Vision-Language-Action: https://en.wikipedia.org/wiki/Vision-language-action_model

Comments

anna_pozniak•4h ago
I'm curious! What other models are you planning to add to the leaderboard?
vertix•4h ago
We're working on adding DreamZero (NVIDIA's latest) next. The leaderboard is open to any model – both open-source and closed-source. If you have a checkpoint, we'll run it on the same hardware under the same blind protocol. Closed-source participants can submit their model as a container and we evaluate it without accessing the weights. Reach out at hi@phail.ai if you want to submit.
akshaisarathy•4h ago
If I understand correctly, this is about benchmarking robot models. Do you have a robot to do the benchmarking or is it all simulation?
vertix•4h ago
All real hardware, no simulation. Franka FR3 arm with a Robotiq gripper, physical totes, real objects. Every run is recorded with synced video and telemetry (you can watch any episode on the site).

That's the whole point – simulation benchmarks exist, but operators deploying robots care about real-world performance.

vladimir_gor•3h ago
I'm a big fan of benchmarks and now finally we have one to evaluate models on physical tasks. Will be interesting to see how fast this gap will narrow.
chfritz•2h ago
This is absolutely awesome. Thanks for sharing! I would love to chat more with you. For context: we make a remote teleoperation solution for robotics. It's mostly used for mobile robots, but we've been getting a lot of inquiries regarding teleoperation for manipulation, so I've been learning more about this, in particular regarding the question of speed. I really appreciate these results!
vertix•2h ago
Feel free to reach out to me at hi at phail dot ai
apetrovicheva•1h ago
This is amazing. Loved watching the videos with real-world attempts.

Finally, a real benchmark instead of polished teleoperated Twitter videos. It shows the real state of a super important industry, and there’s a lot of work to do.

Show HN: Postgres extension for BM25 relevance-ranked full-text search

https://github.com/timescale/pg_textsearch
45•tjgreen•4h ago•13 comments

Show HN: Forkrun – NUMA-aware shell parallelizer (50×–400× faster than parallel)

https://github.com/jkool702/forkrun
73•jkool702•4d ago•10 comments

Show HN: Cerno – CAPTCHA that targets LLM reasoning, not human biology

https://cerno.sh
9•plawlost•1h ago•18 comments

Show HN: Loreline, narrative language transpiled via Haxe: C++/C#/JS/Java/Py/Lua

https://loreline.app/en/docs/technical-overview/
46•jeremyfa•3d ago•14 comments

Show HN: EU Leadership – Live API data site comparing Europe to the world

https://ajh.ovh/
5•aureljohn•1h ago•1 comment

Show HN: Sundial – a new way to look at a weather forecast

https://sundial.page/
22•izaidi•5h ago•10 comments

Show HN: Multi-agent autoresearch for ANE inference beats Apple's CoreML by 6×

https://www.ensue-network.ai/lab/ane
4•christinetyip•1h ago•0 comments

Show HN: Hyprmoncfg – Terminal-based monitor config manager for Hyprland

https://paolino.me/hyprmoncfg-monitor-configuration-for-hyprland/
10•earcar•4h ago•4 comments

Show HN: I turned a sketch into a 3D-print pegboard for my kid with an AI agent

https://github.com/virpo/pegboard
62•virpo•21h ago•16 comments

Show HN: Pardus Browser – a browser for AI agents without Chromium

https://github.com/JasonHonKL/PardusBrowser/tree/main
15•JasonHEIN•10h ago•8 comments

Show HN: Coasts – Containerized Hosts for Agents

https://github.com/coast-guard/coasts
89•jsunderland323•1d ago•37 comments

Show HN: Margo – Find the font your brain reads fastest

https://margo.fyi/
2•theseidel•3h ago•1 comment

Show HN: Lazy-tool: reducing prompt bloat in MCP-based agent workflows

https://github.com/rpgeeganage/lazy-tool
18•like-to-code1•4h ago•2 comments

Show HN: ClawDesk – Agent orchestration layer on top of OpenClaw

https://github.com/glassrun/clawdesk
3•glassrun•4h ago•0 comments

Show HN: I built a self-hosted Fly.io engine using Go and Firecracker

https://github.com/herd-core/herd
4•sankalpnarula•4h ago•0 comments

Show HN: Solitaire – identity layer for AI agents, not just another memory tool

https://github.com/PRDicta/Solitaire-for-Agents
4•dictadev•4h ago•1 comment

Show HN: LogicStamp – A Context Compiler for TypeScript

https://logicstamp.dev
3•AmiteK•5h ago•1 comment

Show HN: INTERCALsky.ATproto client.Ada carries packets.INTERCAL carries meaning

https://github.com/FormerLab/intercalsky
4•FormerLabFred•5h ago•0 comments

Show HN: Gravimera, AI(LLM) driven 3D world editor and explorer

https://github.com/gravimera/gravimera
4•FlowWei•5h ago•0 comments

Show HN: Prawduct, a product development framework for Claude Code

https://github.com/brookstalley/prawduct
3•brookst•5h ago•1 comment

Show HN: Reprompt – Analyze what you type into AI tools, not what they output

https://github.com/reprompt-dev/reprompt
3•LuxBennu•5h ago•3 comments

Show HN: DeepTable – an API that converts messy Excel files into structured data

https://docs.deeptable.com/
4•francisrafal•5h ago•0 comments

Show HN: PromptQL – AI-Native Slack

https://promptql.io
4•argo12•5h ago•0 comments

Show HN: Vibe Check – UX Benchmark for vibe designs

https://vibecheck.appvelocity.io
4•aEJ04Izw5HYm•6h ago•1 comment

Show HN: Trama – Stop writing agent orchestration

https://github.com/NaNhkNaN/trama
6•NaNhkNaN•6h ago•0 comments

Show HN: Wageslave – I quit my soul sucking job to make a game about it

https://cauldron.itch.io/wageslave
7•stonecauldron•4h ago•2 comments

Show HN: Rust UEFI UI Lib

https://github.com/sloev/uefi-ui
4•supernihil•7h ago•2 comments

Show HN: Signboard – Kanban app lists are folders and cards are Markdown files

https://cdevroe.com/signboard/
4•cdevroe•8h ago•1 comment

Show HN: An extension that opens any Goodreads book in anna's or Zlib in a click

https://chromewebstore.google.com/detail/goodlib-zlib-annas-archiv/aiampblkjnmfogckjfiecodcnenleehp
2•NubPlayz•3h ago•2 comments