We built PA Bench (Personal Assistant Benchmark) to evaluate frontier computer-use and web-use models on multi-step workflows across simulated clones of Gmail and Calendar.
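To make the task format concrete, here's a rough sketch of what a single multi-step task paired with a programmatic verifier over the simulated app state could look like. All class names, fields, and the example check below are hypothetical illustrations, not the benchmark's actual schema:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-ins for the simulated app state and task format;
# PA Bench's real schema may look quite different.
@dataclass
class WorldState:
    emails: list            # messages in the simulated Gmail clone
    calendar_events: list   # events in the simulated Calendar clone

@dataclass
class Task:
    instruction: str                        # natural-language ask given to the agent
    verify: Callable[[WorldState], bool]    # checks the final world state after the run

def verify_reschedule(state: WorldState) -> bool:
    """Pass only if the 1:1 was moved to Friday AND Sam was emailed about it."""
    moved = any(
        e["title"] == "1:1 with Sam" and e["day"] == "Friday"
        for e in state.calendar_events
    )
    notified = any(
        "sam@example.com" in e["to"] and e.get("sent", False)
        for e in state.emails
    )
    return moved and notified

example_task = Task(
    instruction="Move my 1:1 with Sam to Friday at the same time and "
                "send Sam a short email letting them know.",
    verify=verify_reschedule,
)
```

In this sketch the verifier only inspects the end state, so any sequence of UI actions that reaches it would count as a pass.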
*What’s next:*
We’re currently scaling the dataset to tasks that span 3+ tabs and building more high-fidelity simulations of common enterprise workflows. We’d love to hear feedback on the benchmark, and notes on what was or wasn’t surprising about the results.
Blog post: https://vibrantlabs.com/blog/pa-bench
Some of the things we’re exploring:
1. Automated task and verifier generation (a rough sketch follows this list)
2. Synthesizing coherent worlds for evaluating and training agents
3. Continual learning setups for long-horizon agents
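On (1), as a rough illustration only (not our actual pipeline): template-based generation could pair a sampled instruction with a closure that verifies the resulting world state. Every name, field, and template here is a hypothetical stand-in:

```python
import random

# Hypothetical template-based generator; names, fields, and templates
# are illustrative stand-ins, not the real generation pipeline.
CONTACTS = ["Sam", "Priya", "Diego"]
DAYS = ["Monday", "Wednesday", "Friday"]

def generate_task(rng: random.Random):
    """Sample one (instruction, verifier) pair from a simple template."""
    contact, day = rng.choice(CONTACTS), rng.choice(DAYS)
    instruction = f"Move my 1:1 with {contact} to {day} and email them about the change."

    def verify(state: dict) -> bool:
        # 'state' is a snapshot of the simulated apps after the agent runs.
        moved = any(
            e["title"] == f"1:1 with {contact}" and e["day"] == day
            for e in state["calendar_events"]
        )
        notified = any(contact.lower() in e["to"].lower() for e in state["sent_emails"])
        return moved and notified

    return instruction, verify

rng = random.Random(0)
tasks = [generate_task(rng) for _ in range(100)]  # 100 instruction/verifier pairs
```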
Would love to talk with anyone who's interested in learning more!