We built PA Bench (Personal Assistant Benchmark) to evaluate frontier computer-use and web-use models on multi-step workflows across simulated clones of Gmail and Calendar.
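To make the task format concrete, here's a rough sketch of what a single multi-step task paired with a programmatic verifier over the simulated app state could look like. All class names, fields, and the example check below are hypothetical illustrations, not the benchmark's actual schema:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-ins for the simulated app state and task format;
# PA Bench's real schema may look quite different.
@dataclass
class WorldState:
    emails: list            # messages in the simulated Gmail clone
    calendar_events: list   # events in the simulated Calendar clone

@dataclass
class Task:
    instruction: str                        # natural-language ask given to the agent
    verify: Callable[[WorldState], bool]    # checks the final world state after the run

def verify_reschedule(state: WorldState) -> bool:
    """Pass only if the 1:1 was moved to Friday AND Sam was emailed about it."""
    moved = any(
        e["title"] == "1:1 with Sam" and e["day"] == "Friday"
        for e in state.calendar_events
    )
    notified = any(
        "sam@example.com" in e["to"] and e.get("sent", False)
        for e in state.emails
    )
    return moved and notified

example_task = Task(
    instruction="Move my 1:1 with Sam to Friday at the same time and "
                "send Sam a short email letting them know.",
    verify=verify_reschedule,
)
```

In this sketch the verifier only inspects the end state, so any sequence of UI actions that reaches it would count as a pass.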
*What’s next:*
We’re currently scaling the dataset to tasks that span 3+ tabs and building more high-fidelity simulations of common enterprise workflows. We’d love to hear feedback on the benchmark, and notes on what was or wasn’t surprising about the results.
Blog post: https://vibrantlabs.com/blog/pa-bench
Some of the things we’re exploring:
1. Automated task and verifier generation (a rough sketch follows this list)
2. Synthesizing coherent worlds for evaluating and training agents
3. Continual learning setups for long-horizon agents
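On (1), as a rough illustration only (not our actual pipeline): template-based generation could pair a sampled instruction with a closure that verifies the resulting world state. Every name, field, and template here is a hypothetical stand-in:

```python
import random

# Hypothetical template-based generator; names, fields, and templates
# are illustrative stand-ins, not the real generation pipeline.
CONTACTS = ["Sam", "Priya", "Diego"]
DAYS = ["Monday", "Wednesday", "Friday"]

def generate_task(rng: random.Random):
    """Sample one (instruction, verifier) pair from a simple template."""
    contact, day = rng.choice(CONTACTS), rng.choice(DAYS)
    instruction = f"Move my 1:1 with {contact} to {day} and email them about the change."

    def verify(state: dict) -> bool:
        # 'state' is a snapshot of the simulated apps after the agent runs.
        moved = any(
            e["title"] == f"1:1 with {contact}" and e["day"] == day
            for e in state["calendar_events"]
        )
        notified = any(contact.lower() in e["to"].lower() for e in state["sent_emails"])
        return moved and notified

    return instruction, verify

rng = random.Random(0)
tasks = [generate_task(rng) for _ in range(100)]  # 100 instruction/verifier pairs
```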
Would love to talk with anyone who's interested in learning more!