PA Bench: Evaluating Frontier Models on Multi-Tab PA Tasks

https://vibrantlabs.com/blog/pa-bench
9•shahules•3h ago
We’re the team at Vibrant Labs (W24). We’ve been building environments for browser agents and quickly realized that existing benchmarks in this space didn’t capture the primary failure modes we were seeing in production, which became more frequent as the number of applications and the horizon length increased.

We built PA Bench (Personal Assistant Benchmark) to evaluate frontier computer/web use models on their ability to handle multi-step workflows across simulated clones of Gmail and Calendar.

*What’s next:*

We’re currently scaling the dataset to 3+ tabs and are building more high-fidelity simulations for common enterprise workflows. We’d love to hear feedback on the benchmark and notes about what was/wasn’t surprising about the results.

Blog post: https://vibrantlabs.com/blog/pa-bench

Comments

shahules•1h ago
Founder of Vibrant Labs here. We’re working on automating the synthesis of high-quality evals and RL data for LLM agents.

Some of the things we’re exploring:

1. Automated task and verifier generation

2. Synthesizing coherent worlds for evaluating and training agents

3. Continual learning setups for long-horizon agents

Would love to talk with anyone who's interested in learning more!

abhijithneil•1h ago
Is there a way to automate computer use with multiple computer-use agents from different providers, plus some sort of routing setup so the best course of action can be chosen without hitting failures (e.g., permission issues with OpenAI could be rerouted to Gemini)?
shahules•33m ago
There are a few agents, like browser-use and Skyvern, that may provide this capability.
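
For anyone wondering what the routing idea in the question might look like, here is a minimal sketch of ordered fallback across agents. Everything in it is hypothetical: `Provider`, `ProviderError`, `route_with_fallback`, and the two stand-in agents are illustrative names, not the API of browser-use, Skyvern, or any provider SDK. A real setup would wrap each vendor's client in the `run` callable.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


class ProviderError(Exception):
    """Hypothetical error a wrapped agent raises when it cannot finish a task."""


@dataclass
class Provider:
    name: str
    run: Callable[[str], str]  # takes a task description, returns a result string


def route_with_fallback(task: str, providers: List[Provider]) -> Tuple[str, str]:
    """Try providers in order and return (provider_name, result) for the first success."""
    errors = []
    for provider in providers:
        try:
            return provider.name, provider.run(task)
        except ProviderError as exc:
            # Record the failure and move on to the next provider.
            errors.append(f"{provider.name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def flaky_agent(task: str) -> str:
    # Stand-in for an agent that hits a permission issue.
    raise ProviderError("permission denied")


def working_agent(task: str) -> str:
    # Stand-in for an agent that completes the task.
    return f"done: {task}"


providers = [Provider("agent_a", flaky_agent), Provider("agent_b", working_agent)]
print(route_with_fallback("archive unread newsletters", providers))
```

This only covers the reroute-on-failure half of the question. Choosing the best agent up front would need some scoring signal per task, which this sketch does not attempt.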

Banned in California

https://www.bannedincalifornia.org/
52•pie_flavor•23m ago•19 comments

Jimi Hendrix was a systems engineer

https://spectrum.ieee.org/jimi-hendrix-systems-engineer
226•tintinnabula•3h ago•88 comments

Making MCP cheaper via CLI

https://kanyilmaz.me/2026/02/23/cli-vs-mcp.html
82•thellimist•3h ago•47 comments

First Website

https://info.cern.ch
14•shrikaranhanda•36m ago•1 comment

The Om Programming Language

https://www.om-language.com/
214•tosh•5h ago•41 comments

Bus stop balancing is fast, cheap, and effective

https://worksinprogress.co/issue/the-united-states-needs-fewer-bus-stops/
265•surprisetalk•7h ago•430 comments

Windows 11 Notepad to support Markdown

https://blogs.windows.com/windows-insider/2026/01/21/notepad-and-paint-updates-begin-rolling-out-...
140•andreynering•6h ago•276 comments

Show HN: Respectify – A comment moderator that teaches people to argue better

https://respectify.org/
59•vintagedave•9h ago•90 comments

Large-Scale Online Deanonymization with LLMs

https://simonlermen.substack.com/p/large-scale-online-deanonymization
163•DalasNoin•1d ago•149 comments

The First Fully General Computer Action Model

https://si.inc/posts/fdm1/
92•nee1r•2d ago•32 comments

Learnings from 4 months of Image-Video VAE experiments

https://www.linum.ai/field-notes/vae-reconstruction-vs-generation
51•schopra909•1d ago•8 comments

Dissecting the CPU-memory relationship in garbage collection (OpenJDK 26)

https://norlinder.nu/posts/GC-Cost-CPU-vs-Memory/
29•jonasn•1d ago•9 comments

Why every automaker is quietly bringing back the inline-six engine

https://carbuzz.com/why-automakers-bringing-back-the-inline-six-engine/
19•teleforce•3d ago•13 comments

Show HN: I ported Tree-sitter to Go

https://github.com/odvcencio/gotreesitter
165•odvcencio•5h ago•67 comments

Following 35% growth, solar has passed hydro on US grid

https://arstechnica.com/science/2026/02/final-2025-data-is-in-us-energy-use-is-up-as-solar-passes...
357•rbanffy•6h ago•288 comments

How to fold the Blade Runner origami unicorn (1996)

https://web.archive.org/web/20011104015933/www.linkclub.or.jp/~null/index_br.html
241•exvi•3d ago•34 comments

The Misuses of the University

https://www.publicbooks.org/the-misuses-of-the-university/
112•ubasu•7h ago•79 comments

Access to a Shared Unix Computer

http://tilde.club/
30•TigerUniversity•3d ago•8 comments

Trellis AI (YC W24) is hiring deployment lead to accelerate medication access

https://www.ycombinator.com/companies/trellis-ai/jobs/7ZlvQkN-lead-deployment-strategist
1•macklinkachorn•6h ago

GNU Texmacs

https://www.texmacs.org/tmweb/home/welcome.en.html
112•remywang•8h ago•41 comments

Devirtualization and Static Polymorphism

https://david.alvarezrosa.com/posts/devirtualization-and-static-polymorphism/
31•dalvrosa•4h ago•11 comments

Claude Code Remote Control

https://code.claude.com/docs/en/remote-control
467•empressplay•16h ago•271 comments

Never buy a .online domain

https://www.0xsid.com/blog/online-tld-is-pain
637•ssiddharth•10h ago•400 comments

Why isn't LA repaving streets?

https://lapublicpress.org/2026/02/why-isnt-la-repaving-streets/
83•speckx•6h ago•163 comments

Launch HN: TeamOut (YC W22) – AI agent for planning company retreats

https://app.teamout.com/ai
38•vincentalbouy•9h ago•46 comments

New accounts on HN more likely to use em-dashes

https://www.marginalia.nu/weird-ai-crap/hn/
557•todsacerdoti•9h ago•468 comments

Text-Based Google Directions

https://gdir.telae.net/
46•TigerUniversity•4d ago•15 comments

Danish government agency to ditch Microsoft software (2025)

https://therecord.media/denmark-digital-agency-microsoft-digital-independence
711•robtherobber•13h ago•363 comments

US orders diplomats to fight data sovereignty initiatives

https://www.reuters.com/sustainability/boards-policy-regulation/us-orders-diplomats-fight-data-so...
413•colinhb•8h ago•354 comments