PA Bench: Evaluating Frontier Models on Multi-Tab PA Tasks

https://vibrantlabs.com/blog/pa-bench
7•shahules•1h ago
We’re the team at Vibrant Labs (W24). We’ve been building envs for browser agents and quickly realized that existing benchmarks in this space didn’t capture the primary failure modes we were seeing in production (which scaled up as the number of applications and the horizon length increased).

We built PA Bench (Personal Assistant Benchmark) to evaluate frontier computer/web use models on their ability to handle multi-step workflows across simulated clones of Gmail and Calendar.

*What’s next:*

We’re currently scaling the dataset to 3+ tabs and are building more high-fidelity simulations for common enterprise workflows. We’d love to hear feedback on the benchmark and notes about what was/wasn’t surprising about the results.

Blog post: https://vibrantlabs.com/blog/pa-bench

Comments

shahules•13m ago
Founder of Vibrant Labs here. We’re working on automating the synthesis of high-quality evals and RL data for LLM agents.

Some of the things we’re exploring:

1. Automated task and verifier generation

2. Synthesizing coherent worlds for evaluating and training agents

3. Continual learning setups for long-horizon agents

Would love to talk with anyone who's interested in learning more!

'Probably' doesn't mean the same thing to your AI as it does to you

https://theconversation.com/probably-doesnt-mean-the-same-thing-to-your-ai-as-it-does-to-you-275626
1•colinprince•55s ago•0 comments

Yxorp.app – Reverse proxy for personal websites on home WiFi

https://yxorp.app/auth/signin
1•freshman_dev•2m ago•0 comments

SFQ: Simple, Stateless, Stochastic Fairness

https://brooker.co.za/blog/2026/02/25/sfq.html
1•shayonj•2m ago•0 comments

The perks of being a mole rat

https://worksinprogress.co/issue/the-perks-of-being-a-mole-rat/
1•paulpauper•4m ago•0 comments

The Tax Nerd Who Bet His Life Savings Against DOGE

https://www.wsj.com/finance/investing/the-tax-nerd-who-bet-his-life-savings-against-doge-6b59eda2
1•igonvalue•4m ago•0 comments

Remarkable reusable liquid stores solar energy like bottled sunlight

https://newatlas.com/energy/molecular-solar-thermal-energy-storage-liquid/
1•westurner•4m ago•1 comments

The x402 Service Discovery – runtime endpoint finder for the agent economy

https://x402-discovery-api.onrender.com
1•rpl_ryan•4m ago•1 comments

The gold plating of American water

https://worksinprogress.co/issue/the-gold-plating-of-american-water/
1•paulpauper•5m ago•0 comments

Hardworking teams still miss the goal

https://medium.com/@PZBird/team-culture-what-holds-a-team-together-when-processes-fall-apart-9e5c...
1•PZBird•5m ago•0 comments

Jane Street faces claims of insider trading that sped up Terraform's collapse

https://www.coindesk.com/markets/2026/02/24/jane-street-faces-claims-of-insider-trading-that-sped...
1•paulpauper•6m ago•0 comments

Polsia: AI That Runs Your Company

https://polsia.com
1•seyz•6m ago•0 comments

The Peace Corps is recruiting volunteers to sell AI to developing nations

https://www.theverge.com/policy/884625/peace-corps-tech-promote-american-ai
1•toomuchtodo•7m ago•1 comments

Body Futurism

https://writing.tobyshorin.com/body-futurism/
1•firloop•7m ago•0 comments

Show HN: Can we simplify front end again? Meet DynamoJS

https://dynamojs.pages.dev/docs
1•novateg•7m ago•0 comments

Best unrestricted AI video tools?

https://unbound.video
1•gabrieln•10m ago•1 comments

Show HN: Naperville Library Spy

https://www.juandavidcampolargo.com/projects/naperlibspy
1•jdcampolargo•12m ago•0 comments

Yabai: A tiling window manager for macOS based on binary space partitioning

https://github.com/asmvik/yabai
1•fanf2•14m ago•0 comments

Diamond owl swoops in with new method to keep electronics cool

https://news.rice.edu/news/2026/diamond-owl-swoops-new-method-keep-electronics-cool
2•westurner•14m ago•1 comments

The cartography of reason

https://www.samrith.dev/blog/the-cartography-of-reason/
1•samrith•15m ago•0 comments

Show HN: Cosmos-Reason2-2B on Nano Super

https://huggingface.co/embedl/Cosmos-Reason2-2B-W4A16-Edge2
1•vottivott•16m ago•0 comments

Connect your AI agent to every chat platform

https://github.com/pantalk/pantalk
1•_pdp_•16m ago•0 comments

Software companies buying software: a story of ecosystems and vendors

https://erikbern.com/2026/02/25/software-companies-buying-software-from-software-companies.html
1•articsputnik•17m ago•0 comments

Engineering heat-tolerant, high-yield rice for a warming planet

https://phys.org/news/2026-02-tolerant-high-yield-rice-planet.html
3•PaulHoule•18m ago•0 comments

Kalshi suspends users for insider trading

https://www.axios.com/2026/2/25/kalshi-insider-trading-suspension
3•upmind•20m ago•1 comments

Hoot v0.8 released: new REPL enabling Scheme live coding in the browser

https://spritely.institute/news/hoot-0-8-0-released.html
2•latinodev•20m ago•0 comments

Trending Next.js Packages

https://www.stacktco.com/js/ecosystems/nextjs/trends
1•matwiemann•21m ago•0 comments

A Chinese official's use of ChatGPT revealed a global intimidation operation

https://www.cnn.com/2026/02/25/politics/chatgpt-china-intimidation-operation
5•breve•22m ago•1 comments

CSS is too powerful now [video]

https://www.youtube.com/watch?v=Y-3tPDZCk2o
2•rasso•24m ago•0 comments

It's Not Magic, It's Metapragmatic: Memetics Through the Lens of Semiotics

https://sublius.substack.com/p/its-not-magic-its-metapragmatic-memetics
1•spacebacon•26m ago•0 comments

Show HN: Opty – A Zig-based HDC that reduces token use by up to 90%

https://github.com/boj/opty
3•bojo•27m ago•1 comments