PA Bench: Evaluating Frontier Models on Multi-Tab PA Tasks

https://vibrantlabs.com/blog/pa-bench

6•shahules•1h ago

Comments

shahules•1h ago

Most current web agent benchmarks focus on single-tab tasks (e.g., 'go to Gmail and star this email'). We found that frontier models that score highly on those tasks (like in WebArena) often fall apart when they have to coordinate context across 2+ applications. We built a simulated environment with scenarios and deterministic verifiers to see why.
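For readers unfamiliar with the term, here is a rough sketch of what a deterministic verifier for a cross-application task could look like. It is an illustration only, not the benchmark's actual code: the `SimState`, `Email`, and `Event` types, the `verify_meeting_scheduled` function, and the example task are all hypothetical. The idea is that the check inspects the final state of both simulated apps and passes only if every required condition holds, so the same final state always gets the same verdict.

```python
from dataclasses import dataclass, field


@dataclass
class Email:
    sender: str
    subject: str
    starred: bool = False


@dataclass
class Event:
    title: str
    date: str  # ISO date string, e.g. "2026-03-02"
    attendees: frozenset = frozenset()


@dataclass
class SimState:
    """Hypothetical snapshot of both simulated apps after the agent finishes."""
    inbox: list = field(default_factory=list)
    calendar: list = field(default_factory=list)


def verify_meeting_scheduled(state: SimState, sender: str, date: str) -> bool:
    """Pass only if the email from `sender` is starred AND a calendar event
    on `date` lists that sender as an attendee."""
    email_ok = any(e.sender == sender and e.starred for e in state.inbox)
    event_ok = any(
        ev.date == date and sender in ev.attendees for ev in state.calendar
    )
    return email_ok and event_ok


# A run that finished both halves of the task.
done = SimState(
    inbox=[Email("ana@example.com", "Sync next week?", starred=True)],
    calendar=[Event("Sync", "2026-03-02", frozenset({"ana@example.com"}))],
)
# A run that handled the email tab but never created the calendar event.
half_done = SimState(inbox=done.inbox, calendar=[])

print(verify_meeting_scheduled(done, "ana@example.com", "2026-03-02"))
print(verify_meeting_scheduled(half_done, "ana@example.com", "2026-03-02"))
```

The second run fails even though the email half is correct, which is the kind of partial completion a single-tab check would not catch.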

shahules•1h ago

We’re the team at Vibrant Labs (W24). We’ve been building envs for browser agents and quickly realized that existing benchmarks in this space didn’t capture the primary failure modes we were seeing in production, which got worse as the number of applications and the horizon length increased.

We built PA Bench (Personal Assistant Benchmark) to evaluate frontier computer/web use models on their ability to handle multi-step workflows across simulated clones of Gmail and Calendar.

*What’s next:*

We’re currently scaling the dataset to 3+ tabs and are building more high-fidelity simulations for common enterprise workflows. We’d love to hear feedback on the benchmark and notes about what was/wasn’t surprising about the results.

Blog post: https://vibrantlabs.com/blog/pa-bench

Show HN: ClawShield – Open-source firewall for agent-to-agent AI communication

https://github.com/DEFNOISE-AI/ClawShield

1•Joe_DNAI•43s ago•0 comments

Core French 5/6 A

https://classroom.google.com/c/ODAzNDY3NjUyMzUx

1•iiipoi•51s ago•0 comments

Why AI "taking our jobs" is the best thing that could happen to us

https://thecognitiveshift.com/publications/let-them-run/why-ai-taking-our-jobs-is-the-best-thing/

1•virtual_rf•1m ago•0 comments

Private Marketplace via DHT Broadcast and P2P Quotes

https://bitcoin-zero-down-2ea152.gitlab.io/gallery/gallery-item-neg-880/

1•machardmachard•1m ago•1 comments

MIT's Missing Semester Features Agentic Coding

https://missing.csail.mit.edu/2026/agentic-coding/

1•kurinikku•2m ago•0 comments

Designed to be specialists

https://aworkinglibrary.com/writing/designed-to-be-specialists

1•MindGods•2m ago•0 comments

Japan's health ministry panel endorses 2 iPS cell-derived products

https://www3.nhk.or.jp/nhkworld/en/news/20260219_21/

1•e12e•2m ago•1 comments

Malicious NPM Package Hides Pulsar .NET Malware Inside PNG Images

https://www.veracode.com/blog/malicious-npm-package-hiding-in-plain-pixels/

1•SamHoustonCM•2m ago•1 comments

Show HN: TWFF – A container format for declaring AI use in writing

https://github.com/Functional-Intelligence-Research-Lab/TWFF-Spec

1•normanbell•3m ago•0 comments

Family deepfakes help people celebrate and grieve in India

https://restofworld.org/2026/ai-deepfakes-grief-celebrations-india/

1•NDAjam•3m ago•0 comments

Product Engineer – A list of resources for aspiring Product Engineers

https://github.com/marcelkalveram/awesome-product-engineer

1•marcelkalveram•5m ago•0 comments

Google Translate – Google Search

https://www.google.com/search?q=google+translate&rlz=1CAKLUN_enCA1180&oq=go&gs_lcrp=EgZjaHJvbWUqD...

1•iiipoi•6m ago•0 comments

Show HN: Astroworld – A universal N-body gravity engine in Python

https://github.com/salinas2000/astroworld

1•salinas00•6m ago•0 comments

Choose Optimism (2023)

https://stephango.com/optimism

1•Sir_Twist•7m ago•0 comments

On-Board Charger, Wireless Charging and Auxiliary Power Topologies for EVs

https://www.mdpi.com/1996-1073/19/3/689

1•PaulHoule•8m ago•0 comments

Designing Data-Intensive Applications 2nd Edition is heading to print

https://bsky.app/profile/martin.kleppmann.com/post/3mf4wvtjg7s25

1•kurinikku•8m ago•0 comments

Stop Thinking of AI as a Coworker. It's an Exoskeleton

https://www.kasava.dev/blog/ai-as-exoskeleton

2•benbeingbin•9m ago•0 comments

I'm Not Reading That

https://karldaniel.co.uk/im-not-reading-that/

1•speckx•10m ago•0 comments

The Many Meanings of "Stack": From Data Structures, VMs, to Calling Conventions

https://ezzeriesa.notion.site/The-many-meanings-of-stack-bc768cb186714b579547b7b8681ee32f

1•kurinikku•11m ago•0 comments

Kumo: Cloudflare's UI Component Library

https://kumo-ui.com/

3•mmarian•11m ago•0 comments

Minions: Stripe's one-shot, end-to-end coding agents–Part 2

https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-coding-agents-part-2

2•ains•12m ago•0 comments

Show HN: Inconvo – open-source chat-with-data agent that doesn't generate SQL

https://github.com/inconvoai/inconvo

1•ogham•12m ago•0 comments

Reassessing Spinosaurus: New Fossils and the Aquatic Debate

https://comuniq.xyz/post?t=818

2•01-_-•12m ago•0 comments

Show HN: Ghost OS – Let AI agents use your Mac, not just the terminal

https://github.com/ghostwright/ghost-os

1•mcheemaa•12m ago•0 comments

The Clock Has Run Out on Stablecoin Ambiguity

https://thefutureofmoney.substack.com/p/the-clock-has-run-out-on-stablecoin

2•futureofmoney•12m ago•0 comments

China Robots

https://www.newsweek.com/china-killer-robots-unitree-robotics-1917569

2•aversivet•12m ago•2 comments

40k param model beats Yolo26n (at least for small objects)

https://one-ware.com/docs/one-ai/demos/tennis-ball-demo/

1•lebeier•14m ago•0 comments

How AI is reshaping developer choice (and Octoverse data proves it)

https://github.blog/ai-and-ml/generative-ai/how-ai-is-reshaping-developer-choice-and-octoverse-da...

1•mikece•14m ago•0 comments

Show HN: Git worktree manager for Niri (Wayland compositor)

https://github.com/nskha101/niri-worktree-management

2•nithiiyan25•15m ago•0 comments

Show HN: I created a webapp to track the latest OpenClaw news

https://www.lobstersauce.news/

1•Tjerkienator•15m ago•0 comments