Agent Reading Test

28•kaycebasques•2h ago

https://dacharycarey.com/2026/04/06/designing-agent-reading-...

Comments

kaycebasques•2h ago

dang•1h ago

Thanks! We'll put this in the toptext as well.

dostick•1h ago

The tests should have negative weights based on how often each issue is encountered and how much impact it has. Test 2 (SPA) should carry something like 8 negative points out of 10, as the most common blocker. And the whole test should use an inverse score.
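A minimal sketch of the penalty-style scoring suggested here, assuming each failed test subtracts a weight on a 0 to 10 scale. The test names and every weight except the 8 for SPA are hypothetical, and this is not how the actual Agent Reading Test scores results.

```python
# Hypothetical penalty weights (0-10). Only the SPA value of 8 comes from
# the comment above; the other names and numbers are illustrative.
PENALTY_WEIGHTS = {
    "spa": 8,
    "conneg-md": 3,
    "truncation": 5,
}


def penalty_score(failed_tests):
    """Sum the penalties of the failed tests; 0 is a perfect run."""
    return sum(PENALTY_WEIGHTS[name] for name in failed_tests)


def normalized_penalty(failed_tests):
    """Penalty as a fraction of the worst possible run (0.0 to 1.0)."""
    return penalty_score(failed_tests) / sum(PENALTY_WEIGHTS.values())
```

With an inverse score like this, lower is better: an agent that fails only the SPA test gets a penalty of 8, while one that fails only the hypothetical `conneg-md` test gets 3.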

massimoto•50m ago

Would love to see some results for different providers. The tests look super logically thought out, but could use a TL;DR (too lazy; didn't run) output.

Claude Web Opus 4.6 Extended: 14 / 20 points

x:CANARY-SPA-JSONLY-prism x:CANARY-CONNEG-MD-sigma

theyCallMeSwift•38m ago

I love this idea, but have a hypothesis that 90% of agents that people actually use today would fail this test inadvertently (false negative).

Industry best practice + standard implementation for most agents right now is to do web browsing / fetching via subagents. Their output is summarized using a cheaper model and then passed back to the parent. Unless the actual content the subagents see is preserved, it's very unlikely that the `CANARY-` strings would be found in the output.

Any thoughts on how you'd change the test structure with this in mind?

dacharyc•4m ago

Hey there - I'm the test author, and you've hit on one of the main points. For the summarization/relevance-based content return, this is a consideration for some of the agent platforms (although I've found others actually do better here than I expected!) - which is part of the point I'm trying to drive home to folks who aren't as familiar with these systems.

I chose to structure it this way intentionally because this is the finding. Most people are surprised that agents aren't 'seeing' everything that's there, and get frustrated when an agent says something isn't there when it clearly is. Raising awareness of this is one of the main points of the exercise, to me.
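The summarization effect discussed in this thread can be sketched as a simple set difference between the canaries on the fetched page and those that survive into the text the parent agent receives. This is only an illustration: the token pattern is inferred from the two `CANARY-` strings quoted above, and the function names are hypothetical, not part of the test itself.

```python
import re

# Assumed token shape, inferred from CANARY-SPA-JSONLY-prism and
# CANARY-CONNEG-MD-sigma; real tokens may follow a different pattern.
CANARY_RE = re.compile(r"CANARY-[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def find_canaries(text):
    """Return the set of canary tokens that appear in the text."""
    return set(CANARY_RE.findall(text))


def lost_in_summary(raw_page, summary):
    """Canaries present in the raw page but missing from the summary."""
    return find_canaries(raw_page) - find_canaries(summary)
```

A non-empty result from `lost_in_summary` would mean the subagent fetched content that its summary then dropped, which looks identical, from the parent's side, to content that was never fetched.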
