
Compressed Agents.md > Agent Skills

https://vercel.com/blog/agents-md-outperforms-skills-in-our-agent-evals
78•maximedupre•10h ago

Comments

ares623•2h ago
2 months later: "Anthropic introduces 'Claude Instincts'"
EnPissant•2h ago
This is confusing.

TFA says they added an index to Agents.md that told the agent where to find all documentation and that was a big improvement.

The part I don't understand is that this is exactly how I thought skills worked. The short descriptions are given to the model up front, and it can then request the full documentation as needed. With skills this is called "progressive disclosure".

Maybe they used more effective short descriptions in the AGENTS.md than they did in their skills?
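The progressive disclosure EnPissant describes matches the documented SKILL.md layout: only the YAML frontmatter (name and description) is loaded up front, and the agent reads the body on demand. A minimal sketch, with an invented skill name and invented instructions:

```markdown
---
name: vercel-docs
description: Look up Vercel platform documentation. Use when a task touches deployments, routing, or environment configuration.
---

# Vercel docs lookup

Full instructions, read only after the agent decides to invoke the skill:
1. Locate the relevant page under ./docs/.
2. Quote the exact section rather than paraphrasing from memory.
```

Whether the skill ever fires hinges entirely on how well that one description line matches the task at hand, which is exactly the failure mode the article measured.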

NitpickLawyer•2h ago
The reported tables also don't match the screenshots. And their baseline and test results are too close to tell apart (judging by the screenshots, not the tables): 29/33 baseline, 31/33 skills, 32/33 skills + "use skill" prompt, 33/33 AGENTS.md.
sally_glance•1h ago
I also thought this is how skills work, but in practice I experienced similar issues. The agents I'm using (Gemini CLI, Opencode, Claude) all seem to have trouble activating skills on their own unless explicitly prompted. Yeah, probably this will be fixed over the next couple of generations but right now dumping the documentation index right into the agent prompt or AGENTS.md works much better for me. Maybe it's similar to structured output or tool calls which also only started working well after providers specifically trained their models for them.
tottenhm•2h ago
> In 56% of eval cases, the skill was never invoked. The agent had access to the documentation but didn't use it.

The agent passes the Turing test...

pietz•1h ago
Isn't it obvious that an agent will do better if it internalizes the knowledge about something instead of merely having the option to request it?

Skills are new. Models haven't been trained on them yet. Give it 2 months.

WA•1h ago
Not so obvious, because the model still needs to look up the required doc. Unfortunately, the article glosses over this detail a bit. The model needs to decide when to use a skill, but doesn't it also need to decide when to look up documentation instead of relying on pretraining data?
sothatsit•1h ago
I believe the skills would contain the documentation. It would have been nice for them to give more information on the granularity of the skills they created though.
velcrovan•1h ago
Removing the skill does remove a level of indirection.

It's a difference of "choose whether or not to make use of a skill that would THEN attempt to find what you need in the docs" vs. "here's a list of everything in the docs that you might need."
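The "list of everything" alternative can be sketched as a plain AGENTS.md index; the paths and topics here are invented for illustration:

```markdown
## Docs index
- docs/routing.md — rewrites, redirects, middleware matchers
- docs/env.md — environment variables, secrets, runtime config
- docs/deploy.md — build settings, output, rollbacks
```

The agent never has to decide whether to consult an intermediary skill; it only decides which file to open.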

rao-v•1h ago
In a month or three we'll have the sensible approach: smaller, cheaper, faster models optimized for looking at a query and identifying which skills/context to provide in full to the main model.

It’s really silly to waste big model tokens on throat clearing steps
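rao-v's router idea can be sketched in a few lines. This toy version scores each skill's short description against the query and loads only the winners; the skill names and descriptions are invented, and a real router would be a small LLM rather than keyword overlap:

```python
# Toy "router" step: score each skill's short description against the
# user query; only the top-scoring skills' full docs get loaded into the
# big model's context. Skill names/descriptions are made up for illustration.

SKILLS = {
    "db-migrations": "create and run database schema migrations",
    "deploy": "deploy the app, configure builds and rollbacks",
    "routing": "rewrites, redirect routes and middleware matchers",
}

def route(query: str, top_k: int = 1) -> list[str]:
    """Pick the skills whose descriptions share the most words with the query."""
    q = set(query.lower().split())
    scored = sorted(
        SKILLS,
        key=lambda name: len(q & set(SKILLS[name].lower().split())),
        reverse=True,
    )
    return scored[:top_k]

print(route("set up a redirect for old blog routes"))
```

The point is that this selection step costs almost nothing compared to letting the big model read every skill body on every request.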

Calavar•1h ago
I thought most of the major AI programming tools were already doing this. Isn't this what subagents are in Claude code?
MillionOClock•1h ago
I don't know about Claude Code, but in GitHub Copilot, as far as I can tell, the subagents are always the same model as the main one you are using. They also need to be started manually by the main agent in many cases, whereas the parent comment was maybe referring to calling them more deterministically?
jryan49•1h ago
Something that I always wonder with each blog post comparing different types of prompt engineering is did they run it once, or multiple times? LLMs are not consistent for the same task. I imagine they realize this of course, but I never get enough details of the testing methodology.
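To make that concrete: with the 33-case eval size quoted elsewhere in the thread, a single run leaves wide error bars around any pass rate. A sketch using the standard normal-approximation confidence interval (the 29/33 and 31/33 figures are the screenshot numbers quoted above; the statistics are textbook, not from the article):

```python
import math

def pass_rate_ci(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% confidence interval for a pass rate."""
    p = passed / total
    se = math.sqrt(p * (1 - p) / total)
    return (max(0.0, p - z * se), min(1.0, p + z * se))

# 29/33 baseline vs 31/33 with skills: the intervals overlap heavily,
# so a single run cannot distinguish the two configurations.
print(pass_rate_ci(29, 33))
print(pass_rate_ci(31, 33))
```

Running each configuration many times (or using far more eval cases) is the only way to separate results this close.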
only-one1701•1h ago
This drives me absolutely crazy. Non-falsifiable and non-deterministic results. All of this stuff is (at best) anecdotes and vibes being presented as science and engineering.
bluGill•1h ago
That is my experience. Sometimes the LLM gives good results, sometimes it does something stupid. You tell it what to do, and like a stubborn 5-year-old it ignores you. Even after it tries its own way and fails, it will do what you told it for a while and then go back to the thing that doesn't work.
sothatsit•1h ago
This seems like an issue that will be fixed in newer model releases that are better trained to use skills.
thom•1h ago
You need the model to interpret documentation as policy you care about (in which case it will pay attention) rather than as something it can look up if it doesn’t know something (which it will never admit). It helps to really internalise the personality of LLMs as wildly overconfident but utterly obsequious.
smcleod•1h ago
Sounds like they've been using skills incorrectly if they're finding their agents don't invoke the skills. I have Claude Code agents calling my skills frequently, almost every session. You need to make sure your skill descriptions are well defined and describe when to use them and that your tasks / goals clearly set out requirements that align with the available skills.
velcrovan•1h ago
I think if you read it, their agents did invoke the skills and they did find ways to increase the agents' use of skills quite a bit. But the new approach works 100% of the time as opposed to 79% of the time, which is a big deal. Skills might be working OK for you at that 79% level and for your particular codebase/tool set, that doesn't negate anything they've written here.
jgbuddy•1h ago
Am I missing something here?

Obviously, directly including context in something like a system prompt will put it in context 100% of the time. You could just as easily take all of an agent's skills, feed them to the agent (in a system prompt or similar), and it would follow the instructions more reliably.

However, at a certain point you have to use skills, because including everything in the context every time is wasteful, or not possible. This is the same reason Anthropic is doing advanced tool use (ref: https://www.anthropic.com/engineering/advanced-tool-use): there's not enough context to straight-up include everything.

It's all a context/price trade-off. Obviously, if you have the context budget, just include what you can directly (in this case, compressed into an AGENTS.md).
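The trade-off is simple arithmetic. A sketch with made-up token counts (none of these numbers come from the article):

```python
# Always-in-context vs on-demand skill loading, in prompt tokens per request.
# All figures are illustrative assumptions, not measurements.
full_docs_tokens = 4000   # everything inlined into AGENTS.md
index_tokens = 300        # compressed index / skill descriptions only
doc_fetch_tokens = 1500   # one doc body loaded on demand
fetch_rate = 0.4          # fraction of requests that actually need a doc

always_inline = full_docs_tokens
on_demand = index_tokens + fetch_rate * doc_fetch_tokens
print(always_inline, on_demand)
```

On-demand wins on cost whenever the index plus the expected fetch is smaller than the inlined docs; inlining wins on reliability because nothing depends on the model deciding to fetch.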

orlandohohmeier•1h ago
I’ve been using symlinked agent files for about a year as a hacky workaround, before skills became a thing, to load additional “context” for different tasks, and it might actually address the issue you’re talking about. Honestly, it’s worked so well for me that I haven’t really felt the need to change it.
observationist•1h ago
This is one of the reasons the RLM methodology works so well. You have access to as much information as you want in the overall environment, but only the things relevant to the task at hand get put into context for the current task, and it shows up there 100% of the time, as opposed to lossy "memory" compaction and summarization techniques, or probabilistic agent skills implementations.

Having an agent manage its own context ends up being extraordinarily useful, on par with the leap from non-reasoning to reasoning chats. There are still issues with memory and integration, and other LLM weaknesses, but agents are probably going to get extremely useful this year.

jstummbillig•25m ago
> Obviously directly including context in something like a system prompt will put it in context 100% of the time.

How do you suppose skills get announced to the model? It's all in the context in some way. The interesting part here is: just (relatively naively) compressing stuff into the AGENTS.md seems to work "better" than however skills are implemented out of the box, for this use case.

cortesoft•9m ago
Isn't the difference that a skill means you just have to add the script name and explanation to the context instead of the entire script plus the explanation?
verdverm•6m ago
You aren't wrong, you really want a bit of both.

1. You absolutely want to force certain context in, no questions or non-determinism asked (index and sparknotes). This can be done conditionally, but still rule based on the files accessed and other "context"

2. You want to keep it clean and only provide useful context as necessary (skills, search, MCP; and really an explore/query/compress mechanism around all of this; Ralph Wiggum is one example)
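Point 1, deterministic rule-based injection, can be sketched directly: if a touched file matches a rule, its note is forced into the prompt, with no model decision involved. The paths and notes here are invented:

```python
import fnmatch

# Rules mapping file patterns to context that MUST be injected when
# matching files are touched -- no model discretion. Illustrative only.
RULES = [
    ("db/migrations/*.sql", "Migrations must be reversible; include a down step."),
    ("api/*.ts", "All handlers must validate input with the shared schema."),
]

def forced_context(touched_files: list[str]) -> list[str]:
    """Return every note whose pattern matches a touched file."""
    notes = []
    for pattern, note in RULES:
        if any(fnmatch.fnmatch(f, pattern) for f in touched_files):
            notes.append(note)
    return notes

print(forced_context(["db/migrations/0042_add_users.sql"]))
```

Because the matching is rule-based, this path is 100% reliable by construction, unlike skill invocation.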

thorum•1h ago
The article presents AGENTS.md as something distinct from Skills, but it is actually a simplified instance of the same concept. Their AGENTS.md approach tells the AI where to find instructions for performing a task. That’s a Skill.

I expect the benefit is from better Skill design, specifically, minimizing the number of steps and decisions between the AI’s starting state and the correct information. Fewer transitions -> fewer chances for error to compound.
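The "fewer transitions" point is just compounded probability: if each decision point succeeds with probability p, a chain of n decisions succeeds with p**n. A tiny sketch with an assumed p = 0.95:

```python
# Per-step reliability p is an assumption; the article gives no such figure.
p = 0.95
for steps in (1, 2, 3):
    print(steps, round(p ** steps, 3))
```

Collapsing "decide to use a skill, then decide which doc to fetch" into a single "open this file" step removes one multiplication from that chain.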

verdverm•2m ago
Yea, I am now separating them based on

1. Those I force into the system prompt using rules based systems and "context"

2. Those I let the agent lookup or discover

CjHuber•1h ago
That feels like a stupid article. Well, of course if you have one single thing you want to optimize, putting it into AGENTS.md is better. But the advantage of skills is exactly that you don't cram them all into the AGENTS file. Say you had three different elaborate things you want the agent to do: good luck putting them all in your AGENTS.md and later hoping the agent remembers any of it. After all, the key advantage of skills is that they get loaded at the end of the context when needed.
sheepscreek•54m ago
It seems their tests rely on Claude alone. It’s not safe to assume that Codex or Gemini will behave the same way as Claude. I use all three and each has its own idiosyncrasies.
newzino•42m ago
The compressed agents.md approach is interesting, but the comparison misses a key variable: what happens when the agent needs to do something outside the scope of its instructions?

With explicit skills, you can add new capabilities modularly - drop in a new skill file and the agent can use it. With a compressed blob, every extension requires regenerating the entire instruction set, which creates a versioning problem.

The real question is about failure modes. A skill-based system fails gracefully when a skill is missing - the agent knows it can't do X. A compressed system might hallucinate capabilities it doesn't actually have because the boundary between "things I can do" and "things I can't" is implicit in the training rather than explicit in the architecture.

Both approaches optimize for different things. Compressed optimizes for coherent behavior within a narrow scope. Skills optimize for extensibility and explicit capability boundaries. The right choice depends on whether you're building a specialist or a platform.

jstummbillig•23m ago
Why could you not have a combination of both?
delduca•35m ago
Ah nice… vercel is vibecoded
heliumtera•10m ago
web people opted into react, dude. that says a lot.

they used prisma to handle their database interactions. they preached tRPC and screamed TYPE SAFETY!!!

you really think these guys will ever again touch the keyboard to program? they despise programming.

BenoitEssiambre•29m ago
Wouldn't this have been more readable with a \n newline instead of a pipe character as a separator? It wouldn't have made the prompt longer.
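The length claim is easy to check: both separators are a single character, so swapping them is byte-neutral (a tokenizer may count them slightly differently, which isn't checked here). The index content below is invented for illustration:

```python
piped = "routing: rewrites, redirects|env: variables, secrets|deploy: builds"
newlined = piped.replace("|", "\n")
print(len(piped), len(newlined))  # identical lengths
```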
ChrisArchitect•25m ago
Title is: AGENTS.md outperforms skills in our agent evals
heliumtera•19m ago
you are telling me that a markdown saying:

*You are the Super Duper Database Master Administrator of the Galaxy*

does not improve the model's ability to reason about databases?

verdverm•12m ago
This largely mirrors my experience building my custom agent

1. Start from the extracted Claude Code instructions; they have many things like this in there. Their knowledge sharing in docs and blog posts on this topic is bar none

2. Use AGENTS.md as a table of contents and sparknotes, put them everywhere, load them automatically

3. Have topical markdown files / skills

4. Make great tools, this is still opaque in my mind to explain, lots of overlap with MCP and skills, conceptually they are the same to me

5. Iterate, experiment, do weird things, and have fun!

I changed read/write_file to put file contents in the state and present them in the system prompt, same for the AGENTS.md. Now I'm working on evals to show how much better this is, because anecdotally, it kicks ass
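Point 2 above ("put them everywhere, load them automatically") is usually implemented as a walk up the directory tree, the way Claude Code treats nested instruction files, generalized here to AGENTS.md. A sketch:

```python
from pathlib import Path

def collect_agents_md(start: Path) -> list[Path]:
    """Gather AGENTS.md files from the working dir up to the filesystem root,
    returned root-most first so broader notes come before more specific ones."""
    found = []
    for d in [start, *start.parents]:
        candidate = d / "AGENTS.md"
        if candidate.is_file():
            found.append(candidate)
    return list(reversed(found))
```

Concatenating the results in that order lets a deeply nested AGENTS.md refine or override the repo-level one.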

Project Genie: Experimenting with infinite, interactive worlds

https://blog.google/innovation-and-ai/models-and-research/google-deepmind/project-genie/
380•meetpateltech•6h ago•190 comments

PlayStation 2 Recompilation Project Is Absolutely Incredible

https://redgamingtech.com/playstation-2-recompilation-project-is-absolutely-incredible/
162•croes•4h ago•60 comments

Claude Code daily benchmarks for degradation tracking

https://marginlab.ai/trackers/claude-code/
487•qwesr123•9h ago•252 comments

Grid: Forever free, local-first, browser-based 3D printing/CNC/laser slicer

https://grid.space/stem/
19•cyrusradfar•45m ago•1 comments

Drug trio found to block tumour resistance in pancreatic cancer

https://www.drugtargetreview.com/news/192714/drug-trio-found-to-block-tumour-resistance-in-pancre...
188•axiomdata316•7h ago•90 comments

Flameshot

https://github.com/flameshot-org/flameshot
79•OsrsNeedsf2P•3h ago•33 comments


Launch HN: AgentMail (YC S25) – An API that gives agents their own email inboxes

99•Haakam21•6h ago•120 comments

Where to Sleep in LAX

https://cadence.moe/blog/2025-12-30-where-to-sleep-in-lax
19•surprisetalk•6d ago•6 comments

Retiring GPT-4o, GPT-4.1, GPT-4.1 mini, and OpenAI o4-mini in ChatGPT

https://openai.com/index/retiring-gpt-4o-and-older-models/
117•rd•2h ago•163 comments

The Value of Things

https://journal.stuffwithstuff.com/2026/01/24/the-value-of-things/
43•vinhnx•4d ago•19 comments

Cutting Up Curved Things (With Math)

https://campedersen.com/tessellation
6•ecto•49m ago•0 comments

County pays $600k to pentesters it arrested for assessing courthouse security

https://arstechnica.com/security/2026/01/county-pays-600000-to-pentesters-it-arrested-for-assessi...
233•MBCook•4h ago•121 comments

A lot of population numbers are fake

https://davidoks.blog/p/a-lot-of-population-numbers-are-fake
218•bookofjoe•9h ago•206 comments

Is the RAM shortage killing small VPS hosts?

https://www.fourplex.net/2026/01/29/is-the-ram-shortage-killing-small-vps-hosts/
90•neelc•7h ago•123 comments

Waymo robotaxi hits a child near an elementary school in Santa Monica

https://techcrunch.com/2026/01/29/waymo-robotaxi-hits-a-child-near-an-elementary-school-in-santa-...
258•voxadam•9h ago•458 comments

Show HN: Kolibri, a DIY music club in Sweden

https://kolibrinkpg.com/
25•EastLondonCoder•7h ago•7 comments

The WiFi only works when it's raining (2024)

https://predr.ag/blog/wifi-only-works-when-its-raining/
17•epicalex•2h ago•3 comments

Reflex (YC W23) Senior Software Engineer Infra

https://www.ycombinator.com/companies/reflex/jobs/Jcwrz7A-lead-software-engineer-infra
1•apetuskey•6h ago

EmulatorJS

https://github.com/EmulatorJS/EmulatorJS
80•avaer•6d ago•11 comments

My Mom and Dr. DeepSeek (2025)

https://restofworld.org/2025/ai-chatbot-china-sick/
109•kieto•4h ago•72 comments

How to choose colors for your CLI applications (2023)

https://blog.xoria.org/terminal-colors/
139•kruuuder•8h ago•79 comments

Box64 Expands into RISC-V and LoongArch territory

https://boilingsteam.com/box64-expands-into-risc-v-and-loong-arch-territory/
29•ekianjo•4d ago•2 comments

Deep dive into Turso, the "SQLite rewrite in Rust"

https://kerkour.com/turso-sqlite
93•unsolved73•8h ago•88 comments

Run Clawdbot/Moltbot on Cloudflare with Moltworker

https://blog.cloudflare.com/moltworker-self-hosted-ai-agent/
129•ghostwriternr•8h ago•45 comments

The Hallucination Defense

https://niyikiza.com/posts/hallucination-defense/
33•niyikiza•3h ago•80 comments

US cybersecurity chief leaked sensitive government files to ChatGPT: Report

https://www.dexerto.com/entertainment/us-cybersecurity-chief-leaked-sensitive-government-files-to...
364•randycupertino•7h ago•189 comments

AI's impact on engineering jobs may be different than expected

https://semiengineering.com/ais-impact-on-engineering-jobs-may-be-different-than-initial-projecti...
74•rbanffy•5h ago•134 comments

Apple buys Israeli startup Q.ai

https://techcrunch.com/2026/01/29/apple-buys-israeli-startup-q-ai-as-the-ai-race-heats-up/
77•ishener•2h ago•28 comments

Usenet personality

https://en.wikipedia.org/wiki/Usenet_personality
60•mellosouls•3d ago•28 comments