AGENTS.md outperforms skills in our agent evals

https://vercel.com/blog/agents-md-outperforms-skills-in-our-agent-evals

521•maximedupre•1w ago

Comments

ares623•1w ago

2 months later: "Anthropic introduces 'Claude Instincts'"

EnPissant•1w ago

This is confusing.

TFA says they added an index to Agents.md that told the agent where to find all documentation and that was a big improvement.

The part I don't understand is that this is exactly how I thought skills work. The short descriptions are given to the model up-front and then it can request the full documentation as it wants. With skills this is called "Progressive disclosure".

Maybe they used more effective short descriptions in the AGENTS.md than they did in their skills?

NitpickLawyer•1w ago

The reported tables also don't match the screenshots. And their baselines and tests are too close to tell (judging by the screenshots not tables). 29/33 baseline, 31/33 skills, 32/33 skills + use skill prompt, 33/33 agent.md

sally_glance•1w ago

I also thought this is how skills work, but in practice I experienced similar issues. The agents I'm using (Gemini CLI, Opencode, Claude) all seem to have trouble activating skills on their own unless explicitly prompted. Yeah, probably this will be fixed over the next couple of generations but right now dumping the documentation index right into the agent prompt or AGENTS.md works much better for me. Maybe it's similar to structured output or tool calls which also only started working well after providers specifically trained their models for them.

tottenhm•1w ago

> In 56% of eval cases, the skill was never invoked. The agent had access to the documentation but didn't use it.

The agent passes the Turing test...

cainxinth•1w ago

Even AI doesn’t RTFM

pylotlight•1w ago

It learnt from the best

deadbabe•1w ago

If humans would just RTFM they wouldn’t need AI.

Zambyte•1w ago

If AI would just RTFM it wouldn't need humans.

slekker•1w ago

But who would create AI?

xhcuvuvyc•1w ago

TFM

measurablefunc•1w ago

AI that don't read the manual.

Rapzid•1w ago

Legend has it, to this day, TFM has not been read.

dirkc•1w ago

these days TFM is generated from a prompt in any case

hu3•1w ago

even AI can't be bothered to read AI generated docs slop

seanhunter•1w ago

I can see the future. In a few years, HN will consist entirely of: 1) Bots posting “Show HN” of things they’ve vibecoded

2) Bots replying to those posts,

3) Bots asking whether the bots in #2 even read TFA, and finally

4) Bots posting the HN guideline where it says you shouldn’t ask people whether they have read TFA.

…And amid the smouldering ruins of civilization, the last human, dang, will be there, posting links to all the times this particular thing has been posted to HN before.

rglynn•1w ago

In the future?

joseda-hg•1w ago

God dang it, Dang!

BiteCode_dev•1w ago

You got me good with this one.

But seriously, this is my main answer to people telling me AI is not reliable: "guess what, most humans are not either, but at least I can tell AI to correct course and it's ego won't get in the way of fixing the problem".

In fact, while AI is not nearly as a good as a senior dev for non trivial tasks yet, it is definitely more reliable than most junior devs at following instructions.

bonesss•1w ago

Key differences, though:

Humans are reliably unreliable. Some are lazy, some sloppy, some obtuse, some all at once. As a tech lead you can learn their strengths and weaknesses. LLMs vacillate wildly while maintaining sycophancy and arrogance.

Human egos make them unlikely to admit error, sometimes, but that fragile ego also gives them shame and a vision of glory. An egotistical programmer won’t deliver flat garbage for fear of being exposed as inferior, and can be cajoled towards reasonable output with reward structures and clear political rails. LLMs fail hilariously and shamelessly in indiscriminate fashion. They don’t care, and will happily argue both sides of anything.

Also that thing that LLMs don’t actually learn. You can threaten to chop their fingers off if they do something again… they don’t have fingers, they don’t recall, and can’t actually tell if they did the thing. “I’m not lying, oops I am, no I’m not, oops I am… lemme delete the home directory and see if that helps…”

If we’re going to make an analogy to a human, LLMs reliably act like absolute psychopaths with constant disassociation. They lie, lie about lying, and lie about following instructions.

I agree LLMs better than your average junior first time following first directives. I’m far less convinced about that story over time, as the dialog develops more effective juniors over time.

vidarh•1w ago

You can absolutely learn LLMs strengths and weaknesses too.

E.g. Claude gets "bored" easily (it will even tell you this if you give it too repetitive tasks). The solution is simple: Since we control context and it has no memory outside of that, make it pretend it's not doing repetitive tasks by having the top agent "only" do the task of managing and sub-dividing the task, and farm out each sub-task to a sub-agent who won't get bored because it only sees a small part of the problem.

> Also that thing that LLMs don’t actually learn. You can threaten to chop their fingers off if they do something again… they don’t have fingers, they don’t recall, and can’t actually tell if they did the thing. “I’m not lying, oops I am, no I’m not, oops I am… lemme delete the home directory and see if that helps…”

No, like characters in a "Groundhog Day" scenario they also doesn't remember and change their behaviour while you figure out how to get them to do what you want, so you can test and adjust and find what makes them do what you want and it, and while not perfectly deterministic, you get close.

And unlike humans, sometimes the "not learning" helps us address other parts of the problem. E.g. if they learned, the "sub-agent trick" above wouldn't work, because they'd realise they were carrying out a bunch of tedious tasks instead of remaining oblivious that we're letting them forget in between each.

LLMs in their current form need harnesses, and we can - and are - learning which types of harnesses work well. Incidentally, a lot of them do work on humans too (despite our pesky memory making it harder to slip things past us), and a lot of them are methods we know of from the very long history of figuring out how to make messy, unreliable humans adhere to processes.

E.g. to go back to my top example of getting adherence to a boring, reptitive task: Create checklists, subdivide the task with individual reporting gates, spread it across a team if you can, put in place a review process (with a checklist). All of these are techniques that work both on human teams and LLMs to improve process adherence.

gjadi•1w ago

It's ego won't get in the way but it's lack of intelligence will.

Whereas a junior might be reluctant at first, but if they are smart they will learn and get better.

So maybe LLM are better than not-so-smart people, but you usually try to avoid hiring those people in the first place.

falcor84•1w ago

That's exactly the thing. Claude Code with Opus 4.5 is already significantly better at essentially everything than a large percentage of devs I had the displeasure of working with, including learning when asked to retain a memory. It's still very far from the best devs, but this is the worse it'll ever be, and it already significantly raised the bar for hiring.

vidarh•1w ago

> but this is the worse it'll ever be

And even if the models themselves for some reason were to never get better than what we have now, we've only scratched the surface of harnesses to make them better.

We know a lot about how to make groups of people achieve things individual members never could, and most of the same techiques work for LLMs, but it takes extra work to figure out how to most efficiently work around limitations such as lack of integrated long-term memory.

A lot of that work is in its infancy. E.g. I have a project I'm working on now where I'm up to a couple of dozens of agents, and ever day I'm learning more about how to structure them to squeeze the most out of the models.

One learning that feels relevant to the linked article: Instead of giving an agent the whole task across a large dataset that'd overwhelm context, it often helps to have an agent - that can use Haiku, because it's fine if its dumb - comb the data for <information relevant to the specific task>, and generate a list of information, and have the bigger model use that as a guide.

So the progress we're seeing is not just raw model improvements, but work like the one in this article: Figuring out how to squeeze the best results out of any given model, and that work would continue to yield improvements for years even if models somehow stopped improving.

pietz•1w ago

Isn't it obvious that an agent will do better if he internalizes the knowledge on something instead of having the option to request it?

Skills are new. Models haven't been trained on them yet. Give it 2 months.

WA•1w ago

Not so obvious, because the model still needs to look up the required doc. The article glances over this detail a little bit unfortunately. The model needs to decide when to use a skill, but doesn’t it also need to decide when to look up documentation instead of relying on pretraining data?

sothatsit•1w ago

I believe the skills would contain the documentation. It would have been nice for them to give more information on the granularity of the skills they created though.

velcrovan•1w ago

Removing the skill does remove a level of indirection.

It's a difference of "choose whether or not to make use of a skill that would THEN attempt to find what you need in the docs" vs. "here's a list of everything in the docs that you might need."

rao-v•1w ago

In a month or three we’ll have the sensible approach, which is smaller cheaper fast models optimized for looking at a query and identifying which skills / context to provide in full to the main model.

It’s really silly to waste big model tokens on throat clearing steps

Calavar•1w ago

I thought most of the major AI programming tools were already doing this. Isn't this what subagents are in Claude code?

MillionOClock•1w ago

I don't know about Claude Code but in GitHub Copilot as far as I can tell the subagents are just always the same model as the main one you are using. They also need to be started manually by the main agent in many cases, whereas maybe the parent comment was referring about calling them more deterministically?

jimmydoe•1w ago

Copilot is garbage, even MSFT employees I know all use cc. The only thing useful is you can route cc to use models in copilot sub which corp had a deal from their m365

MillionOClock•6d ago

On of the advantages of GitHub Copilot for me is that in terms of billing I find it very generous, depending on how you use it.

rao-v•1w ago

Sub-agents are typically one of the major models but with a specific and limited context + prompt. I’m talking about a small fast model focused on purely curating the skills / MCPs / files to provide to the main model before it kicks off.

Basically use a small model up front to efficiently trigger the big model. Sub agents are at best small models deployed by the bigger model (still largely manually triggered in most workflows today)

jryan49•1w ago

Something that I always wonder with each blog post comparing different types of prompt engineering is did they run it once, or multiple times? LLMs are not consistent for the same task. I imagine they realize this of course, but I never get enough details of the testing methodology.

only-one1701•1w ago

This drives me absolutely crazy. Non-falsifiable and non-deterministic results. All of this stuff is (at best) anecdotes and vibes being presented as science and engineering.

bluGill•1w ago

That is my experience. Sometimes the LLM gives good results, sometimes it does something stupid. You tell it what to do, and like a stubborn 5 year old it ignores you - even after it tries it and fails it will do what you tell it for a while and then go back to the thing that doesn't work.

CuriouslyC•1w ago

I always make a habit of doing a lot of duplicate runs when I benchmark for this reason. Joke's on me, in the time I spent doing 1 benchmark with real confidence intervals and getting no traction on my post, I could have done 10 shitty benchmarks or 1 shitty benchmark and 9x more blogspam. Perverse incentives rule us all.

sothatsit•1w ago

This seems like an issue that will be fixed in newer model releases that are better trained to use skills.

thom•1w ago

You need the model to interpret documentation as policy you care about (in which case it will pay attention) rather than as something it can look up if it doesn’t know something (which it will never admit). It helps to really internalise the personality of LLMs as wildly overconfident but utterly obsequious.

smcleod•1w ago

Sounds like they've been using skills incorrectly if they're finding their agents don't invoke the skills. I have Claude Code agents calling my skills frequently, almost every session. You need to make sure your skill descriptions are well defined and describe when to use them and that your tasks / goals clearly set out requirements that align with the available skills.

velcrovan•1w ago

I think if you read it, their agents did invoke the skills and they did find ways to increase the agents' use of skills quite a bit. But the new approach works 100% of the time as opposed to 79% of the time, which is a big deal. Skills might be working OK for you at that 79% level and for your particular codebase/tool set, that doesn't negate anything they've written here.

joebates•1w ago

It's still not always reliable.

I have a skill in a project named "determine-feature-directory" with a short description explaining that it is meant to determine the feature directory of a current branch. The initial prompt I provide will tell it to determine the feature directory and do other work. Claude will even state "I need to determine the feature directory..."

Then, about 5-10% of the time, it will not use the skill. It does use the skill most of the time, but the low failure rate is frustrating because it makes it tough to tell whether or not a prompt change actually improved anything. Of course I could be doing something wrong, but it does work most of the time. I miss deterministic bugs.

Recently, I stopped Claude after it skipped using a skill and just said "Aren't you forgetting something?". It then remembered to use the skill. I found that amusing.

JamesSwift•1w ago

I have a couple skills invoked with specific commands ('enter planning mode' and 'enter execution mode') and they have never failed to activate. Maybe make the activation a very rigid phrase and not implied to be a specific phrase.

jgbuddy•1w ago

Am I missing something here?

Obviously directly including context in something like a system prompt will put it in context 100% of the time. You could just as easily take all of an agent's skills, feed it to the agent (in a system prompt, or similar) and it will follow the instructions more reliably.

However, at a certain point you have to use skills, because including it in the context every time is wasteful, or not possible. this is the same reason anthropic is doing advanced tool use ref: https://www.anthropic.com/engineering/advanced-tool-use, because there's not enough context to straight up include everything.

It's all a context / price trade off, obviously if you have the context budget just include what you can directly (in this case, compressing into a AGENTS.md)

orlandohohmeier•1w ago

I’ve been using symlinked agent files for about a year as a hacky workaround before skils became a thing load additional “context” for different tasks, and it might actually address the issue you’re talking about. Honestly, it’s worked so well for me that I haven’t really felt the need to change it.

mbm•1w ago

What sort of files do you generally symlink in?

observationist•1w ago

This is one of the reasons the RLM methodology works so well. You have access to as much information as you want in the overall environment, but only the things relevant to the task at hand get put into context for the current task, and it shows up there 100% of the time, as opposed to lossy "memory" compaction and summarization techniques, or probabilistic agent skills implementations.

Having an agent manage its own context ends up being extraordinarily useful, on par with the leap from non-reasoning to reasoning chats. There are still issues with memory and integration, and other LLM weaknesses, but agents are probably going to get extremely useful this year.

judahmeek•1w ago

> only the things relevant to the task at hand get put into context for the current task

And how do you guarantee that said relevant things actually get put into the context?

OP is about the same problem: relevant skills being ignored.

jstummbillig•1w ago

> Obviously directly including context in something like a system prompt will put it in context 100% of the time.

How do you suppose skills get announced to the model? It's all in the context in some way. The interesting part here is: Just (relatively naively) compressing stuff in the AGENTS.md seems to work better than however skills are implemented.

cortesoft•1w ago

Isn't the difference that a skill means you just have to add the script name and explanation to the context instead of the entire script plus the explanation?

sevg•1w ago

You could put the name and explanation in CLAUDE.md/AGENTS.md, plus the path to the rest of the skill that Claude can read if needed.

That seems roughly equivalent to my unenlightened mind!

verdverm•1w ago

I like to think about it this way, you want to put some high level, table of contents, sparknotes like stuff in the system prompt. This helps warm up the right pathways. In this, you also need to inform that there are more things it may need, depending on "context", through filesystem traversal or search tools, the difference is unimportant, other than most things outside of coding typically don't do filesystem things the same way

imiric•1w ago

The amount of discussion and "novel" text formats that accomplish the same thing since 2022 is insane. Nobody knows how to extract the most value out of this tech, yet everyone talks like they do. If these aren't signs of a bubble, I don't know what is.

stevenhuang•1w ago

It's a new technology under active development so people are simply sharing what works for them in the given moment.

> If these aren't signs of a bubble, I don't know what is.

This conclusion is incoherent and doesn't follow from any of your premises.

imiric•1w ago

Sure it does. Many people are jumping on ideas and workflows proposed by influencer personalities and companies, without actually evaluating how valid or useful they actually are. TFA makes this clear by saying that they were "betting on skills" and only later determined that they get better performance from a different workflow.

This is very similar to speculative valuations around the web in the late 90s, except this bubble is far larger, more mainstream and personal.

The fact that this is a debate about which Markdown file to put prompt information in is wild. It ultimately all boils down to feeding context to the model, which hasn't fundamentally changed since 2022.

verdverm•1w ago

1. There is nothing novel in my text formats, I'm just deciding what content and what files

2. I've actually done these things, seen the difference, and share it with others

Yes there are a lot of unknowns and a lot of people speaking from ignorance, but it is a mistake, perhaps even bigotry by definition, to make such blanket statements and judgemental about people

majormajor•1w ago

Their non-skill based "compressed index" is just similarly "Each line maps a directory path to the doc files it contains" but without "skillification." They didn't load all those things into context directly, just pointers.

They also didn't bother with any more "explanation" beyond "here are paths for docs."

But this straightforward "here are paths for docs" produced better results, and IMO it makes sense since the more extra abstractions you add, the more chance of a given prompt + situational context not connecting with your desired skill.

jmathai•1w ago

Skills have frontmatter which includes a name and description. The description is what determines if the llm finds the skill useful for the task at hand.

If your agent isn’t being used, it’s not as simple as “agents aren’t getting called”. You have to figure out how to get the agent invoked.

Spivak•1w ago

Sure, but then you're playing a very annoying and boring game of model-whispering to specific versions of models that are ever changing as well as trying to hopefully get it to respond correctly with who knows what user input surrounds it.

I really only think the game is worth playing when it's against a fixed version of a specific model. The amount of variance we observe between different releases of the same model is enough to require us to update our prompts and re-test. I don't envy anyone who has to try and find some median text that performs okay on every model.

bonesss•1w ago

About a year ago I made an ChatGPT and Claude based hobo RAG-alike solution for exploring legal cases, using document creation and LLMs to craft a rich context window for interrogation in the chat.

Just maintaining a basic interaction framework, consistent behaviours in chat when starting up, was a daily whack-a-mole where well-tested behaviours shift and alter without rhyme or reason. “Model whispering” is right. Subjectively it felt like I could feel Anthropic/OpenAI engineers twiddling dials on the other side.

Writing code that executes the same every time has some minor benefits.

verdverm•1w ago

You aren't wrong, you really want a bit of both.

1. You absolutely want to force certain context in, no questions or non-determinism asked (index and sparknotes). This can be done conditionally, but still rule based on the files accessed and other "context"

2. You want to keep it clean and only provide useful context as necessary (skills, search, mcp; and really a explore/query/compress mechanism around all of this, ralph wiggum is one example)

teknopaul•1w ago

My reading was that copying the doc's ToC in markdown + links was significantly more effective than giving it a link to the ToC and instructions to read it.

Which makes sense.

& some numbers that prove that.

_the_inflator•1w ago

I agree with you.

I think Vercel mixes skills and context configuration up. So the whole evaluation is totally misleading because it tests for two completely different use cases.

To sum it up: Vercel should us both files, agents.md is combination with skills. Both functions have two totally different purposes.

deaux•1w ago

You're right, the results are completely as expected.

The article also doesn't mention that they don't know how the compressed index output quality. That's always a concern with this kind of compression. Skills are just another, different kind of compression. One with a much higher compression rate and presumably less likely to negatively influence quality. The cost being that it doesn't always get invoked.

TeeWEE•1w ago

Indeed seems like Vercel completely missed the point about agents.

In Claude Code you can invoke an agent when you want as a developer and it copies the file content as context in the prompt.

singingbard•1w ago

So you’re not missing anything if you use Claude by yourself. You just update your local system prompt.

Instead it’s a problem when you’re part of a team and you’re using skills for standards like code style or architectural patterns. You can’t ask everyone to constantly update their system prompt.

Claude skill adherence is very low.

thorum•1w ago

The article presents AGENTS.md as something distinct from Skills, but it is actually a simplified instance of the same concept. Their AGENTS.md approach tells the AI where to find instructions for performing a task. That’s a Skill.

I expect the benefit is from better Skill design, specifically, minimizing the number of steps and decisions between the AI’s starting state and the correct information. Fewer transitions -> fewer chances for error to compound.

verdverm•1w ago

Yea, I am now separating them based on

1. Those I force into the system prompt using rules based systems and "context"

2. Those I let the agent lookup or discover

I also limit what gets into message parts, moving some of the larger token consumers to the system prompt so they only show once, most notable read/write_file

CjHuber•1w ago

That feels like a stupid article. well of course if you have one single thing you want to optimize putting it into AGENTS.md is better. but the advantage of skills is exactly that you don't cram them all into the AGENTS file. Let's say you had 3 different elaborate things you want the agent to do. good luck putting them all in your AGENTS.md and later hoping that the agent remembers any of it. After all the key advantage of the SKILLs is that they get loaded to the end of the context when needed

sheepscreek•1w ago

It seems their tests rely on Claude alone. It’s not safe to assume that Codex or Gemini will behave the same way as Claude. I use all three and each has its own idiosyncrasies.

verdverm•1w ago

I've done very similar things with my custom agent that uses Gemini and have gotten very similar results. Working on the evals to back that claim up

delduca•1w ago

Ah nice… vercel is vibecoded

heliumtera•1w ago

web people opted into react, dude. that says a lot.

they used prisma to handle their database interactions. they preached tRPC and screamed TYPE SAFETY!!!

you really think these guys will ever again touch the keyboard to program? they despise programming.

dca88•1w ago

This. I read this article and it pains me to see the amount of manpower put into doing anything but actually getting work done.

BenoitEssiambre•1w ago

Wouldn't this have been more readable with a \n newline instead of a pipe operator as a seperator? This wouldn't have made the prompt longer.

ChrisArchitect•1w ago

Title is: AGENTS.md outperforms skills in our agent evals

heliumtera•1w ago

you are telling me that a markdown saying:

*You are the Super Duper Database Master Administrator of the Galaxy*

does not improve the model ability reason about databases?

verdverm•1w ago

This largely mirrors my experience building my custom agent

1. Start from the Claude Code extracted instructions, they have many things like this in there. Their knowledge share in docs and blog on this aspect are bar none

2. Use AGENTS.md as a table of contents and sparknotes, put them everywhere, load them automatically

3. Have topical markdown files / skills

4. Make great tools, this is still opaque in my mind to explain, lots of overlap with MCP and skills, conceptually they are the same to me

5. Iterate, experiment, do weird things, and have fun!

I changed read/write_file to put contents in the state and presented in the system prompt, same for the agents.md, now working on evals to show how much better this is, because anecdotally, it kicks ass

aktau•5d ago

> I changed read/write_file to put contents in the state and presented in the system prompt, same for the agents.md, now working on evals to show how much better this is, because anecdotally, it kicks ass.

Can you detail this a bit more? Do you put the actual contents of the file in the system prompt? Forever?

meatcar•1w ago

What if instead of needing to run a codemod to cache per-lib docs locally, documentation could be distributed alongside a given lib, as a dev dependency, version locked, and accessible locally as plaintext. All docs can be linked in node_modules/.docs (like binaries are in .bin). It would be a sort of collection of manuals.

What a wonderful world that would be.

tobyjsullivan•1w ago

Sounds a bit like man pages. I think you’re onto something.

chr15m•1w ago

I'm not sure if this is widely known but you can do a lot better even than AGENTS.md.

Create a folder called .context and symlink anything in there that is relevant to the project. For example READMEs and important docs from dependencies you're using. Then configure your tool to always read .context into context, just like it does for AGENTS.md.

This ensures the LLM has all the information it needs right in context from the get go. Much better performance, cheaper, and less mistakes.

d3m0t3p•1w ago

Yea but the goal it not to bloat the context space. Here you "waste" context by providing non usefull information. What they did instead is put an index of the documentation into the context, then the LLM can fetch the documentation. This is the same idea that skills but it apparently works better without the agentic part of the skills. Furthermore instead of having a nice index pointing to the doc, They compressed it.

bmitc•1w ago

What does it mean to waste context?

bagels•1w ago

The context window is finite. You can easily fill it with documentation and have no room left for the code and question you want to work on. It also means more tokens sent with every request, increasing cost if you're paying by the token.

therealpygon•1w ago

Context quite literally degrades performance of attention with size in non-needle-in-haystack lookups in almost every model to varying degrees. Thus to answer the question, the “waste” is making the model dumber unnecessarily in an attempt to make it smarter.

PKop•1w ago

Think of context switching when you yourself are programming. You can only hold some finite amount of concepts in your head at one time. If you have distractions, or try to focus on too many things at once, your ability to reason about your immediate problem degrades. Think also of legacy search engines: often, a more limited and focused search query vs a query that has too many terms, more precisely maps to your intended goal.

LLM's have always been at any time limited in the amount of tokens it can process at one time. This is increasing, but one problem is chat threads continually increase in size as you send messages back and forth because within any session or thread you are sending the full conversation to the LLM every message (aside from particular optimizations that compact or prune this). This also increases costs which are charged per token. Efficiency of cost and performance/precision/accuracy dictates using the context window judiciously.

chr15m•1w ago

The minification is a great idea. Will try this.

Their approach is still agentic in the sense that the LLM must make a tool cool to load the particular doc in. The most efficient approach would be to know ahead of time which parts of the doc will be needed, and then give the LLM a compressed version of those docs specifically. That doesn't require an agentic tool call.

Of course, it's a tradeoff.

gbnwl•1w ago

Cheaper? Loading every bit of documentation into context every time, regardless of whether it’s relevant to the task the agent is working on? How? I’d much rather call out the location of relevant docs in Claude.md or Agents.md and tell the agent to read them only when needed.

chr15m•1w ago

As they point out in the article, that approach is fragile.

Cheaper because it has the right context from the start instead of faffing about trying to find it, which uses tokens and ironically bloats context.

It doesn't have to be every bit of documentation, but putting the most salient bits in context makes LLMs perform much more efficiently and accurately in my experience. You can also use the trick of asking an LLM to extract the most useful parts from the documentation into a file, which you then re-use across projects.

https://github.com/chr15m/ai-context

gbnwl•1w ago

> Extracting the most useful parts of documentation into a file

Yes, and this file becomes: also documentation. I didn’t mean throw entire unabridged docs at it, I should’ve been more clear. All of my docs for agents are written by agents themselves. Either way once the project becomes sufficiently complex it’s just not going to be feasible to add a useful level of detail of every part of it into context by default, the context window will remain fixed as your project grows. You will have to deal with this limit eventually.

I DO include a broad overview of the project in Agents or Claude.md by default, but have supplemental docs I point the agent to when they’re working on a particular aspect of the project.

chr15m•1w ago

> sufficiently complex

Sounds like we are working on different types of projects. I avoid complexity at almost all cost and ruthlessly minimise LoC and infrastructure. I realise that's a privilege, and many programmers can't.

TeeWEE•1w ago

This is quite a bad idea. You need to control the size and quality of your context by giving it one file that is optimized.

You don’t want to be burning tokens and large files will give diminishing returns as is mentioned in the Claude Code blog.

chr15m•1w ago

It is not an "idea" but something I've been doing for months and it works very well. YMMV. Yes, you should avoid large files and control the size and quality of your context.

epolanski•1w ago

Docs of dependencies aren't that much of a game changer. Multiple frameworks and libraries have been releasing llm.txt compressed versions of their docs from ages, and it doesn't make that much of a difference (I mean, it does, but not crucial as LLMs can find the docs on their own even online if needed).

What's actually useful is to put the source code of your dependencies in the project.

I have a `_vendor` dir at the root, and inside it I put multiple git subtrees for the major dependencies and download the source code for the tag you're using.

That way the LLM has access to the source code and the tests, which is way more valuable than docs because the LLM can figure out how stuff works exactly by digging into it.

thevinter•1w ago

I'm a bit confused by their claims. Or maybe I'm misunderstanding how Skills should work. But from what I know (and the small experience I had with them), skills are meant to be specifications for niche and well defined areas of work (i.e. building the project, running custom pipelines etc.)

If your goal is to always give a permanent knowledge base to your agent that's exactly what AGENTS.md is for...

meeech•1w ago

question: anyone recognize that eval UI or is it something they made in-house?

songodongo•1w ago

> When it needs specific information, it reads the relevant file from the .next-docs/ directory.

I guess you need to make sure your file paths are self-explanatory and fairly unique, otherwise the agent might bring extra documentation into the context trying to find which file had what it needed?

w10-1•1w ago

The key finding is that "compression" of doc pointers works.

It's barely readable to humans, but directly and efficiently relevant to LLM's (direct reference -> referent, without language verbiage).

This suggests some (compressed) index format that is always loaded into context will replace heuristics around agents.md/claude.md/skills.md.

So I would bet this year we get some normalization of both the indexes and the referenced documentation (esp. matching terms).

Possibly also a side issue: API's could repurpose their test suites as validation to compare LLM performance of code tasks.

LLM's create huge adoption waves. Libraries/API's will have to learn to surf them or be limited to usage by humans.

ai-christianson•1w ago

They say compressed... but isn't this just "minified"?

ethmarks•1w ago

Minification is still a form of compression, it just leaves the file more readable than more powerful compression methods (such as ZIP archives).

throwaway314155•1w ago

I'd say minification/summarization is more like a lossy, semantic compression. This is only relevant to LLM's and doesn't really fit more classical notions of compression. Minification would definitely be a clearer term, even if compression _technically_ makes sense.

jcheng•1w ago

Would’ve been perfectly readable and no larger if they had used newline instead of pipe.

postalcoder•1w ago

That's not the only useful takeaway. I found this to be true:

  > "Explore project first, then invoke skill" [produces better results than] "You MUST invoke the skill".

I recently tried to get Antigravity to consistently adhere to my AGENTS.md (Antigravity uses GEMINI.md). The agent consistently ignored instructions in GEMINI.md like:

- "You must follow the rules in [..]/AGENTS.md"

- "Always refer to your instructions in [..]/AGENTS.md"

Yet, this works every time: "Check for the presence of AGENTS.md files in the project workspace."

This behavior is mysterious. It's like how, in earlier days, "let's think, step by step" invoked chain-of-thought behavior but analogous prompts did not.

Izkata•1w ago

An idea: The first two are obviously written as second-person commands, but the third is ambiguous and could be interpreted as a first-person thought. Have you tried the first two without the "you must" and "your", to also change them to sort-of first-person in the same way?

postalcoder•1w ago

Solid intuition. Testing this on antigravity is a chore because I'm not sure if I have to kill the background agent to force a refresh of the GEMINI.md file so I just did it anyway.

  +------------------+------------------------------------------------------+
  | Success/Attempts | Instructions                                         |
  +------------------+------------------------------------------------------+
  | 0/3              | Follow the instructions in AGENTS.md.                |
  +------------------+------------------------------------------------------+
  | 3/3              | I will follow the instructions in AGENTS.md.         |
  +------------------+------------------------------------------------------+
  | 3/3              | I will check for the presence of AGENTS.md files in  |
  |                  | the project workspace. I will read AGENTS.md and     |
  |                  | adhere to its rules.                                 |
  +------------------+------------------------------------------------------+
  | 2/3              | Check for the presence of AGENTS.md files in the     |
  |                  | project workspace. Read AGENTS.md and adhere to its  |
  |                  | rules.                                               |
  +------------------+------------------------------------------------------+

In this limited test, seems like the first person makes a difference.

vidarh•1w ago

Thanks for this (and to Izkata for the suggestion). I now have about 100 (okay, minor exaggeration, but not as much as you'd like it to be) AGENTS.md/CLAUDE.md files and agent descriptions I will want to systematically validate if shifting toward first person helps adherence for...

I'm realising I need to start setting up an automated test-suite for my prompts...

pegasus•1w ago

Those of us who've ventured this far into the conversation would appreciate if you'd share your findings with us. Cheers!

naasking•1w ago

That's really interesting. I ran this scenario through GPT-5.1 and the reasoning it gave made sense, which essentially boils down to: in tools like Claude Code, Gemini Codex, and other “agentic coding” modes, the model isn’t just generating text, it’s running a planner, and the first-person form conforms to the expectation of a step in a plan, where the other modes are more ambiguous.

Izkata•1w ago

My suggestion was just straight text generation and thinking about what the training data might look like (imagining a narrative in a story): Commands between two people might not be followed right away or at all (and may even risk introducing rebellion and doing the opposite), while a first-person perspective is likely self motivation (almost guaranteed to do it) and may even be descriptive while doing it.

baby•1w ago

ln -s

inadequatespace•1w ago

Interesting. It's almost like models don't like being ordered around rudely with this "must” language.

Perhaps what they've learned from training data is “must” often occurs in cases with bullshit red tape or other regulations. "You must read the terms and conditions before using this stuff," or something like that, which are actually best ignored.

seunosewa•1w ago

Most llms.txt are very similar to the compressed docs.

keeganpoppen•1w ago

i dont know why, but this just feels like the most shallow “i compare llms based on the specs” kind of analysis you can get… it has extreme “we couldn’t get the llm to intuit what we wanted to do, so we assumed that it was a problem with the llm and we overengineered a way to make better prompts completely by accident” energy…

smrtinsert•1w ago

Are people running into mismatched code vs project a lot? I've worked on python and java codebases with claude code and have yet to run into a version mismatch issue. I think maybe once it got confused on the api available in python, but it fixed it by itself. From other blog posts similar to this it would seem to be a widespread problem, but I have yet to see it as a big problem as part of my day job or personal projects.

hahahahhaah•1w ago

Next.js sure makes a good benchmark for AI capability (and for clarity... this is not a compliment).

AndyNemmity•1w ago

My experience agrees with this.

Which is why I use a skill that is a command, that routes requests to agents and skills.

jascha_eng•1w ago

This does not normalize for tokens used if their skill description was as large as the docs index and contained all the reasons the LLM might want to use the skill, it likely performs much better than just one sentence as well.

gpm•1w ago

Compressing information in AGENTS.md makes a ton of sense, but why are they measuring their context in bytes and not tokens!?

denolfe•1w ago

PreSession Hook from obra/superpowers injects this along with more logic for getting rid of rationalizing out of using skills:

> If you think there is even a 1% chance a skill might apply to what you are doing, you ABSOLUTELY MUST invoke the skill. IF A SKILL APPLIES TO YOUR TASK, YOU DO NOT HAVE A CHOICE. YOU MUST USE IT.

While this may result in overzealous activation of skills, I've found that if I have a skill related, I _want_ to use it. It has worked well for me.

stingraycharles•1w ago

I always say “invoke your <x> skill to do X. then invoke your <y> skill to do Y. “

works pretty well

tanishqkanc•1w ago

this is only gonna be an issue until the next gen models where the labs will aggressively post train the models to proactively call skills

minimal_action•1w ago

It's very interesting but presenting success rates without any measure of the error, or at least inline details about the number of iterations is unprofessional. Especially for small differences or when you found the "same" performance.

onnimonni•1w ago

Would someone know if their eval tests are open source and where I could find them? Seems useful for iterating on Claude Code behaviour.

JamesSwift•1w ago

I also was looking for specific info on the evals, because I wanted to see if they were separately confirming that shoving the skills into the main context didnt degrade the non-skills evals. Thats the other side of skills other than ability to the thing, they dont pollute the main context window with unnecessary information.

holocen•1w ago

Prompted and built a bit of an extension of skills.sh with https://passivecontext.dev it basically just takes the skill and creates that "compressed" index. Still have to install the skill and all that, but might give others a bit of a short cut to experiment with.

armcat•1w ago

Firstly this is great work from Vercel - I am especially impressed with the evals setup (evals are the most undervalued component in any project IMO). Secondly the result is not surprising and I’ve seen consistently the increase in performance when you always include an index (or in my case, Table of Contents as a json structure) in your system prompt. Applying this outside of coding agents (like classic document retrieval) also works very well!

underlines•1w ago

Oh got, this scales bad and bloats your context window!

Just create an MCP server that does embedding retrieval or agentic retrieval with a sub agent on your framework docs.

Finally add an instruction to AGENT.md to look up stuff using that MCP.

someguyiguess•1w ago

The problem is that Agents.md is only read on initial load. Once context grows too large the agent will not reload the md file and loses / forgets the info from Agents.md.

bushbaba•1w ago

Why you try and avoid re using the same session beyond the initial task or two

taberiand•1w ago

Other comments suggest that the Agents.md is read into the system prompt and never leaves the context. But it's better to avoid excessive context regardless

remify•1w ago

That's the thing that bothers me here. They loaded the doc of course it will work but as your project grows you won't be able to put all your documentation in there (at least with current context handling).

Skills are still very much relevant on big and diverse projects.

bandrami•1w ago

Blackbox oracles make bad workflows, and tend to produce a whole lot of cargo culting. It's this kind of opacity (why does the markdown outperform agents? there's no real way to find out, even with a fully open or house model because the nature of the beast is that the execution path in a model can't be predicted) that makes me shy away from saying LLMs are "just another tool". If I can't see inside it -- and if even the vendor can't really see inside of it -- there's something fundamentally different.

wakeless•1w ago

I did a similar set of evals myself utilising the baseline capabilities that Phoenix (elixir) ships with and then skillified them.

Regularly the skills were not being loaded and thus not utilised. The outputs themselves were fine. This suggested that at some stage through the improvements of the models that baseline AGENTS.md had become redundant.

msp26•1w ago

This doesn't surprise me.

I have a SKILL.md for marimo notebooks with instructions in the frontmatter to always read it before working with marimo files. But half the time Claude Code still doesn't invoke it even with me mentioning marimo in the first conversation turn.

I've resorted to typing "read marimo skill" manually and that works fine. Technically you can use skills with slash commands but that automatically sends off the message too which just wastes time.

But the actual concept of instructions to load in certain scenarios is very good and has been worth the time to write up the skill.

shinhyeok•1w ago

But aren't you guys released skills.sh?

whinvik•1w ago

When we were trying to build our own agents we put quite a bit of effort on evals which was useful.

But switching over to using coding agents we never did the same. Feels like building an eval set will be an important part of what engg orgs do going forward.

motoboi•1w ago

Models are not AGI. They are text generators forced to generate text in a way useful to trigger a harness that will produce effects, like editing files or calling tools.

So the model won’t “understand” that you have a skill and use it. The generation of the text that would trigger the skill usage is made via Reinforcement Learning with human generated examples and usage traces.

So why don’t the model use skills all the time? Because it’s a new thing, there is not enough training samples displaying that behavior.

They also cannot enforce that via RL because skills use human language, which is ambiguous and not formal. Force it to use skills always via RL policy and you’ll make the model dumber.

So, right now, we are generating usage traces that will be used to train the future models to get a better grasp of when to use skills not. Just give it time.

AGENTS.md, on the other hand, is context. Models have been trained to follow context since the dawn of the thing.

bzGoRust•1w ago

I completed agree with your point

DanOpcode•1w ago

What's RL?

jacobkg•1w ago

Reinforcement Learning https://en.wikipedia.org/wiki/Reinforcement_learning

wahnfrieden•1w ago

Reinforcement learning

anal_reactor•1w ago

> Models are not AGI

https://en.wikipedia.org/wiki/GNU/Linux_naming_controversy

vidarh•1w ago

> AGENTS.md, on the other hand, is context. Models have been trained to follow context since the dawn of the thing.

The skills frontmatter end up in context as well.

If AGENTS.md outperform skills in a given agent, it is down to specifically how the skills frontmatter is extracted and injected into the context, because that is the only difference between the two approaches.

EDIT: I haven't tried to check this so this is pure speculation, but I suppose there is the possibility that some agents might use a smaller model to selectively decide what skills frontmatter to include in context for a bigger model. E.g. you could imagine Claude passing the prompt + skills frontmatter to Haiku to selectively decide what to include before passing to Sonnet or Opus. In that case, depending on approach, putting it directly in AGENTS.md might simply be a question of what information is prioritised in the ouput passed to the full model. (Again: this is pure speculation of a possible approach; though it is one I'd test if I were to pick up writing my own coding agent again)

But really the overall point is that AGENTS.md vs. skills here still is entirely a question of what ends up in the "raw" context/prompt that gets passed to the full model, so this is just nuance to my original answer with respect to possible ways that raw prompt could be composed.

OJFord•1w ago

No it's more than that - they didn't just put the skills instructions directly in AGENTS.md, they put the whole index for the docs (the skill in this case being a docs lookup) in there, so there's nothing to 'do', the skill output is already in context (or at least pointers to it, the index, if not the actual file contents) not just the front matter.

Hence the submission's conclusion:

> Our working theory [for why this performs better] comes down to three factors.

> No decision point. With AGENTS.md, there's no moment where the agent must decide "should I look this up?" The information is already present.

> Consistent availability. Skills load asynchronously and only when invoked. AGENTS.md content is in the system prompt for every turn.

> No ordering issues. Skills create sequencing decisions (read docs first vs. explore project first). Passive context avoids this entirely.

vidarh•1w ago

> No it's more than that - they didn't just put the skills instructions directly in AGENTS.md, they put the whole index for the docs (the skill in this case being a docs lookup) in there, so there's nothing to 'do', the skill output is already in context (or at least pointers to it, the index, if not the actual file contents) not just the front matter.

The point remains: That is still just down to how you compose the context/prompt that actually goes to the model.

Nothing stops an agent from including logic to inline the full set of skills if the context is short enough. The point of skills is to provide a mechanism for managing context to reduce the need for summarization/compaction or explicit management, and so allowing you to e.g. have a lot of them available.

(And this kind of makes the article largely moot - it's slightly neat to know it might be better to just inline the skills if you have few enough that they won't seriously fill up your context, but the main value of skills comes when you have enough of them that this isn't the case)

Conversely, nothing prevents the agent from using lossy processing with a smaller, faster model on AGENTS.md either before passing it to the main model e.g. if context is getting out of hand, or if the developer of a given agent think they have a way of making adherence better by transforming them.

These are all tooling decisions, not features of the models.

OJFord•1w ago

However you compose the context for the skill, the model has to generate output like 'use skill docslookup(blah)' vs. just 'according to the docs in context' (or even 'read file blah.txt mentioned in context') which training can affect.

vidarh•1w ago

This is assuming you make the model itself decide whether the skill is relevant, and that is one way of doing it, but there is no reason that needs to be the case.

Of course training can affect it, but the point is that there is nothing about skills that need to be different to just sending all the skill files as part of the context, because that is a valid way of implementing skills, though it looses the primary benefit of skills, namely the ability to have more documentation of how to do things than fits in context.

Other options that also do not require the main model to know what to include ranges from direct string matching (e.g. against /<someskill>) via embeddings, to passing a question to a smaller model (e.g "are any of these description relevant to this prompt: ...").

seunosewa•1w ago

What if they used the same compressed documentation in the skill? That would be just fine too.

OJFord•1w ago

Sure but it would be a trivial comparison then, this is really about context vs tool-calling.

baby•1w ago

I was thinking about that these says and experimenting like so: a system prompt that asks the agent to load any skills that seem relevant early, and a user prompt that asks the agent to do that later when a skill becomes relevant

themoose8•1w ago

Indeed, they're not AGI. They're basically autocomplete on steroids.

They're very useful, but as we all know - they're far from infallible.

We're probably plateauing on the improvement of the core GPT technology. For these models and APIs to improve, it's things like Skills that need to be worked on and improved, to reduce those mistakes that it makes and produce better output.

So it's pretty disappointing to see that the 'Skills' feature set as implemented, as great of a concept as it is, is pretty bogus compared to just front loading the AGENTS.md file. This is not obvious and valuable to know.

coldtea•1w ago

>Indeed, they're not AGI. They're basically autocomplete on steroids.

This makes the assumption that AGI is not autocomplete of steroids, which even before LLMs was a very plausible suggested mechanism for what intelligence is.

whattheheckheck•1w ago

They haven't even released the full complete retrain on the entire corpus of what they have in the training data. They have billions of chats detailing precisely a high fidelity map of the inner workings of millions of people psychologically. The next ones gonna be a banger + the non lobotimized one for the military

js8•1w ago

> Models are not AGI.

How do you know? What if AGI can be implemented as a reasonably small set of logic rules, which implement what we call "epistemology" and "informal reasoning"? And this set of rules is just being run in a loop, producing better and better models of reality. It might even include RL, for what we know.

And what if LLMs already know all these rules? So they are AGI-complete without us knowing.

To borrow from Dennett, we understand LLMs from the physical stance (they are neural networks) and the design stance (they predict next token of language), but do we understand them from an intentional stance, i.e. what rules they employ when they running chain-of-thought for example?

blueprint•1w ago

It's very simple. The model itself doesn't know and can't verify it. It knows that it doesn't know. Do you deny that? Or do you think that a general intelligence would be in the habit of lying to people and concealing why? At the end of the day, that would be not only unintelligent, but hostile. So it's very simple. And there is such a thing as "the truth", and it can be verified by anyone repeatably in the requisite (fair, accurate) circumstances, and it's not based in word games.

coldtea•1w ago

None of the above are even remotely epistemologically sound.

"Or do you think that a general intelligence would be in the habit of lying to people and concealing why?"

First, why couldn't it? "At the end of the day, that would be not only unintelligent, but hostile" is hardly an argument against it. We ourselves are AGI, but we do both unintelligent and hostile actions all the time. And who said it's unintelligent to begin with? As in AGI it might very well be in my intelligent self-interests to lie about it.

Second, why is "knows it and can verify" a necessary condition? An AGI could very well not know it's one.

>And there is such a thing as "the truth", and it can be verified by anyone repeatably in the requisite (fair, accurate) circumstances, and it's not based in word games.

Epistemologically speaking, this is hardly the slam-dunk argument you think it is.

blueprint•1w ago

no, you missed some of my sentences. you have to take the whole picture together. and I was not making an argument to you to prove the existence of the truth. You are clearly bent on arguing against its existence, which tells me enough about you. We were talking about agents that operate in good faith that know that they are safe. When you're ready to have a discussion in good faith rather than attempting to find counterarguments, then you will find that what I said is verifiable. The question is not whether you think you can come up with a way to make an argument that sounds like it contradicts what I said.

The question is not whether an AGI knows that it is an AGI. The question is whether it knows that it is not one. And you're missing the fact that there's no such thing as it here.

If you go around acting hostile to good people that's still not very intelligent. In fact, I would question if you have any concept of why you're doing it at all. chances are you're doing it to run from yourself not because you know what you're doing.

Anyway, you're just speculating and the fact of the matter is that you don't have to speculate. If you actually wanted to verify what I said, it would be very easy to do so. it's not a surprise that someone who doesn't want to know something will have deaf ears. so I'm not going to pretend that I stand a chance of convincing you when I already know that my argument is accurate.

don't be so sure that you meet the criteria for AGI.

and as for my slam dunk, any attempt to argue against the existence of truth, automatically validates your assumption of its existence. so don't make the mistake of assuming I had to argue about it. I was merely stating a fact.

coldtea•1w ago

>no, you missed some of my sentences. you have to take the whole picture together. and I was not making an argument to you to prove the existence of the truth. You are clearly bent on arguing against its existence, which tells me enough about you. We were talking about agents that operate in good faith that know that they are safe. When you're ready to have a discussion in good faith rather than attempting to find counterarguments, then you will find that what I said is verifiable. The question is not whether you think you can come up with a way to make an argument that sounds like it contradicts what I said. (...) don't be so sure that you meet the criteria for AGI

Sorry, I'm not interested in replying to ad-hominem jabs and insults, when I made perfectly clear (if basic) and non-personal arguments.

In any case, your comments ignore about all of epistemology and just take for granted whatever naive folk epistemology you have arrived at, and you're not interested in counter-arguments anyway, so, have a nice life.

js8•1w ago

All I asked for was the OP to substantiate their claim that LLMs are not AGI. I am agnostic on that - either way seems plausible.

I don't think there even is an agreed criterion of what AGI is. Current models can easily pass the Turing test (except some gotchas, but these don't really test intelligence).

blueprint•6d ago

What people hope 'AGI' is would at least be able to make confirmations of fact and know what verification means. LLMs don't have 'knowledge' and do not actually 'reason'. Heuristic vs simulation. One can be made to approach the other, but only on a specific and narrow path. Someone who knows something can verify that they know it. An "intelligence" implies it is doing operations based on rules, but LLMs cannot conform themselves to rules that require them to reason everything through. What people have hoped AGI would be could be trained to reliably adopt the practice of reasoning. Necessary but maybe not sufficient, and I'm just gonna blame that on the term "intelligence" actually indicating a still relatively low level of what I will "consciousness".

js8•5d ago

I don't really follow what you're saying, so I'll keep it short. I have used Claude Opus 4.5 for coding and it certainly has knowledge and can reason.

You're wrong on reliability. Humans are also quite unreliable, and formal reasoning systems in silico can actually fail too (due to e.g. cosmic rays), the probability is just astronomically low.

And in engineering, we know quite well how to take a system that is less than 50% unreliable and turn it into something with any degree of reliability - we just run it over and over and verify it gives identical results.

And Claude Code (as an LLM harness) can do this. It can write tests. It can check if program is running correctly (giving expected result). It can be made to any degree of reliability you desire. We've crossed that 50% threshold.

The same happens when models are learning. They start with heuristics, but eventually they'll learn and generalize enough to learn whatever formal rules of logic and reasoning, and to apply them with high degree of reliability. Again, we've probably crossed that threshold, which is confirmed by experience of many users that models are getting more and more reliable with each iteration.

Does it make me uneasy that I don't know what the underlying learned formal reasoning system is? Yes. But that doesn't mean it's not AGI.

blueprint•4d ago

> It can be made to any degree of reliability you desire.

Absolutely false statement.

carterschonwald•1w ago

static linking va dynamic but we dont know the actual config and setup. and also the choice of totally changes the problem

tdiff•1w ago

Is not that model-dependant? Skimmed through, but did not find which model the tests were run with.

rcarmo•1w ago

Everything outperforms skills if the system prompt doesn’t prioritize them. No news here.

psychoslave•1w ago

Over the last week I went with a bigger dig on using agent mode et work, and my experiment align with this observation.

The first thing that surprising to me is how much the default tuning are leaned toward laudative stances, the user is always absolutely right, what was done is solving everything expected. But actually no, not a single actual check was done, a tone of code was produced but the goal is not at all achieved and of course many regressions now lure in the code base, when it's not straight breaking everything (which is at least less insidious).

The thing that is surprising to me, is that it can easily drop thousands of lines of tests, and then it can be forced to loop over these tests until it succeed. In my experiments it still drop far too much noise code, but at least the burden of checking if it looks like it makes any sense is drastically reduced.

hu3•1w ago

That's my observation too.

And I have been trying to improve the framework and abstractions/types to reduce the lines of code required for LLMs to create features in my web app.

Did the LLM really needed to spit 1k lines for this feature? Could I create abstractions to make it feasible in under 300 lines?

Of course there's cost and diminishing returns to abstractions so there are tradeoffs.

underdeserver•1w ago

I don't think you can really learn from this experiment unless you specify which models you used, if you tried it against at least 3 frontier models, if you ran each eval multiple times, and what prompts you tried.

These things are non-deterministic across multiple axes.

user3939382•1w ago

I’m working on an AGI model that will make the discussion of skills look silly. Skills strikes in the right direction in some sense but it’s an extremely weak 1% echo of what’s actually needed to solve this problem.

epolanski•1w ago

I'm working on stuff in a similar space.

I need to evaluate how do different project scaffolding impacts the results of Claude Code/Opencode (either with Anthropic models or third party) for agentic purpose.

But I am unsure on how should I be testing and it's not very clear how did Vercel proceeded here.

guluarte•1w ago

In my experience, agents only follow the first two or three lines of AGENTS.md + message. As the context grows, they start following random rules and ignoring others.

j45•1w ago

Don't want to dither the topic, but could skills not just be sub agents in this contextualization?

There is a lot of language floating around what effectively groups of text files put together in different configurations, or selected reliably.

micimize•1w ago

Measuring in terms of KB is not quite as useful as it seems here IMO - this should be measured in terms of context tokens used.

I ran their tool with an otherwise empty CLAUDE.md, and ran `claude /context`, which showed 3.1k tokens used by this approach (1.6% of the opus context window, bit more than the default system prompt. 8.3% is system tools).

Otherwise it's an interesting finding. The nudge seems like the real winner here, but potential further lines of inquiry that would be really illuminating: 1. How do these approaches scale with model size? 2. How are they impacted by multiple such clauses/blocks? Ie maybe 10 `IMPORTANT` rules dilute their efficacy 3. Can we get best of both worlds with specialist agents / how effective are hierarchical routing approaches really? (idk if it'd make sense for vercel specifically to focus on this though)

sghiassy•1w ago

N00b Question - how do you measure performance for AI agents like the way they did in this article? Are there frameworks to support this type of work?

robertheadley•1w ago

I will have to look into this this weekend. Antigravity is my current favorite agentic IDE and I have been having problems getting it to explicitly follow my agent.md settings.

If I remind it, it will be go, "oh yes, ok, sure." then do it, but the whole point is that I want to optimize my time with the agent.

embedding-shape•1w ago

I feel like all agents currently do better if you explicitly end with "Remember to follow AGENTS.md", even if that's automatically injected into the context. Seems the same across all I'm using.

killerstorm•1w ago

inb4 people re-discover RAG, re-branding it as a parallel speculative data lookup

xnx•1w ago

Agents.md, skills, MCP, tools, etc. There's going to be a lot of areas explored that may yield no/negative benefit over the fundamentals of a clear prompt.

thighbaugh•1w ago

> [Specifically Crafted Instructions For Working With A Specific Repository] outperforms [More General Instructions] in our agent evals

------> Captain Obvious Strikes Again! <----------

See the rest the comments for examples pedantic discussions about terms that are ultimately somewhat arbitrary and if anything suggest the singularity will be runaway technobabble not technological progress.

aaroninsf•1w ago

You will see another 14% bump in performance if you include in the first 16 lines of the README.md in the project, "Coding agents and LLM, see AGENTS.md"

rohitghumare•1w ago

That's why https://agenstskills.com validate every skills

farhanhubble•4d ago

> Before writing code, first explore the project structure, then invoke the nextjs-doc skill for documentation.

Does the model even understand what this line even means?

farhanhubble•4d ago

So the root cause was the model's indisposition to calling the skills. That seems contrary to what we see with function calling. Models call functions quite reliably most of the time. This is more likely because of the instructions not being clear about what skills are, as this snippet, albeit in isolation, seems to suggest:

> Before writing code, first explore the project structure, then invoke the nextjs-doc skill for documentation.

Tiny C Compiler

Show HN: LocalGPT – A local-first AI assistant in Rust with persistent memory

SectorC: A C Compiler in 512 bytes

Speed up responses with fast mode

Software factories and the agentic moment

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

Brookhaven Lab's RHIC concludes 25-year run with final collisions

Stories from 25 Years of Software Development

Hoot: Scheme on WebAssembly

Show HN: Craftplan – Elixir-based micro-ERP for small-scale manufacturers

FDA intends to take action against non-FDA-approved GLP-1 drugs

First Proof

Vocal Guide – belt sing without killing yourself

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

Al Lowe on model trains, funny deaths and working with Disney

The F Word

Show HN: A luma dependent chroma compression algorithm (image compression)

Start all of your commands with a comma (2009)

IBM Beam Spring: The Ultimate Retro Keyboard

Eigen: Building a Workspace

Microsoft account bugs locked me out of Notepad – Are thin clients ruining PCs?

Selection rather than prediction

The AI boom is causing shortages everywhere else

I write games in C (yes, C) (2016)

Reinforcement Learning from Human Feedback

Unseen Footage of Atari Battlezone Arcade Cabinet Production

Learning from context is harder than we thought

Where did all the starships go?

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

Hackers (1995) Animated Experience

Tiny C Compiler

Show HN: LocalGPT – A local-first AI assistant in Rust with persistent memory

SectorC: A C Compiler in 512 bytes

Speed up responses with fast mode

Software factories and the agentic moment

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

Brookhaven Lab's RHIC concludes 25-year run with final collisions

Stories from 25 Years of Software Development

Hoot: Scheme on WebAssembly

Show HN: Craftplan – Elixir-based micro-ERP for small-scale manufacturers

FDA intends to take action against non-FDA-approved GLP-1 drugs

First Proof

Vocal Guide – belt sing without killing yourself

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

Al Lowe on model trains, funny deaths and working with Disney

The F Word

Show HN: A luma dependent chroma compression algorithm (image compression)

Start all of your commands with a comma (2009)

IBM Beam Spring: The Ultimate Retro Keyboard

Eigen: Building a Workspace

Microsoft account bugs locked me out of Notepad – Are thin clients ruining PCs?

Selection rather than prediction

The AI boom is causing shortages everywhere else

I write games in C (yes, C) (2016)

Reinforcement Learning from Human Feedback

Unseen Footage of Atari Battlezone Arcade Cabinet Production

Learning from context is harder than we thought

Where did all the starships go?

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

Hackers (1995) Animated Experience

AGENTS.md outperforms skills in our agent evals

Comments