First, about the loop: Claude's (the coding agent's) context and attention are big enough to self-reflect. Agent Tuning shows a technique that not only demonstrates this but also gives a way to quantify it. [0] The difference is that autoresearch's val_bpb measures what the agent built; Agent Tuning's p̂ measures the agent itself.
> Claude's attention doesn't distinguish between "instructions I'm writing" and "instructions I'm following" -- they're both just tokens in context.
Second, doing research helps: finding academic papers to add to context makes a real difference. Here is an example of an implementation that creates trading strategies by reading research and recreating it in creative new ways. [1]
The biggest problem is that coding agents don't "Fail fast and loud". They fail deceptively.
GPT-2 and GPT-3 used to fail fast (and loud, because we could easily see them lying).
After one month of working on using Claude to create trading strategies, the one thing I learned: if the strategy looks like it can profit, it is a lie. The trading-strategy agent doesn't find trading strategies that work; it is really a bug-hunting agent.
> TL;DR: Coding agents generate better optimizations when they read papers and study competing projects before touching code
What made you think I hadn't read the article, let alone that TL;DR? I'm really curious. Jumping to an insulting "have you read the article" is a big step, so it'll be really interesting to see where your mind went.
However, I'd be curious to hear back from others who have tried adding the shell script (at the end of the article) to their flow: does it (really) improve Claude?
To feed arXiv papers to LLMs I found that RST gives the best token count/fidelity ratio. Markdown lacks precision. LaTeX is too verbose. I have a script with the papers' URLs, names and dates that downloads the LaTeX zips from arXiv, extracts them, transforms them to RST and then adds them to the right folder. Then I ask an LLM to make a summary from the full text, then I give other LLMs the full paper again with the summary and ask them to improve on and proofread it. While this goes on I read the papers myself, and at the end I read the summaries; if I approve a summary I add it to the skill. I also add, for each paper, info on how well the algorithms described do on common benchmarks.
I highly recommend doing something similar if you're working in a cutting-edge domain. Also I'd like to know if anyone has recommendations to improve what I do.
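For anyone curious what that download-and-convert step can look like, here is a minimal sketch. It assumes pandoc is on the PATH and a papers.json listing arXiv ids and names; the summarize/proofread/review steps described above are not included, and the real script surely differs.

```python
#!/usr/bin/env python3
# Sketch of the arXiv -> RST step described above (assumed inputs, not the
# original script). Requires pandoc on PATH and a papers.json like:
#   [{"id": "2104.00001", "name": "some-paper"}, ...]
import json
import subprocess
import tarfile
import urllib.request
from pathlib import Path

papers = json.loads(Path("papers.json").read_text())
out_root = Path("papers")

for paper in papers:
    work = out_root / paper["name"]
    work.mkdir(parents=True, exist_ok=True)

    # arXiv serves the LaTeX source as a tarball at /e-print/<id>
    # (a few papers ship a single gzipped .tex instead; not handled here).
    tar_path = work / "source.tar.gz"
    urllib.request.urlretrieve(f"https://arxiv.org/e-print/{paper['id']}", tar_path)
    with tarfile.open(tar_path) as tar:
        tar.extractall(work)

    # Heuristic: treat the largest .tex file as the main one.
    main_tex = max(work.rglob("*.tex"), key=lambda p: p.stat().st_size)

    # LaTeX -> RST with pandoc, which is where most fidelity issues show up.
    subprocess.run(
        ["pandoc", "-f", "latex", "-t", "rst",
         str(main_tex), "-o", str(work / f"{paper['name']}.rst")],
        check=True,
    )
```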
Reading all the papers once isn't the same as this. I find it very useful.
I can ask an LLM to do the basic implementations, then I can refine them (make the code better, faster, cut memory use), then I can ask the LLM whether I'm still implementing the algorithms as they're described in the paper.
Unit testing would save on tokens... unit testing is perfect for validating refactors, or when rewriting a project from one language to another: build unit tests first.
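A minimal sketch of that "unit tests first" idea for a rewrite or refactor: pin the existing behavior and check the new implementation against it case by case. The module and function names here are made up for illustration.

```python
# Characterization tests: the rewrite is only accepted if it matches the
# legacy behavior on the pinned cases. legacy_impl/new_impl are hypothetical.
import pytest

from legacy_impl import parse_price as old_parse_price   # assumed existing code
from new_impl import parse_price as new_parse_price      # assumed rewrite

CASES = ["1,234.56", "0", "-17.5", "  42 ", "1e3"]

@pytest.mark.parametrize("raw", CASES)
def test_rewrite_matches_legacy(raw):
    # Includes how both handle odd or malformed input, not just the happy path.
    assert new_parse_price(raw) == old_parse_price(raw)
```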
Honestly I think that Markdown with LaTeX code blocks would be the most efficient representation, but when doing it with Pandoc I kept having issues with loss of information and sometimes even syntax errors.
Then add something in your {CLAUDE,AGENTS}.md that says: when working on something with relevant context supplied by papers, read the papers before doing the work. You can find all papers plus their descriptions in ./papers/INDEX.md and papers by tag in ./papers/tagged.
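A rough sketch of how ./papers/INDEX.md and ./papers/tagged could be generated from per-paper metadata, so the agent has something to navigate. The metadata file, its fields, and the choice of one markdown file per tag are all assumptions, not from the comment above.

```python
# Sketch of generating the index the {CLAUDE,AGENTS}.md entry points at.
# Assumes papers/metadata.json with entries like:
#   {"name": ..., "path": ..., "description": ..., "tags": [...]}
import json
from pathlib import Path

papers_dir = Path("papers")
meta = json.loads((papers_dir / "metadata.json").read_text())

index_lines = ["# Papers", ""]
by_tag = {}

for paper in meta:
    entry = f"- [{paper['name']}]({paper['path']}): {paper['description']}"
    index_lines.append(entry)
    for tag in paper.get("tags", []):
        by_tag.setdefault(tag, []).append(entry)

(papers_dir / "INDEX.md").write_text("\n".join(index_lines) + "\n")

# One markdown file per tag under ./papers/tagged (directories or symlinks
# would work just as well; this layout is an assumption).
tagged = papers_dir / "tagged"
tagged.mkdir(exist_ok=True)
for tag, entries in by_tag.items():
    (tagged / f"{tag}.md").write_text(f"# {tag}\n\n" + "\n".join(entries) + "\n")
```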
Thanks for sharing!
> The full setup works with any project that has a benchmark and test suite.
So having a clear and measurable verification step is key, meaning you can't simply give an AI agent a vague goal, e.g. "improve the quality of the codebase", because it's too general.
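As a concrete (hypothetical) example of such a verification step, a gate script that only accepts a change when the test suite passes and the benchmark doesn't regress gives the agent something measurable to optimize against. The script and file names below are assumptions, not from the article.

```python
#!/usr/bin/env python3
# Hypothetical verification gate: keep the agent's change only if tests pass
# and the benchmark score does not drop below the stored baseline.
import json
import subprocess
import sys
from pathlib import Path

def run_benchmark() -> float:
    # Assume the project ships a benchmark that prints a single number
    # (e.g. tokens/sec); higher is better in this sketch.
    out = subprocess.run(["./bench.sh"], capture_output=True, text=True, check=True)
    return float(out.stdout.strip())

def main() -> int:
    # 1. Tests must pass; a vague goal gives the agent nothing to check against.
    if subprocess.run(["./run_tests.sh"]).returncode != 0:
        print("reject: test suite failed")
        return 1
    # 2. Benchmark must not regress against the stored baseline.
    baseline = json.loads(Path("baseline.json").read_text())["score"]
    score = run_benchmark()
    if score < baseline:
        print(f"reject: benchmark regressed ({score:.2f} < {baseline:.2f})")
        return 1
    print(f"accept: {score:.2f} >= {baseline:.2f}")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```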
Literally every project. If it's something that's been done a million times then that means it has good literature on it? If not, then even more important to find related stuff! And not just crunchy CS stuff like databases or compilers or whatever. Are you creating a UI? There's probably been great UI research you can base off of! Will this game loop be fun in the game you're building? There's probably been research about it!
Also I wonder who/what decides what papers go in there.
In the blog post, the agent is allowed to do its own search.
Claude is much faster and better at reading papers than Codex (some of this is nested skill dispatch), but they both work incredibly well for this. Compile your set of papers, queue it up, hit /ingest-collection and go to sleep, and come back to a remarkable knowledge base :)
Do you see a noticeable difference in output quality when the agent reads context first vs going straight into generation?
Feels like most tools skip that step.
I see it as the solution being out there in “idea space”, and by having the agent search beforehand we can more efficiently explore this space before converging on the final solution.
Then you could just prompt it to propose options with pros and cons etc.
* Bar extremely new stuff from after the cutoff
- deep research for papers, projects, etc. I prefer ChatGPT Pro Deep Research here, as it can quickly survey hundreds of sources for overall relevance
- deep dives into specific papers and projects, where an AI coding agent downloads relevant papers and projects for local analysis loops, performs technical breakdowns into essentially a markdown wiki, and then reduces over all of them into a findings report (sketched below). Claude Code is a bit nicer here because it supports parallel subagents well.
- iterative design phase where the agent iterates between the papers, repos and our own project to refine suggestions and ideas
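As a rough sketch of the gathering side of that reduce step (the actual reduction into findings and suggestions is done by the agent), something like this stitches the per-paper and per-repo markdown breakdowns into one document for it to work from. The directory layout and file names are assumptions.

```python
# Gather the per-paper/per-repo breakdowns into a single document that the
# coding agent then reduces into an actual findings report.
from pathlib import Path

wiki = Path("analysis")        # one markdown breakdown per paper or repo (assumed)
report = Path("findings.md")

sections = []
for page in sorted(wiki.glob("*.md")):
    sections.append(f"## {page.stem}\n\n{page.read_text().strip()}\n")

report.write_text("# Findings\n\n" + "\n".join(sections))
```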
Fundamentally, this is both exciting and limiting: it's an example of 'Software Collapse', where we get to apply best practices and good ideas from relevant communities, but the LLM is not doing the creativity here, just mashing things up and helping pick.
Tools to automate this stuff seem nice. I'd expect it to be trained into the agents soon, as it's not far from their existing capabilities already. E.g., 'iteratively optimize function foobar, prefer GPU literature for how.'
Based on this finding, I suppose the better way is to rely on local hardware whenever possible?
This fits into the paradigm of finding ways to force better context engineering.
We added a literature review phase to Karpathy’s autoresearch loop and pointed it at llama.cpp. The agent autonomously read arXiv papers, studied competing forks, and spun up VMs to run parallel experiments.