Now build it for an old codebase; let's see how precisely it edits or removes features without breaking the whole codebase.
Let's see how many tokens it consumes per bug fix or feature addition.
I just wrote this comment so people aren't under the false belief that this is pretty much all coding agents do; making all of this fault tolerant with good UX is a lot of work.
1. Precompute frequently used knowledge and surface it early: for example, repository structure, OS information, and system time.
2. Anticipate the next tool call. If a match is not found while editing, instead of simply failing, return the closest matching snippet. If the read-file tool is given a directory, return the directory contents. (A rough sketch of 1 and 2 follows this list.)
3. Parallel tool calls. Claude needs either a batch tool or special scaffolding to encourage parallel tool calls; a single tool call per turn is very expensive.
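Here's roughly what 1 and 2 can look like in practice. This is only a sketch; the function names, the use of "git ls-files", and the closest-match heuristic are illustrative, not taken from any particular agent:

    # Hypothetical helpers sketching ideas 1 and 2 above.
    import difflib
    import platform
    import subprocess
    from datetime import datetime

    def precomputed_context(repo_root: str) -> str:
        """Gather knowledge up front so the agent doesn't burn tool calls on it."""
        tree = subprocess.run(
            ["git", "-C", repo_root, "ls-files"],
            capture_output=True, text=True, check=False,
        ).stdout
        return (
            f"OS: {platform.platform()}\n"
            f"Time: {datetime.now().isoformat()}\n"
            f"Repository files:\n{tree}"
        )

    def edit_file(path: str, old: str, new: str) -> str:
        """Replace old with new; on a miss, return the closest snippet instead of a bare error."""
        with open(path) as f:
            text = f.read()
        if old in text:
            with open(path, "w") as f:
                f.write(text.replace(old, new, 1))
            return "OK"
        closest = difflib.get_close_matches(old, text.splitlines(), n=1, cutoff=0.3)
        return f"No exact match. Closest line: {closest[0]!r}" if closest else "No match found."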
Are there any other such general ideas?
I am still looking for a good "memory" solution; so far I'm running without one and haven't looked too deeply into it.
Not sure how the next tool call can be predicted.
I am still using serial tool calls since I don't have any subagents; I just use fast inference models for direct tool calls. It works so fast that I doubt I'll benefit from parallelizing anything.
There are a few models that solve 30-50% of (new) tasks pulled from real-world repos. So ... yeah.
I think the future will be dashboards/HUDs (there was an article on HN about this a bit ago and I agree). You'll get preview windows, dynamic action buttons, a kanban board, status updates, and still the ability to edit code yourself, of course.
The single-file lineup of agentic actions with user input, in a terminal chat UI, just isn't gonna cut it for more complicated problems. You need faster error reporting from multiple sources, you need to be able to correct the LLM and break it out of error loops. You won't want to be at the terminal even though it feels comfortable because it's just the wrong HCI tool for more complicated tasks. Can you tell I really dislike using these overly-simple agents?
You'll get a much better result with a dashboard/HUD. The future of agents is that multiple of them will be working at once on the codebase and they'll be good enough that you'll want more of a status-update-confirm loop than an agentic code editing tool update.
Also required is better code editing. You want to avoid the LLM making changes to your code unrelated to the requested problem. Gemini CLI often does a 'grep' for keywords in your prompt to find the right file, but if your prompt was casual and doesn't contain the right keywords, you end up with the agent making changes that weren't intended.
Obviously I am working in this space so that's where my opinions come from. I have a prototype HUD-style webapp builder agent that is online right now if you'd like to check it out:
It doesn't have everything I said above - it's a work in progress. Would love any feedback you have on my take on a more complicated, involved, and narrow-focus agentic workflow. It only builds Flask webapps right now, with strict limits on what it can do (no cron etc. yet), but it does have a database you can use in your projects. I put a lot of work into the error flow as well, as that seems like the biggest issue with a lot of agentic code tools.
One last technical note: I blogged about using AST transformations when getting LLMs to modify code. I think that using diffs or rewriting the whole file isn't the right solution either. I think that having the LLM write code that modifies your code, and then running that code to effect the modifications, is the way forward. We'll see, I guess. Blog post: https://codeplusequalsai.com/static/blog/prompting_llms_to_m...
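To make that pattern concrete, here's a minimal sketch (my own illustration, not the blog post's implementation): the LLM emits a small transformation script, and the agent runs it against the file. The file name, function names, and the rename transformation are all placeholders:

    # Sketch of "LLM writes code that edits the code".
    import ast

    class RenameFunction(ast.NodeTransformer):
        """The kind of script an LLM might emit: rename a function and its call sites."""
        def __init__(self, old: str, new: str):
            self.old, self.new = old, new

        def visit_FunctionDef(self, node):
            if node.name == self.old:
                node.name = self.new
            return self.generic_visit(node)

        def visit_Call(self, node):
            if isinstance(node.func, ast.Name) and node.func.id == self.old:
                node.func.id = self.new
            return self.generic_visit(node)

    source = open("app.py").read()
    tree = RenameFunction("fetch_data", "load_data").visit(ast.parse(source))
    # ast.unparse drops comments and formatting; a CST library like libcst preserves them.
    open("app.py", "w").write(ast.unparse(tree))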
Gemini CLI still uses the archaic whole-file format for edits; it's not a good representative of the current state of coding agents.
Of course, following the Unix philosophy is another way.
Surely listing files, searching a repo, editing a file can all be achieved with bash?
Or is this what's demonstrated by https://news.ycombinator.com/item?id=45001234?
If everything goes through bash, then you need some way to separate always-safe commands that don't need approval (such as listing files) from all other, potentially unsafe, commands that require user approval.
If you have listing files as a separate tool, you can also enforce that the agent doesn't list any files outside of the project directory.
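For what it's worth, that split can live inside the bash tool itself. A rough sketch, where the allowlist and the metacharacter handling are purely illustrative:

    # Illustrative approval gate for a bash tool; the command allowlist is made up.
    import shlex

    ALWAYS_SAFE = {"ls", "cat", "head", "tail", "grep", "rg", "find", "wc"}
    SHELL_META = (";", "&", "|", ">", "<", "`", "$(")

    def needs_approval(command: str) -> bool:
        """Return True unless the command is a single read-only invocation."""
        try:
            argv = shlex.split(command)
        except ValueError:
            return True  # unparseable input is treated as unsafe
        if not argv or argv[0] not in ALWAYS_SAFE:
            return True
        # Metacharacters could chain a second, unsafe command onto a safe one.
        return any(meta in token for token in argv for meta in SHELL_META)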
My best guess is that they started out with a limited subset of tools and realised they could just give it bash later.
One of the reasons you get better performance if you give them the other tools is that there has been some reinforcement learning on Sonnet with all these tools. The model is aware of how these tools work, is more token-efficient with them, and is generally much more successful at performing those actions. The Bash tool, for instance, at times gets confused by bashisms, doesn't escape arguments correctly, doesn't handle whitespace correctly, etc.
This saves the LLM from having to do a lot of low-level clicking and typing and keeps it on track. Help the poor model out, will ya!?
Why the unnecessary AI-generated pictures in between?
Why put everything that could have been a bullet point into its own individual picture (even if it's not AI-generated)? It's very visually distracting, breaks the flow of reading, and it's less accessible since all the pictures lack alt text.
---
I see that it's based on a conference talk, so it's possibly just a 1:1 copy of the slides. If that's the case, please put it up in its native conference format rather than this.
ofirpress•4h ago
ghuntley•4h ago
simonw•3h ago
The whole thing runs on these prompts: https://github.com/SWE-agent/mini-swe-agent/blob/7e125e5dd49...
nivertech•27m ago
faangguyindia•2h ago
That's not the case with a codebase, where things are scattered around according to whatever specific model of organisation the developer had in mind.
koakuma-chan•2h ago
You wish
fmbb•1h ago
https://en.wikipedia.org/wiki/Lumpers_and_splitters
meander_water•2h ago
This prompt snippet from your instance template is quite useful. I use something like this for getting out of debug loops:
> Analyse the codebase and brainstorm a list of potential root causes for the issue, and rank them from most likely to least likely.
> Then create scripts or add debug logging to confirm whether your hypothesis is correct. Rule out root causes in order of likelihood by executing your scripts and observing the output.
Teever•1h ago
BenderV•58m ago
I've built a SWE agent too (for fun), check it out => https://github.com/myriade-ai/autocode
diminish•52m ago
The lack of tools in mini-swe-agent is a feature: you can run it with any LLM, no matter how big or small.