Yet when I use the Codex CLI, or agent mode in any IDE, it feels like o3 regresses to below GPT-3.5 performance. All recent agent-mode models seem completely overfitted to tool calling. The most laughable attempt is Mistral's devstral-small: allegedly the #1 agent model, but step outside the scenarios you'd encounter in SWE-bench & co. and it completely falls apart.
I notice this at work as well: the more tools you give any model (reasoning or not), the more confused it gets. But the alternative is to stuff massive context into the prompts, and that has no ROI. There's a fine line to be walked here, but no one is even close to it yet.
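One way to walk it, as a toy sketch (my own illustration, not any vendor's actual approach): score the registered tools against each request and hand the model only the top few schemas, instead of the whole registry. Naive word overlap stands in for a real embedding model here, and all the tool names are made up:

    TOOLS = {
        "read_file":  "Read a file from the repository",
        "write_file": "Write new contents to a file",
        "run_tests":  "Run the project's test suite",
        "web_search": "Search the web for documentation",
    }

    def select_tools(request: str, k: int = 2) -> list[str]:
        # Rank tools by how many words their description shares
        # with the request; expose only the top k schemas.
        req_words = set(request.lower().split())
        def overlap(name: str) -> int:
            return len(req_words & set(TOOLS[name].lower().split()))
        return sorted(TOOLS, key=overlap, reverse=True)[:k]

    print(select_tools("run the test suite and fix the failures"))
    # -> ['run_tests', 'read_file']

The point isn't the scoring function, it's that the model only ever sees a handful of schemas per turn instead of dozens.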
This YT video (from 2 days ago) demonstrates it: https://youtu.be/fQL1A4WkuJk?si=7alp3O7uCHY7JB16
The author builds a drawing app in an hour.
But the article itself makes the point that a human assistant was also necessary. That's gonna be my takeaway.
It's like listening to professional translators endlessly lament translation software and all its shortcomings and pitfalls, while totally missing that the software is primarily used by property managers wanting to ask the landscapers to cut the grass lower.
LLMs are excellent at writing code for people who have no idea what a programming language is, but who have a good idea of what computers can do when someone speaks their language to them. I don't need an LLM to one-shot Excel.exe just so I can track the number of members vs non-members who come to my community craft fair.
But agreed, there needs to be a better way for these agents to figure out what context to select. It doesn't seem like it will be too hard a problem to solve, though?
This seems true, right now!
But in building out stuff with LLMs, I don't expect (or want) them to do the job end-to-end. I've got ~25 merged PRs into a project right now (out of ~40 PRs generated). For most of the merged PRs, I pulled them into Zed and cleaned something up. At around PR #10 I went in and significantly restructured the code.
The overall process has been much faster and more pleasant than writing from scratch, and, notably, did not involve me honing my LLM communication skills. The restructuring work I did was exactly the same kind of thing I do on all my projects; until you've got something working, it's hard to see what the exact right shape is. I expect I'll do that 2-3 more times before the project is done.
I feel like Kenton Varda was trying to make a point in how they drove their LLM agent; that project was partly meant to record the 2025 experience of doing something complicated end-to-end with an agent. That took some doing. But you don't have to do that to get a lot of acceleration from LLMs.
It’s the first useful “agent” (LLM in a loop + tools) that I’ve tried.
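For anyone who hasn't internalized what that means, the whole pattern fits in a page. A minimal sketch, using the OpenAI chat-completions API purely as a stand-in (Claude Code's internals aren't public, so this is illustrative, not its actual implementation):

    import json
    import subprocess
    from openai import OpenAI

    client = OpenAI()

    # One tool: let the model run shell commands.
    TOOLS = [{
        "type": "function",
        "function": {
            "name": "run_shell",
            "description": "Run a shell command and return its output",
            "parameters": {
                "type": "object",
                "properties": {"command": {"type": "string"}},
                "required": ["command"],
            },
        },
    }]

    def run_shell(command: str) -> str:
        res = subprocess.run(command, shell=True, capture_output=True, text=True)
        return res.stdout + res.stderr

    def agent_loop(task: str) -> str:
        messages = [{"role": "user", "content": task}]
        while True:
            msg = client.chat.completions.create(
                model="gpt-4o", messages=messages, tools=TOOLS,
            ).choices[0].message
            messages.append(msg)
            if not msg.tool_calls:       # no tool requested: model is done
                return msg.content
            for call in msg.tool_calls:  # run each tool, feed output back in
                args = json.loads(call.function.arguments)
                messages.append({
                    "role": "tool",
                    "tool_call_id": call.id,
                    "content": run_shell(**args),
                })

Everything that makes a tool like Claude Code pleasant presumably lives on top of this skeleton: the prompts, the built-in tools, and the permission model.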
IME it is hard to explain why it’s better than e.g. Aider or Cursor, but once you try it you’ll migrate your workflow pretty quickly.
Or if you want to work more manually, you could do the same but not allow full access to git commit. That way it will request access each time it’s ready to commit and you can take that time to review diffs.
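If I remember the settings format right (double-check the docs), the allow list lives in .claude/settings.json with per-tool patterns, and anything not pre-approved triggers the per-use prompt described above:

    {
      "permissions": {
        "allow": [
          "Bash(git diff:*)",
          "Bash(git log:*)",
          "Bash(git status:*)"
        ]
      }
    }

git commit isn't on the list, so the agent has to ask each time it wants to commit, which is your cue to review the diff.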
I have Cursor through work, but I am tempted to shell out the $100 because of this hype.
Is it better than using Claude models in Cursor?
Today I spent easily half an hour trying to make it solve a layout issue it had itself introduced when porting a component.
It was a complex port, and it executed it perfectly. And then it completely failed to even create a simple wrapper to fix a flexbox issue.
BTW, Claude (both Code and in Cursor) is over-indexed on "let's randomly add and remove h-full/overflow-auto and pretend it works", ad infinitum.