Yet when I use the Codex CLI, or agent mode in any IDE, it feels like o3 regresses to below GPT-3.5 performance. All recent agent-mode models seem completely overfitted to tool calling. The most laughable attempt is Mistral's devstral-small: allegedly the #1 agent model, but outside the scenarios you'd encounter in SWE-bench & co. it completely falls apart.
I notice this at work as well: the more tools you give any model (reasoning or not), the more confused it gets. But the alternative is to stuff massive context into the prompts, and that has no ROI. There's a fine line to be walked here, but no one is even close to it yet.
This YT video (from 2 days ago) demonstrates it https://youtu.be/fQL1A4WkuJk?si=7alp3O7uCHY7JB16
The author builds a drawing app in an hour.
But the article itself also makes the point that a human assistant was necessary. That's gonna be my takeaway.
> This is the single most impressive code-gen project I’ve seen so far. I did not think this was possible yet.
To get that sort of acclaim, a human had to build an embedded programming language from scratch. And even with all that effort, the agent itself took $631 and 119 hours to complete the task. I actually don't think this is a knock on the idea at all; this is the direction I think most engineers should be thinking about.
That agent-built HTTP/2 server they're referencing is apparently the only example of this sort of output they've seen to date. But if you're active in this particular space, especially on the open source side of the fence, this kind of work is everywhere. Since it doesn't manifest as super generic tooling that you can apply to broad task domains as a turnkey solution, it doesn't get much attention.
I've continually held the line that if any given LLM agent platform works well for your use case and you haven't built said agent platform yourself, the underlying problem likely isn't that hard or complex. For the hard problems, you gotta do some first-principles engineering to make these tools work for you.
This is built to be a Lovable/Bolt alternative and is definitely on the early side in terms of total capability and reliability. But once you start digging through the source you realize how much engineering actually went into building it. It's not just chaining prompts together in a dart-throwing exercise and praying for a good result.
This is much closer to the "turnkey" solution vertical I mentioned in my earlier commentary, since it's meant to generically build any web app, but there are a few applied concepts that are shared with the promptyped approach used in the HTTP/2 server (though not as sophisticated as the category theory / type theory approach).
I think it's a good example to work backwards from, though. If you peel the onion a bit you realize how much more tightly you could scope this for more bespoke projects.
It's a very niche app, and I haven't used it much since buying it, but there's that https://repoprompt.com/
It's like listening to professional translators endlessly lament translation software and all its shortcomings and pitfalls, while totally missing that the software is primarily used by property managers wanting to ask the landscapers to cut the grass lower.
LLMs are excellent at writing code for people who have no idea what a programming language is, but who have a good idea of what computers can do when someone can speak this code language to them. I don't need an LLM to one-shot Excel.exe so I can track the number of members vs non-members who come to my community craft fair.
Writing hint: Your last paragraph stands well on its own. Especially if this is, in fact, your actual experience.
Nothing in that paragraph requires the negativity or inaccuracies of the preceding two paragraphs.
There should be a name for the human tendency (we have all done/do it) to weigh down good points with unnecessary and often inaccurate contrast/competition.
You can head off a lot of criticism by not making your point competitive with other reasonable points. I.e. additive to understanding, not subtractive.
Otherwise, you are actually creating the competition between points that you wanted to avoid. And creating your own distractions from your own point.
But agreed, there needs to be a better way for these agents to figure out what context to select. It doesn't seem like this will be too large an issue to solve, though?
This seems true, right now!
But in building out stuff with LLMs, I don't expect (or want) them to do the job end-to-end. I've ~25 merged PRs into a project right now (out of ~40 PRs generated). Most merged PRs I pulled into Zed and cleaned something up. At around PR #10 I went in and significantly restructured the code.
The overall process has been much faster and more pleasant than writing from scratch, and, notably, did not involve me honing my LLM communications skills. The restructuring work I did was exactly the same kind of thing I do on all my projects; until you've got something working it's hard to see what the exact right shape is. I expect I'll do that 2-3 more times before the project is done.
I feel like Kenton Varda was trying to make a point in the way they drove their LLM agent; the point of that project was in part to record the 2025 experience of doing something complicated end-to-end with an agent. That took some doing. But you don't have to do that to get a lot of acceleration from LLMs.
Believe it or not I agree.
Instead we should accept that people will or won't find uses for it depending on what they work on (CRUD app churn vs. somewhat novel creations), without telling them they’re nuts, luddites, etc.
Then again, like I said, the people doing that usually have something to gain, such as a product related to the hype-generating product.
Here’s an example article that hit the front page of HN this week: https://fly.io/blog/youre-all-nuts/
Do you think he’s secretly against the tool itself or do you acknowledge that maybe the tool just doesn’t work for him and his use case and maybe he’s not nuts for finding fault with it?
--the link itself says "you", but that's also addressing your friends I presume?
Edit: the politicians too?
>To the consternation of many of my friends, I’m not a radical or a futurist.
(Apologies if you were tipsy at any point in the relevant parts)
The link: fly.io/blog/YOUre-all-nuts/
(XD)
It's just that... Framing your post in terms of the opinions of devs that you personally know, and not those of the AI-assistance community at large* resolved your issue with ofjcihen. for me (only, it seems :().
*explicitly including e.g., Lisp, Haskell (radical futurists?),
& (dare I mention) SwiftUI devs
Let's repeat this process for 100 coding examples and see how many it can complete "hands-off", especially where (a) it isn't a case of "here is a spec and I need you to implement it" and (b) it isn't for a use for which there is already publicly available code.
Otherwise your claim of "this seems true, right now!" is baseless.
I tried to have an LLM fully write a Python library for me. I wrote an extensive, detailed requirements doc, and it looked close enough that I accepted the code. But as I read through it more closely, I realized it was duplicating code in confusing ways, and overall it took longer than just getting the bones down myself, first.
Some coding agents are now more actively indexing code. I think that should help with this problem.
It’s the first useful “agent” (LLM in a loop + tools) that I’ve tried.
IME it is hard to explain why it’s better than e.g. Aider or Cursor, but once you try it you’ll migrate your workflow pretty quickly.
Or if you want to work more manually, you could do the same but not allow full access to git commit. That way it will request access each time it’s ready to commit and you can take that time to review diffs.
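For anyone who hasn't looked under the hood, here is a minimal sketch of the "LLM in a loop + tools" idea with a permission gate on commits. Everything in it is hypothetical and made up for illustration: `scripted_model` stands in for a real LLM call, and `read_file`, `git_commit`, `NEEDS_APPROVAL` and `run_agent` are invented names, not any real agent's API.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tools; real ones would touch the filesystem and git.
def read_file(arg: str) -> str:
    return f"(contents of {arg})"

def git_commit(arg: str) -> str:
    return f"committed: {arg}"

TOOLS: Dict[str, Callable[[str], str]] = {
    "read_file": read_file,
    "git_commit": git_commit,
}

# Tools that require a human yes/no before they run (an assumption of this sketch).
NEEDS_APPROVAL = {"git_commit"}

def scripted_model(history: List[str]) -> Tuple[str, str]:
    """Stand-in for an LLM call: returns (tool_name, argument) or ("done", summary)."""
    tool_calls = sum(1 for entry in history if entry.startswith("tool:"))
    script = [("read_file", "app.py"), ("git_commit", "fix layout"), ("done", "finished")]
    return script[min(tool_calls, len(script) - 1)]

def run_agent(
    model: Callable[[List[str]], Tuple[str, str]],
    approve: Callable[[str, str], bool],
    max_steps: int = 10,
) -> List[str]:
    """Ask the model for the next action, run it, feed the result back, repeat."""
    history: List[str] = []
    for _ in range(max_steps):
        name, arg = model(history)
        if name == "done":
            history.append(f"done: {arg}")
            break
        # Gated tools pause here so the human can review before anything happens.
        if name in NEEDS_APPROVAL and not approve(name, arg):
            history.append(f"tool:{name} -> denied by user")
            continue
        history.append(f"tool:{name} -> {TOOLS[name](arg)}")
    return history

if __name__ == "__main__":
    # Deny every gated action; a real setup would prompt in the terminal instead.
    for line in run_agent(scripted_model, approve=lambda name, arg: False):
        print(line)
```

The point of the gate is that the approval callback is where you would stop and review the diff. Allowing full access just means that callback always says yes.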
I have Cursor through work but I am tempted to shell out $100 because of this hype.
Is it better than using Claude models in Cursor?
Today I spent easily half an hour trying to make it solve a layout issue it itself introduced when porting a component.
It was a complex port it executed perfectly. And then it completely failed to even create a simple wrapper that fixed a flexbox issue.
BTW. Claude (Code and Cursor) is over-indexed on "let's randomly add and remove h-full/overflow-auto and pretend it works ad infinitum"
Yeah, this is the problem with vibe coding. It's hard to understand and keep tabs on the nitty-gritty when stuff is being generated for you. No matter how much you 'review' it, it just doesn't stick the same way it would if you were writing the code yourself. You are really screwed if you have to debug something that the LLM throws its hands up on.
It's definitely on point with some strategic layout items, flexbox, etc., but when it comes to anything like colors, margins, padding, typeface, borders, etc., you might as well be throwing darts into the void.
I can't even with the ego here. The best teachers practice humility.