It's an interesting take, one that I believe could be true, but it sounds more like an opinion than a thesis or even fact.
Every hype cycle goes through some variation of this evolution. As much as folks try to say AI is different, it's following the very same predictable hype-cycle curve.
So far it's just reinforcing my feeling that none of this is actually used at scale. We use AI as relatively dumb companions, let it go wilder on side projects which have looser constraints, and agents are pure hype (or for very niche use cases).
They came in primed against agentic workflows. That is fine. But they also came in without providing anything that might have given other people the chance to show that their initial assumptions were flawed.
I've been working with agents daily for several months. Still learning what fails and what works reliably.
Key insights from my experience:
- You need a framework (like agent-os or similar) to orchestrate agents effectively
- Balance between guidance and autonomy matters
- Planning is crucial, especially for legacy codebases
Recent example: Hit a wall with a legacy system where I kept maxing out the context window with essential background info. After compaction, the agent would lose critical knowledge and repeat previous mistakes.
Solution that worked:
- Structured the problem properly
- Documented each learning/discovery systematically
- Created specialized sub-agents for specific tasks (keeps context windows manageable)
Only then could the agent actually help navigate that mess of legacy code.
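Roughly the shape of that orchestration, as a minimal sketch (call_agent is a hypothetical stand-in for whatever SDK or CLI actually runs the model; the file paths are made up):

    # Sketch of the sub-agent pattern: each subtask gets a fresh, small context
    # plus a persistent learnings file, instead of one ever-growing conversation.
    from pathlib import Path

    LEARNINGS = Path("docs/llm-learnings.md")  # survives compaction / new sessions

    def call_agent(system_prompt: str, task_prompt: str) -> str:
        # Hypothetical helper: replace with your agent SDK or CLI of choice.
        raise NotImplementedError

    def run_subtask(task: str, relevant_files: list[str]) -> str:
        code = "\n\n".join(Path(f).read_text() for f in relevant_files)
        system = (
            "You are handling one narrow subtask of a larger legacy-code effort.\n"
            "Known pitfalls and prior discoveries:\n" + LEARNINGS.read_text()
        )
        result = call_agent(system, f"{task}\n\nRelevant code:\n{code}")
        # Record what was learned so later sub-agents don't repeat old mistakes.
        with LEARNINGS.open("a") as fh:
            fh.write(f"\n## {task}\n{result}\n")
        return result

The code itself isn't the point; the point is that each sub-agent only ever sees the files and notes it needs, so no single context window has to hold the whole mess.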
My experience is that once I switch to this mode when something blows up, I'm basically stuck with a bunch of code that I only sort of know, even though I reviewed it. I just don't have the same insight as I would if I wrote the code, no ownership, even if it was committed in my name. Any misconceptions I had about how things work, I will still have, because I never had to work through the solution, even if I ended up with a final working solution.
Of course there are many more bugs they'll currently not find, but when this strategy costs next to nothing (compared to a SWE spending an hour spelunking) and still works sometimes, the trade-off looks pretty good to me.
Unlike the model providers, Cursor has to pay the retail price for LLM usage. They're fighting an ugly marginal price war. If you're paying more for inference than your competitors, you have to either 1) deliver performance equal to theirs at a loss or 2) economize by feeding smaller contexts to the model providers.
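To put rough, purely illustrative numbers on it: at a retail rate of $3 per million input tokens (a made-up round figure), re-sending a 100k-token conversation costs about $0.30 per request before a single output token, while pruning it to 30k tokens costs roughly $0.09. Multiplied across millions of requests a day, aggressive context pruning is the obvious lever for whoever pays retail.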
Cursor is not transparent about how it handles context. From my experience, it's clear that they use aggressive strategies to prune conversations, to the extent that it's not uncommon for Cursor to have to re-reference the same file multiple times in the same conversation just to know what's going on.
My advice to anyone using Cursor is to just stop wasting your time. The code it generates creates so much debt. I've moved on to Codex and Claude and I couldn't be happier.
I just feel that models are currently not up to par with experienced engineers: it takes less time to develop something yourself than to instruct the model to do it. It is only useful for boring work.
This is not to say that these tools didn't create opportunities to build new stuff; it is just that the hype overestimates the usefulness of the tools so they can be sold better, just like everything else.
I work on agentic systems, and they can be good if the agent has a bite-sized chunk of work it needs to do. The problem with coding agents is that for every more complex thing you need to write a big prompt, which is sometimes counterproductive, and it seems to me that the user in the Cursor thread is pointing in that direction.
The simplest explanation would be “You’re using it wrong…”, but I have the impression that this is not the primary reason. (Although, as an AI systems developer myself, you would be surprised by the number of users who simply write “fix this” or “generate the report” and then expect an LLM to correctly produce the complex thing they have in mind.)
It is true that there is an “upper management” hype of trying to push AI into everything as a magic solution for all problems. There is certainly an economic incentive from a business valuation or stock price perspective to do so, and I would say that the general, non-developer public is mostly convinced that AI is actually artificial intelligence, rather than a very sophisticated next-word predictor.
While claiming that an LLM cannot follow a simple instruction sounds, at best, very unlikely, it remains true that these models cannot reliably deliver complex work.
That is not reliable, that's the opposite of reliable.
> [..] possibly the repo is too far off the data distribution.
(Karpathy's quote)
Two of the key skills needed for effective use of LLMs are writing clear specifications (written communication), and management, skills that vary widely among developers.
Take Joe. Joe sticks with AI and uses it to build an entire project. Hundreds of prompts. Versus your average HNer who thinks he’s the greatest programmer in the company and thinks he doesn’t need AI but tries it anyway. Then AI fails and fulfills his confirmation bias and he never tries it again.
But mostly my experience is that people who regularly get good output from AI coding tools fall into these buckets:
A) Very limited scope (e.g. single, simple method with defined input/output in context)
B) Aren't experienced enough in the target domain to see the problems with the AI's output (let's call this "slop blindness")
C) Use AI to force multiple iterations of the same prompt to "shake out the bugs" automatically instead of using the dev's time
I don't see many cases outside of this.
Some developers will either retrospectively change the spec in their head or are basically fine with the slight deviation. Other developers will be disappointed, because the LLM didn't deliver on the spec they clearly hold in their head.
It's a bit like a psychological false-memory effect where you misremember, and/or some people are more flexible in their expectations and accept "close enough" while others won't.
At least, I noticed both behaviors in myself.
Both situations need an iterative process to fix and polish before the task is done.
The notable thing for me was, we crossed a line about six months ago where I'd need to spend less time polishing the LLM output than I used to have to spend working with junior developers. (Disclaimer: at my current place-of-work we don't have any junior developers, so I'm not comparing like-with-like on the same task, so may have some false memories there too.)
But I think this is why some developers have good experiences with LLM-based tools. They're not asking "can this replace me?" they're asking "can this replace those other people?"
The simplest explanation is that most of us are code monkeys reinventing the same CRUD wheel over and over again, gluing things together until they kind of work and calling it a day.
"developers" is such a broad term that it basically is meaningless in this discussion
lol.
another option is trying to convince yourself that you have any idea what the other 2,000,000 software devs are doing and think you can make grand, sweeping statements about it.
there is no stronger mark of a junior than the sentiment you're expressing
My hypothesis is that developers work on different things, and while these models might work very well for some domains (React components?) they will fail quickly in others (embedded?). So on one side we have developers working on X (LLM good at it) claiming that it will revolutionize development forever, and on the other side we have developers working on Y (LLM bad at it) claiming that it's just a fad.
Some possible reasons:
* different models used by different folks, free vs paid ones, various reasoning effort, quantizations under the hood and other parameters (e.g. samplers and temperature)
* different tools used, like in my case I've found Continue.dev to be surprisingly bad, Cline to be pretty decent but also RooCode to be really good; also had good experiences with JetBrains Junie, GitHub Copilot is *okay*, but yeah, lots of different options and settings out there
* different system prompts, various tool use cases (e.g. let the model run the code tests and fix them itself), as well as everything ranging from simple and straightforward codebases that are a dime a dozen out there (and in the training data) vs something genuinely new that would trip up both your average junior dev and the LLMs
* open-ended vs well-specified tasks: feeding in the proper context, starting new conversations/tasks when things go badly, and offering examples so the model has more to go off of (it can predict something closer to what you actually want). Most of my prompts at this point are multiple sentences, up to a dozen, alongside code/data examples, plus prompting the model to ask me questions about what I want before doing the actual implementation (see the example after this list)
* also, sometimes individual models produce bad output for specific use cases; I generally rotate between Sonnet 4.5, Gemini Pro 2.5, GPT-5, and also use Qwen 3 Coder 480B running on Cerebras for the simpler tasks I need done quickly
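For what it's worth, a typical prompt of mine looks something like this hypothetical example (the project, files and feature are invented):

    Add CSV export to the reporting page (src/reports/).
    Follow the existing export pattern used for PDF in src/reports/pdf_export.ts.
    Constraints: no new dependencies, keep the column order from ReportTable,
    format dates as ISO 8601. Expected output looks like:
      id,created_at,total
      42,2024-01-31,199.00
    Before writing any code, ask me whatever questions you have about edge
    cases (empty reports, unicode in names, very large exports).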
With all of that, my success rate is pretty great and the statement about the tech not being able to "...barely follow a simple instruction" holds untrue. Then again, most of my projects are webdev adjacent in mostly mainstream stacks, YMMV.

This is probably the most significant part of your answer. You are asking it to do things for which there are a ton of examples in the training data. You described narrowing the scope of your requests too, which tends to be better.
In the fixed world of mathematics, everything could in principle be great. In software, it can in principle be okay even though contexts might be longer. When dealing with new contexts in something like real life, but different-- such as a story where nobody can communicate with the main characters because they speak a different language, then the models simply can't deal with it, always returning to the context they're familiar with.
When you give them contexts that are different enough from the kind of texts they've seen, they do indeed fail to follow basic instructions, even though they can follow seemingly much more difficult instructions in other contexts.
This is just what I observe on HN, I don't doubt there's actual devs (rather than the larping evangelist AI maxis) out there who actually get use out of these things but they are pretty much invisible. If you are enthusiastic about your AI use, please share how the sausage gets made!
The question assumes that all developers do the same work. The kind of work done by an embedded dev is very different from the work of a front-end dev which is very different from the kind of work a dev at Jane Street does. And even then, devs work on different types of projects: greenfield, brownfield and legacy. Different kind of setups: monorepo, multiple repos. Language diversity: single language, multiple languages, etc.
Devs are not some kind of monolith army working like robots in a factory.
We need to look at these factors before we even consider any sort of ML.
But the other thing is that your expectations normalise, and you will hit its limits more often if you are relying on it more. You will inevitably be unimpressed by it the longer you use it.
If I use it here and there, I am usually impressed. If I try to use it for my whole day, I am thoroughly unimpressed by the end, having had to re-do countless things it "should" have been capable of based on my own past experience with it.
My guess is that for some types of work people don't know what the complex thing they have in mind is ex ante. The idea forms and is clarified through the process of doing the work. For those types of task there is no efficiency gain in using AI to do the work.
For those types of tasks it probably takes the same amount of time to form the idea without AI as with AI, this is what Metr found in its study of developer productivity.
https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o... https://arxiv.org/abs/2507.09089
Xkcd's "Uncomfortable Truths Well" said, "You will never find a programming language that frees you from the burden of clarifying your ideas." LLMs don't fundamentally change that dynamic.
It still takes a lot of practice to get good at prompting, though.
There is a bit of overlap between the stuff you use agents for and the stuff that AI is good at, like generating a bunch of boilerplate for a new thing from scratch. That makes agent mode a more convenient way for me to interact with AI for the stuff it's useful for in my case. But my experience with these tools is still quite limited.
"when you hear 'intelligent agent'; think 'trainable ant'"
CEOs, AI "thought leaders," and VCs are advertising LLMs as magic, and tools like v0 and Lovable as the next big thing. Every response from leaders is some variation of https://www.youtube.com/watch?v=w61d-NBqafM
On the ground, we know that creating CLAUDE.md or cursorrules basically does nothing. It’s up to the LLM to follow instructions, and it does so based on RNG as far as I can tell. I have very simple, basic rules set up that are never followed. This leads me to believe everyone posting on that thread on Cursor is an amateur.
Beyond this, if you’re working on novel code, LLMs are absolutely horrible at doing anything. A lot of assumptions are made, non-existent libraries are used, and agents are just great at using tokens to generate no tangible result whatsoever.
I’m at a stage where I use LLMs the same way I would use speech-to-text (code) - telling the LLM exactly what I want, what files it should consider, and it adds _some_ value by thinking of edge cases I might’ve missed, best practices I’m unaware of, and writing better grammar than I do.
Edit:
[1] To add to this, any time you use search or Perplexity or what have you, the results come from all this marketing garbage being pumped into the internet by marketing teams.
This is spot on. Current state-of-the-art models are, in my experience, very good at writing boilerplate code or very simple architecture especially in projects or frameworks where there are extremely well-known opinionated patterns (MVC especially).
What they are genuinely impressive at is parsing through large amounts of information to find something (eg: in a codebase, or in stack traces, or in logs). But this hype machine of 'agents creating entire codebases' is surely just smoke and mirrors - at least for now.
1. LLMs would suck at coming up with new algorithms.
2. I wouldn't let an LLM decide how to structure my code. Interfaces, module boundaries etc
Other than that, given the right context (the SDK doc for a unique piece of hardware, for example) and a well-organised codebase explained using CLAUDE.md, they work pretty well in filling out implementations. Just need to resist the temptation to prompt while the actual typing would take seconds.

Of course, if you don't know what you are looking for, it can make that process much easier. I think this is why people at the junior end find it is making them (a claimed) 10x more productive. But people who have been around for a long time are more skeptical.
To be fair, this is super, super helpful.
I do find LLMs helpful for search and providing a bunch of different approaches for a new problem/area though. Like, nothing that couldn't be done before but a definite time saver.
Finally, they are pretty good at debugging, they've helped me think through a bunch of problems (this is mostly an extension of my point above).
Hilariously enough, they are really poor at building MCP like stuff, as this is too new for them to have many examples in the training data. Makes total sense, but still endlessly amusing to me.
I know I could be eating my words, but there is basically no evidence to suggest it ever becomes as exceptional as the kingmakers are hoping.
Yes it advanced extremely quickly, but that is not a confirmation of anything. It could just be the technology quickly meeting us at either our limit of compute or its limit of capability.
My thinking here is that we already had the technologies of the LLMs and the compute, but we hadn't yet had the reason and capital to deploy it at this scale.
So the surprising innovation of transformers did not give us the boost in capability by itself; it still needed scale. The marketing that attracted the capital that enables that scale is what caused the insane growth, and capital can't grow forever, it needs returns.
Scale has been exponential, and we are hitting an insane amount of capital deployment for this one technology that has yet to prove commercially viable at the scale of a paradigm shift.
Are businesses that are not AI based, actually seeing ROI on AI spend? That is really the only question that matters, because if that is false, the money and drive for the technology vanishes and the scale that enables it disappears too.
To comment on this, because it's the most common counterargument: most technology has worked in steps. We take a step forward, then iterate on essentially the same thing. It's very rare that we see order-of-magnitude improvement on the same fundamental "step".
Cars were quite a step forward from donkeys, but modern cars are not that far off from the first ones. Planes were an amazing invention, but the next model of plane is basically the same thing as the first one.
I think we would need another leap to actually meet the market's expectations of AI. The market is expecting AGI, but I think we are probably just going to make incremental improvements to language and multimodal models from here, and not meet those expectations.
I think the market is relying on something that doesn't currently exist to become true, and that is a bit irrational.
The explosion of compute and investment could mean that we have more researchers available for that event to happen, but at the same time transformers are sucking up all the air in the room.
??? It has already become exceptional. In 2.5 years (since chatgpt launched) we went from "oh, look how cute this is, it writes poems and the code almost looks like python" to "hey, this thing basically wrote a full programming language[1] with genz keywords, and it mostly works, still has some bugs".
I think goalpost moving is at play here, and we quickly forget how much difference one year makes (last year you needed tons of glue and handwritten harnesses to do anything - see aider); today you can give them a spec and get a mostly working project (albeit with some bugs), $50 later.
I am not saying it's impossible, but there is no evidence that the leap in technology needed to reach the wild profitability (replacing general labour) that such investment desires is just around the corner either.
Let's say we found a company that has already realized 5-10% of savings in the first step. Based on this, we might be able to map out the path to 25-30% savings in 5% steps, for example.
I personally haven’t seen this, but I might have missed it as well.
It did, but it's kind of stagnated now, especially on the LLM front. The time when a groundbreaking model came out every week is over for now. Later revisions of existing models, like GPT-5 and Llama 4, have been underwhelming.
Which makes sense, considering the absolutely massive amount of tutorials and basic HOWTOs that were present in the training data, as they are the easiest kind of programming content to produce.
To that end, it doesn't matter if it works or not, it just has to demo well.
Don't forget HackerNews.
Every single new release from OpenAI and other big AI firms attracts a lot of new accounts posting surface-level comments like "This is awesome" and then a few older accounts that have exclusively posted on previous OpenAI-related news to defend them.
It's glaringly obvious, and I wouldn't be surprised if at least a third of the comments on AI-related news is astroturfing.
That said, I have no doubt there are also bots setting out to generate FOMO
https://mitchellh.com/writing/non-trivial-vibing went round here recently, so clearly LLMs are working in some cases.
Or the "I created 30 different .md instruction files and AI model refactored/wrote from scratch/fixed all my bugs" trope.
> a third of the comments on AI-related news is astroturfing.
I wouldn't be surprised if it's even more than that. And, ironically, the astroturfing is probably aided by the capability of said models to spew out text.
https://techcrunch.com/2025/09/08/sam-altman-says-that-bots-...
It's implemented methods I'd have to look up in books to even know about, and shown that it can get them working. It may not do much truly "novel" work, but very little code is novel.
They follow instructions very well if structured right, but you can't just throw random stuff in CLAUDE.md or similar. The biggest issue I've run into recently is that they need significant guidance on process. My instructions tend to focus on three separate areas: 1) debugging guidance for a given project - for my compiler project, that means things like "here's how to get an AST dumped from the compiler" and "use gdb to debug crashes" (it sometimes did that without being told, but not consistently; with the instructions it usually does); 2) acceptance criteria - this does need reiteration; 3) telling it to run tests frequently, make small, testable changes, and frequently update a detailed file outlining the approach to be taken, progress towards it, and any outcomes of investigation during the work.
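To make that concrete, here is a trimmed-down, hypothetical excerpt of the kind of instructions file I mean (the commands and paths are made up for illustration):

    ## Debugging this project
    - Dump the AST with `./build/cc --dump-ast <file>` before guessing at parser issues.
    - For crashes, reproduce under gdb (`gdb --args ./build/cc test.c`) and get a backtrace first.

    ## Acceptance criteria
    - A change is done only when `make test` passes and the new behaviour has its own test.

    ## Process
    - Make small, testable changes; run the test suite after each one.
    - Keep docs/PLAN.md updated with the approach, progress, and findings from any investigation.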
My experience is that with those three things in place, I can have Claude run for hours with --dangerously-skip-permissions and only step in to say "continue" or do a /compact in the middle of long runs, with only the most superficial checks.
It doesn't always provide perfect code every step. But neither do I. It does however usually move in the right direction every step, and has consistently produced progress over time with far less effort on my behalf.
I wouldn't yet have it start from scratch without at least some scaffolding that is architecturally sound, but it can often do that too, though that needs review before it "locks in" a bad choice.
I'm at a stage where I'm considering harnesses to let Claude work on a problem over the course of days without human intervention instead of just tens of minutes to hours.
But that is exactly the problem, no?
It is like, when you need some prediction (e.g. about market behavior), knowing that somewhere out there there is a person who will make the perfect one. However, instead of your problem being to make the prediction, now it is how to find and identify that expert. Is that type of problem that you converted yours into any less hard though?
I too had some great minor successes, the current products are definitely a great step forward. However, every time I start anything more complex I never know in advance if I end up with utterly unusable code, even after corrections (with the "AI" always confidently claiming that now it definitely fixed the problem), or something usable.
All those examples such as yours suffer from one big problem: They are selected afterwards.
To be useful, you would have to make predictions in advance and then run the "AI" and have your prediction (about its usefulness) verified.
Selecting positive examples after the work is done is not very helpful. All it does is prove that at least sometimes somebody gets something useful out of using an LLM for a complex problem. Okay? I think most people understand that by now.
PS/Edit: Also, success stories we only hear about but cannot follow and reproduce may have been somewhat useful initially, but by now most people are beyond that, willing to give it a try, and would like to have a link to the working and reproducible example. I understand that work can rarely be shared, but then those examples are not very useful any more at this point. What would add real value for readers of these discussions now is when people who say they were successful posted the full, working, reproducible example.
EDIT 2: Another thing: I see comments from people who say they did tweak CLAUDE.md and got it to work. But the point is predictability and consistency! If you have that one project where you twiddled around with the file and added random sentences that you thought could get the LLM to do what you need, that's not very useful. We already know that trying out many things sometimes yields results. But we need predictability and consistency.
We are used to being able to try stuff, and when we get it working we could almost always confidently say that we found the solution, and share it. But LLMs are not that consistent.
I don't agree with this. LLMs will go out of their way to follow any instruction they find in their context.
(E.g. I have "I love napkin math" in my Kagi Agent Context, and every LLM will try to shoehorn some kind of napkin math into every answer.)
Cursor and co. do not follow these instructions because the instructions either:
(a) never make it into the context in the first place, or (b) fall out of the context window.
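A minimal sketch of mechanism (b), assuming (my guess, not Cursor's documented behaviour) a harness that assembles the request by keeping only the most recent material once it exceeds a token budget:

    # Toy illustration: if the harness drops from the front to fit a budget,
    # the rules file at the start of the conversation is the first thing to go.
    MAX_TOKENS = 8_000

    def count_tokens(text: str) -> int:
        return len(text.split())  # crude stand-in for a real tokenizer

    def build_prompt(rules: str, turns: list[str]) -> str:
        parts = [rules] + turns
        while sum(count_tokens(p) for p in parts) > MAX_TOKENS and len(parts) > 1:
            parts.pop(0)  # oldest content (the rules) is evicted first
        return "\n\n".join(parts)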
While I agree, the only cases where I actually created something even barely useful (while still of subpar quality) were after putting lines like these in CLAUDE.md:
YOUR AIM IS NOT TO DELIVER A PROJECT. YOUR AIM IS TO DO DEEP, REPETITIVE E2E TESTING. ONLY E2E TESTS MATTER. BE EXTREMELY PESSIMISTIC. NEVER ASSUME ANYTHING WORKS. ALWAYS CHECK EVERY FEATURE IN AT LEAST THREE DIFFERENT WAYS. USE ONLY E2E TESTS, NEVER USE OTHER TYPES OF TEST. BE EXTREMELY PESSIMISTIC. NEVER TRUST ANY CODE UNLESS YOU DEEPLY TEST IT E2E
REMEMBER, QUICK DELIVERY IS MEANINGLESS, IT'S NOT YOUR AIM. WORK VERY SLOWLY, STEP BY STEP. TAKE YOUR TIME AND RE-VERIFY EACH STEP. BE EXTREMELY PESSIMISTIC
With this kind of setup, it kind of attempts to work in a slightly different way than it normally does and is able to build some very basic stuff, although frankly I'd do it much better, so I'm not sure about the economics here. Maybe for people who don't care or won't be maintaining this code it doesn't matter, but personally I'd never use it in my workplace.
This is my most consistent experience. It is great at catching the little silly things we do as humans. As such, I have found them to be most useful as PR reviewers, which you take with a pinch of salt.
It's great some of the time, but the great draw of computing was that it would always catch the silly things we do as humans.
If it didn't, we'd change the code and the next time (and forever onward) it would catch that case too.
Now we're playing whack-a-mole and pleading with words like "CRITICAL" and bold text in our .cursorrules to try and make the LLM pay attention; maybe it works today, might not work tomorrow.
Meanwhile the C-suite pushing these tools onto us still happily blame the developers when there's a problem.
Not my experience. I've used LLMs to write highly specific scientific/niche code and they did great, but obviously I had to feed them the right context (compiled from various websites and books converted to markdown, in my case) to understand the problem well enough. That adds additional work on my part, but the net productivity is still very much positive because it's a one-time setup cost.
Telling LLMs which files they should look at was indeed necessary 1-2 years ago in early models, but I have not done that for the last half year or so, and I'm working on codebases with millions of lines of code. I've also never had modern LLMs use nonexistent libraries. Sometimes they try to use outdated libraries, but it fails very quickly once they try to compile and they quickly catch the error and follow up with a web search (I use a custom web search provider) to find the most appropriate library.
I'm convinced that anybody who says that LLMs don't work for them just doesn't have a good mental model of HOW LLMs work, and thus can't use them effectively. Or their experience is just outdated.
That being said, the original issue that they don't always follow instructions from CLAUDE/AGENT.md files is quite true and can be somewhat annoying.
Which language are you using?
I've been 5x more productive using codex-cli for weeks. I have no trouble getting it to convert a combination of unusually-structured source code and internal SVGs of execution traces to a custom internal JSON graph format - very clearly out-of-domain tasks compared to their training data. Or mining a large mixed python/C++ codebase including low-level kernels for our RISCV accelerators for ever-more accurate docs, to the level of documenting bugs as known issues that the team ran into the same day.
We are seeing wildly different outcomes from the same tools and I'm really curious about why.
The "social kool-aid" side is even worse. A lot of very rich and very influential people have bet their career on AI - especially large companies who just outright fired staff to be replaced both by actual AI and "Actually Indians" [2] and are now putting insane pressure on their underlings and vendors to make something that at least looks on the surface like the promised AI dreams of getting rid of humans.
Both in combination explains why there is so much half-baked barely tested garbage (or to use the term du jour: slop) being pushed out and force fed to end users, despite clearly not being ready for prime time. And on top of that, the Pareto principle also works for AI - most of what's being pushed is now "good enough" for 80%, and everyone is trying to claim and sell that the missing 20% (that would require a lot of work and probably a fundamentally new architecture other than RNG-based LLMs) don't matter.
[1] https://www.bbc.com/news/articles/cz69qy760weo
[2] https://www.osnews.com/story/142488/ai-coding-chatbot-funded...
e.g. https://www.noahpinion.blog/p/americas-future-could-hinge-on...
Is it not a disaster already? The fast slide towards autocracy should certainly be viewed as a disaster if nothing else.
The others tried it, ran into the obvious Achilles heels, and are now pretty cautious. But they still use it for a thing or two.
1. New conversation. Describe at a high level what change I want made. Point out the relevant files for the LLM to have context. Discuss the overall design with the LLM. At the end of that conversation, ask it to write out a summary (including relevant files to read for context next time) in an "epic" document in llm/epics/. This will almost always have several steps, listed in the document.
Then I review this and make sure it's in line with what I want.
2. New conversation. We're working on @llm/epics/that_epic.md. Please read the relevant files for context. We're going to start work on step N. Let me know if you have any questions; when you're ready, sketch out a detailed plan of implementation.
I may need to answer some questions or help it find more context; then it writes a plan. I review this plan and make sure it's in line with what I want.
3. New conversation. We're working on @llm/epics/that_epic.md. We're going to start implementing step N. Let me know if you have any questions; when you're ready, go ahead and start coding.
Monitor it to make sure it doesn't get stuck. Any time it starts to do something stupid or against the pattern of what I'd like -- from style, to hallucinating (or forgetting) a feature of some sub-package -- add something to the context files.
Repeat until the epic is done.
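For reference, the epic files themselves are nothing fancy; a skeleton looks roughly like this (the feature and file names are invented for illustration):

    llm/epics/rate_limiting.md  (hypothetical example)

    Goal: reject requests over the per-key limit with a 429 and a Retry-After header.
    Relevant files for context: src/middleware/auth.py, src/config/limits.py, docs/api.md

    Steps:
    1. Add a token-bucket limiter behind a small interface (no wiring yet).
    2. Wire it into the auth middleware behind a config flag.
    3. Expose limits in config, add metrics, document the 429 behaviour.

    Notes / decisions so far: in-memory buckets are fine for v1; Redis only if we shard.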
If this sounds like a lot of work, it is. As xkcd's "Uncomfortable Truths Well" said, "You will never find a programming language that frees you from the burden of clarifying your ideas." LLMs don't fundamentally change that dynamic. But they do often come up with clever solutions to problems; their "stupid questions" often helps me realize how unclear my thinking is; they type a lot faster, and they look up documentation a lot faster too.
Sure, they make a bunch of frustrating mistakes when they're new to the project; but if every time they make a patterned mistake, you add that to your context somehow, eventually these will become fewer and fewer.
Without mentioning what the LLMs are failing or succeeding at, it's all noise.
Yet it does the bulk of the work. It saves brain energy, which then goes into the edge cases. The overall time is the same; it's just that the result could become more robust in the end. Only with good supervision, though! (Which has a better chance when we are not worn out by the tedious heavy-lifting part.)
But the one undebatable benefit is that the user can feel like the smartest person in the whole wide world, having such 'excellent questions', 'knowing the topic like a pro', or being 'fantastic to spot such subtle details'. Anyone feeling inadequate should use an agentic AI to boost their morale! (Well, only if they don't get nauseous from that thick flattery.)
Can we please make it a point to share the following information when we talk about experiences with code bots?
1) Language - gives us an idea if the language has a large corpus of examples or not
2) Project - what were you using it for?
3) Level of experience - neophyte coder? Dunning-Kruger uncertainty? Experience in managing other coders? Understanding of project implementation best practices?
From what I can tell/suspect, these 3 features are the likely sources of variation in outcomes.
I suspect level of experience is doing significant heavy lifting, because more experienced devs approach projects in a manner that avoids pitfalls from the get go.
AI = Absent Intelligence.
On the other hand colleagues working with react and next have better experience with agents.
Yes, LLMs are useful, but they are even less trustworthy than real humans, and one needs actual people to verify their output; so when agents write 100K lines of code, they'll make mistakes, extremely subtle ones, and not the kind of mistakes any human operator would make.
You have to be very good at writing tasks while being fully aware of what the one executing them knows and doesn't know. What agents can infer about a project themselves is even more limited than their context, so it's up to you to provide it. Most of them will have no or very limited "long-term" memory.
I've had good experiences with small projects using the latest models. But letting them sift through a company repo that has been worked on by multiple developers for years and has some arcane structures and sparse documentation - good luck with that. There aren't many simple instructions to be made there. The AI can still save you an hour or two of writing unit tests if they are easy to set up and really only need very few source files as context.
But just talking to some people makes it clear how difficult the concept of implicit context is. Sometimes it's like listening to a 4 year old telling you about their day. AI may actually be better at comprehending that sort of thing than I am.
One criticism I do have of AI in its current state is that it still doesn't ask questions often enough. One time I forgot to fill out the description of a task - but instead of seeing that as a mistake it just inferred what I wanted from the title and some other files and implemented it anyway. Correctly, too. In that sense it was the exact opposite of what OP was complaining about, but personally I'd rather have the AI assume that I'm fallible instead of confidently plowing ahead.
It's got a lot to do with problem framing and prompt imo.
I don't know what you are trying to say with your post. I mean, if two people feed their prompts to an agent, and one is able to reach their goals while the other fails to achieve anything, would it be outlandish to suggest one of them is using it right whereas the other is using it wrong? Or do you expect the output to not reflect the input at all?
For LLMs to be effective, you (or something else) need to constantly find the errors and fix them.
In equilibrium, the probability of leaving the Success state must equal the probability of entering it.
Let P(S) be the probability of being in Success and P(F) be the probability of being in Failure. Since P(S) + P(F) = 1, we can say P(F) = 1 - P(S). Substituting that in:
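(Completing the algebra, since the rest of the derivation is cut off here: write q for the per-step probability of leaving Success and p for the probability of returning to Success from Failure; these symbol names are assumed, not the original's. The balance condition is P(S) * q = (1 - P(S)) * p, which rearranges to P(S) = p / (p + q): the agent spends most of its time in the Success state only if fixes are much more likely than new breakages.)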
You still have to have a hand on the wheel, but it helps a fair bit.
Are they doing the same thing? Are they trying to achieve the same goals, but fail because one is lacking some skill?
One person may be someone who needs a very basic thing like creating a script to batch-rename his files, another one may be trying to do a massive refactoring.
And while the former succeeds, the latter fails. Is it only because someone doesn't know how to use agentic AI, or because agentic AI is simply lacking?
* strictness of the result - a personal blog entry vs a complex migration to reform a production database of a large, critical system
* team constraints - style guides, peer review, linting, test requirements, TDD, etc
* language, frameworks - a quick Node.js app vs a Java monolith, e.g.
* legacy - a 12+ year Django app vs a greenfield rust microservice
* context - complex, historical, nonsensical business constraints and flows vs a simple crud action
* example body - a simple CRUD TODO in PHP or JS, done a million times, vs an event-sourced, hexagonally architected, cryptographic signing system for government data.
For example, I was working on the same kind of change across a few dozen files. The prompt input didn't change, the work didn't change, but the "AI" got it wrong as often as it got it right. So was I "using it wrong" or was the "AI" doing it wrong half the time? I tried several "AI" offerings and they all had similar results. Ultimately, the "AI" wasted as much time as it saved me.
"You're using it wrong" and "It could work better than it does now" can be true at the same time, sometimes for the same reason.
1. The tool is capable of doing more than OP has been able to make it do
2. The tool is not capable of doing more than OP has been able to make it do.
If #1 is true, then... he must be using it wrong. OP specifically said:
> Please pour in your responses please. I really want to see how many people believe in agentic and are using it successfully
So, he's specifically asking people to tell him how to use it "right".