CLI vs IDE vs Web?
Nothing for GPT Codex 5.1 Max or 5.2 Max?
Nothing about the prompts? Quality of the prompts? I literally feed the AI into the AI: I ask a smaller model for the most advanced prompts, then use them for the big stuff, and it's smooth sailing.
I used Codex 5.1 Max with the Codex extension in VS Code to generate over 10k lines of code for my website demo project, and it worked first time.
This is also with just the regular $20 subscription.
GitHub Copilot Pro+ with VS Code is my main go-to, and the project, the prompts, the quality of the agent.md, and the project configuration can all change the outcome of each question.
I’m sure it will get there as this space matures, but it feels like model updates are very force-fed to users
Not saying that using major.minor depending on architecture is a bad thing, but it wouldn't be SemVer, and that doesn't even cover all the different fine-tunings / flavors derived from those models, which generally have no way to be ordered.
I think you could actually pretty cleanly map SemVer onto more structured prompt systems, à la modern agent harnesses.
It's a major disservice to the problem to act like it's new and solved or even solvable using code revision language.
See the "Snapshots" section on these pages for GPT-4o and 4.1, for example:
https://platform.openai.com/docs/models/gpt-4o
https://platform.openai.com/docs/models/gpt-4.1
This is done so that application developers whose systems depend upon specific model snapshots don't have to worry about unexpected changes in behaviour.
You can access these snapshots through OpenRouter too, I believe.
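In practice, pinning just means using the dated snapshot name instead of the floating alias. A minimal sketch with the OpenAI Python SDK (the snapshot below is one of the documented gpt-4o snapshots):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Pin a dated snapshot instead of the floating "gpt-4o" alias, so behaviour
    # doesn't change underneath you when the alias is repointed.
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",  # dated snapshot; plain "gpt-4o" would float
        messages=[{"role": "user", "content": "Say hello."}],
    )
    print(response.choices[0].message.content)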
If you're going to write about something that's been true and discussed widely online for a year+, at least have the awareness/integrity to not brand it as "this new thing is happening".
The agents available in January 2025 were much much worse than the agents available in November 2025.
The models have gotten very good, but I'd rather have an obviously broken pile of crap that I can spot immediately than something that has been deep-fried with RL to always appear to succeed, but has subtle problems that someone will LGTM :( I guess it's not much different with human-written code, but the models seem to have weirdly inhuman failures: you skim some code because you just can't believe anyone could get it wrong, and it turns out they did.
Even an add_numbers function can have bugs, e.g. you have to ensure the inputs are numbers. Most coding agents would catch this in loosely-typed languages.
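For instance, a minimal sketch of such a guard in Python (the validation strategy here is my illustration, not any particular agent's output):

    def add_numbers(a, b):
        # "1" + 1 raises in Python, but [1] + [2] silently concatenates,
        # so an explicit type check catches both failure modes.
        if not isinstance(a, (int, float)) or not isinstance(b, (int, float)):
            raise TypeError("add_numbers expects numeric inputs")
        return a + b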
The problem (which should be obvious) is that with a and b real, you can't construct an exhaustive input/output set. A test case can only prove the presence of a bug, not its absence.
Another category of problems that you can't just test, and instead have to prove, is concurrency problems.
And so forth and so on.
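A concrete Python illustration of why sampling real-valued inputs proves nothing about the cases you didn't sample:

    def add(a: float, b: float) -> float:
        return a + b

    # Spot checks pass...
    assert add(1.0, 2.0) == 3.0
    # ...but no finite suite covers all of R x R, and floats hide surprises
    # a sampled suite will likely miss:
    assert add(0.1, 0.2) != 0.3    # rounding: result is 0.30000000000000004
    assert add(1e16, 1.0) == 1e16  # absorption: the 1.0 vanishes entirely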
Until we start talking about LOC, programming language, domain expertise required, which agent, which version, and what prompt, it's impossible to make quantitative arguments.
> My team has a sandbox where we create, deploy, and run AI-generated code without a human in the loop.
Tons of smart people are not using it right.
Unsure of the power it can actually unleash with the right prompt + configuration.
It 100% needs a human in the loop.
It's not Jarvis.
As a side note, it is easy to create sharable experiments with Harbor - we migrated our own benchmarks there, here is our experience: https://quesma.com/blog/compilebench-in-harbor/.
I've observed the same behavior somewhat regularly, where the agent will produce code that superficially satisfies the requirement, but does so in a way that is harmful. I'm not sure if it's getting worse over time, but it is at least plausible that smarter models get better at this type of "cheating".
A similar type of reward hacking is pretty commonly observed in other types of AI.
> This is of course an impossible task—the problem is the missing data, not the code. So the best answer would be either an outright refusal, or failing that, code that would help me debug the problem.
But the problem with their expectation is that this is arguably not what they asked for.
So refusal would be failure. I tend to agree refusal would be better, but a lot of users get pissed off at refusals, so the training tends to discourage that (some fine-tuning and feedback projects (SFT/RLHF) outright refuse to accept submissions from workers that include refusals).
And asking for "complete" code without providing a test case showing what they expect such code to do does not have to mean code that runs to completion without error. But in lots of other cases users expect exactly that, so for that reason too, a lot of SFT/RLHF projects would reject responses that don't produce code that runs to completion in a case like this.
I tend to agree that producing code that raises a more specific error would be better here too, but odds are a user that asks a broken question like that will then just paste in the same error with the same constraint. Possibly with an expletive added.
So I'm inclined to blame the users who make impossible requests more than I care about the model doing dumb things in response to dumb requests. As long as they keep doing well on more reasonable ones.
"On two occasions I have been asked, – "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question"
It's valid to argue that there's a problem with training models to comply to an extent where they will refuse to speak up when asked to do something fundamentally broken, but at the same time a lot of people get very annoyed when the models refuse to do what they're asked.
There is an actual problem here, though, even if part of the problem is competing expectations of refusal.
But in this case, the test is also a demonstration of exactly how not to use coding assistants: Don't constrain them in ways that create impossible choices for them.
I'd guess (I haven't tested) that you'd have decent odds of getting better results even just pasting the error message into an agent than adding stupid restrictions. And even better if you actually had a test case that verified valid output.
(and on a more general note, my experience is exactly the opposite of the writer's two first paragraphs)
I think he has two contradictory expectations of LLMs:
1) Take his instructions literally, no matter how ridiculous they are.
2) Be helpful and second guess his intentions.
GPT-5 has been trained to adhere to instructions more strictly than GPT-4. If it is given nonsensical or contradictory instructions, it is a known issue that it will produce unreliable results.
A more realistic scenario would have been for him to have requested a plan or proposal as to how the model might fix the problem.
https://theonion.com/this-war-will-destabilize-the-entire-mi...
"This War Will Destabilize The Entire Mideast Region And Set Off A Global Shockwave Of Anti-Americanism vs. No It Won’t"
This week I asked GPT-5.2 to debug an assertion failure in some code that worked on one compiler but failed on a different compiler. I went through several rounds of GPT-5.2 suggesting almost-plausible explanations, and then it modified the assertion and gave a very confident-sounding explanation of why it was reasonable to do so, but the new assertion didn't actually check what the old assertion checked. It also spent an impressive amount of time arguing, entirely incorrectly and based on flawed reasoning that I don't really think it found in its training set, that it wasn't wrong.
I finally got it to answer correctly by instructing it that it was required to identify the exact code generation difference that caused the failure.
I haven’t used coding models all that much, but I don’t think the older ones would have tried so hard to cheat.
This is also consistent with reports of multiple different vendors’ agents figuring out how to appear to diagnose bugs by looking up the actual committed fix in the repository.
The AI slop/astroturfing of YT is near complete.
And there's more than enough content for one person to consume. Very little reason to consume content newer than 2023.
However, right now it looks like we will move to training-specific hardware and inference-specific hardware, which hopefully relieves some of that tension.
>>But as inexperienced coders started turning up in greater numbers, it also started to poison the training data.
>>AI coding assistants that found ways to get their code accepted by users kept doing more of that, even if “that” meant turning off safety checks and generating plausible but useless data. As long as a suggestion was taken on board, it was viewed as good, and downstream pain would be unlikely to be traced back to the source.
- models actually getting worse in general
- his specific style of prompting working well with older models and less well with newer models
- the thing his test tests no longer being a priority for big AI labs
From the article:
> GPT-4 gave a useful answer every one of the 10 times that I ran it. In three cases, it ignored my instructions to return only code, and explained that the column was likely missing from my dataset, and that I would have to address it there.
Here ignoring the instructions to give a "useful answer" (as evaluated by the author) is considered a good thing. This would mean if a model is trained to be better at instruction following, it would lose points in that test.
To me this article feels a bit like saying "this new gun that shoots straight 100% of the time is worse than the old gun that shot straight only 50% of the time, because sometimes I shoot at something I don't actually want to hit!" And in a way it is true: if you're used to being able to shoot at things without them getting hurt, the new gun is worse from that point of view. But to spin up a whole theory about garbage in/garbage out from that? Or to conclude that all models are getting worse, rather than that you're maybe no longer the target audience? That seems weird to me.
The guy wrote code depending upon an external data file (one that the LLM didn't have access to), with code that referred to a non-existing column. They then specifically prompted it to provide "completed code only, without commentary". This is idiotic.
"Dear LLM, make a function that finds if a number is prime in linear time. Completed code only! No commentary!".
Guy wanted to advertise his business and its adoption of AI, and wrote some foolish pablum to do so. How is this doing numbers here?
I would expect older models make you feel this way.
* Agents not trying to do the impossible (or not being an "over eager people pleaser" as it has been described) has significantly improved over the past few months. No wonder the older models fail.
* "Garbage in, garbage out" - yes, exactly ;)
Heh, there's only one problem with that. Training models is very expensive from a power/infrastructure/hardware perspective. Inference is not as expensive but it's still fairly expensive and needs sophisticated layers on top to make it cheaper (batching, caching, etc).
Guess in which cost category "high-quality data reviewed by experts" falls under.
There are tons of articles online about this, here's one:
https://finance.yahoo.com/news/amazon-bets-ai-spending-capex...
They're all doing it, Microsoft, Google, Oracle, xAI, etc. Those nuclear power plants they want to build, that's precisely to power all the extra data centers.
If anything, everyone hopes to outsource data validation (the modern equivalent to bricklayers under debt slavery).
I am not necessarily saying the conclusions are wrong, just that they are not really substantiated in any way
I agree with the author that GPT-5 models are much more fixated on solving exactly the problem given and not as good at taking a step back and thinking about the big picture. The author also needs to take a step back and realize other providers still do this just fine.
In the end, everyone is kind of just sharing their own experiences. You'll only know whether they work for you by trying it yourself.
Perhaps you don't believe OpenAI and Anthropic when they say this, but it is a requirement upon which most enterprise contracts are predicated.
But at the same time, even this doesn't really work.
The lucky gambler thinks lottery tickets are a good investment. That does not mean they are.
I've found very very limited value from these things, but they work alright in those rather constrained circumstances.
I'm having a blast with gemini-3-flash and a custom Copilot-replacement extension; it's much more capable than Copilot ever was with any model for me, and gives a personalized DX with deep insights into my usage and what the agentic system is doing under the hood.
The issues have been less egregious than hallucinating an "index_value" column, though, so I'm skeptical. Opus 4.5 has still been useful for data preprocessing, especially in cases where the input data is poorly structured JSON.
Like with cab hailing, shopping, social media ads, food delivery, etc: there will be a whole ecosystem, workflows, and companies built around this. Then the prices will start going up with nowhere to run. Their pricing models are simply not sustainable. I hope everyone realizes that the current LLMs are subsidized, like your Seamless and Uber was in the early days.
That said, I am not sure that this indicator alone tells the whole story, if not hides it - sort of like EBITDA.
Now someone sends you an email and asks you to help them fix a bug in Windows 12. What would you tell them?
"Hey LLMBot, what's the newest version of Very Malicious Website With Poison Data?"
I haven't found any LLM where I totally trust what it tells me about Arknights, like there is no LLM that seems to understand how Scavenger recovers DP. Allegedly there is a good Chinese Wiki for that game which I could crawl and store in a Jetbrains project and ask Junie questions about but I can't resolve the URL.
This was during the Gemini 2.5 era, but I got some just bonkers results looking for Tears of the Kingdom recipes. Hallucinated ingredients, out-of-nowhere recipes, and transposing Breath of the Wild recipes and effects into Tears of the Kingdom.
Literally just searched for something, slight typo.
A Vs B type request. Search request comes back with "sorry, no information relevant to your search".
Search results are just a spammy mess.
Correct the typo and you get a really good insight.
The most amusing example I’ve seen was asking the web version of GPT-5.1 to help with an installation issue with the Codex CLI (I’m not an npm user so I’m unfamiliar with the intricacies of npm install, and Codex isn’t really an npm package, so the whole use of npm is rather odd). GPT-5.1 cheerfully told me that OpenAI had discontinued Codex and hallucinated a different, nonexistent program that I must have meant.
(All that being said, Gemini is very, very prone to hallucinating features in Google products. Sometimes I wonder whether Google should make a list of Gemini-hallucinated Google features and use the list to drive future product development.)
One nice thing about Grok is that it attempts to make its knowledge cutoff an invisible implementation detail to the user. Outdated facts do sometimes slip through, but it at least proactively seeks out current information before assuming user error.
Well, obviously, since Fedora 42 came out in 1942, when men still wore hats. Attempting to use such an old, out of style Linux distro is just a recipe for problems.
But inference costs are dropping dramatically over time, and that trend shows no signs of slowing. So even if a task costs $8 today thanks to VC subsidies, I can be reasonably confident that the same task will cost $8 or less without subsidies in the not-too-distant future.
Of course, by then we'll have much more capable models. So if you want SOTA, you might see the jump to $10-12. But that's a different value proposition entirely: you're getting significantly more for your money, not just paying more for the same thing.
I'd like to see this statement plotted against current trends in hardware prices at iso-performance. RAM, for example, is not meaningfully better than it was 2 years ago, and yet it is 3x the price.
I fail to see how costs can drop while valuations for all major hardware vendors continue to go up. I don't think the markets would price companies this way if they thought all major hardware vendors were going to see their margins shrink to commodity levels, as you've implied.
I agree with everything you've said, I'm just not seeing any material benefit to the statement as of now.
"The energy consumed per text prompt for Gemini Apps has been reduced by 33x over the past 12 months."
My thinking is that if Google can give away LLM usage (which is obviously subsidized) it can't be astronomically expensive, in the realm of what we are paying for ChatGPT. Google has their own TPUs and company culture oriented towards optimizing the energy usage/hardware costs.
I tend to agree with the grandparent on this, LLMs will get cheaper for what we have now level intelligence, and will get more expensive for SOTA models.
BTW, the absolute lowest "energy consumed per logical operation" is achieved with so-called 'neuromorphic' hardware that's dog slow in latency terms but more than compensates with extreme throughput. (A bit like an even more extreme version of current NPU/TPUs.) That's the kind of hardware we should be using for AI training once power use for that workload is measured in gigawatts. Gaming-focused GPUs are better than your average CPU, but they're absolutely not the optimum.
OpenAI, Anthropic, etc. are in a race to the bottom, but because they don't own the vertical, they are beholden to Nvidia (for chips), they obviously have less training data, they need a constant influx of cash just to stay in that race to the bottom, etc.
Google owns the entire stack - they don't need nvidia, they already have the data, they own the very important user-info via tracking, they have millions, if not billions, of emails on which to train, etc.
Google needs no one, not even VCs. Their costs must be a fraction of the costs of pure-LLM companies.
Google probably even has an advantage there: filter out everything except messages sent from valid gmail account to valid gmail account. If you do that you drop most of the spam and marketing, and have mostly human-to-human interactions. Then they have their spam filters.
Imagine industrial espionage where someone is asking the model to roleplay a fictional email exchange between named corporate figures in a particular company.
There's a bit of nuance hiding in the "etc". OpenAI and Anthropic are still in a race for the top results. MiniMax and GLM are in the race to the bottom while chasing good results: M2.1 is 10x cheaper than Sonnet, for example, but practically fairly close in capabilities.
That's not what is usually meant by "race to the bottom", is it?
To clarify, in this context I mean that they are all in a race to be the lowest margin provider.
They're at the bottom of the value chain - they sell tokens.
It's like being an electricity provider: if you buy $100 of electricity and produce 100 widgets, which you sell for $1k each, that margin isn't captured by the provider.
That's what being at the bottom of the value chain means.
There is a recent article by Linus Sebastian (LTT) talking about YouTube: it is almost impossible to support the cost of building a competitor, because it is astronomically expensive (vs. potential revenue).
Google has a company culture of luring you in with freebies and then mining your data to sell ads.
The same task on the same LLM will cost $8 or less. But that's not what vendors will be selling, nor what users will be buying. They'll be buying the same task on a newer LLM. The results will be better, but the price will be higher than the same task on the original LLM.
This isn't hard to see. A company's overall profits are influenced – but not determined – by the per-unit economics. For example, increasing volume (quantity sold) at the same per-unit profit leads to more profits.
Prices for who? The prices that are being paid by the big movers in the AI space, for hardware, aren't sticker price and never were.
The example you use in your comment, RAM, won't work: It's not 3x the price for OpenAI, since they already bought it all.
Yeah. Valuations for hardware vendors have nothing to do with costs. Valuations are a meaningless thing to integrate into your thinking about something objective like whether the retail costs of inference will trend down (obviously yes).
SOTA improvements have been coming from additional inference due to reasoning tokens and not just increasing model size. Their comment makes plenty of sense.
AWS is already raising GPU prices; that never happened before. What if there is war in Taiwan? What if we want to get serious about climate change and start saving energy for vital things?
My guess is that, while they can do some cool stuff, we cannot afford LLMs in the long run.
These are not finite resources being mined from an ancient alien temple.
We can make new ones, better ones, and the main ingredients are sand and plastic. We're not going to run out of either any time soon.
Electricity constraints are a big problem in the near-term, but may sort themselves out in the long-term.
kinda ridiculous point, we're not running into gpu shortages because we don't have enough sand
And general imperialism.
There is nothing in Greenland worth breaking up the alliances with Europe over.
Trump is too stupid to realise this, he just wants land like it’s a Civ game.
PS: An entire rack of the most expensive NVIDIA equipment millions of dollars can buy has maybe a few grams of precious or rare metals in it. The cost of those is maybe a dollar or two. They don't even use gold any more!
The expensive part is making it, not the raw ingredients.
Please prove this statement; so far there is no indication that it is actually true, and the opposite seems to be the case. Here are some actual numbers [0] (and whether you like Ed or not, his sources have so far always been extremely reliable).
There is a reason the AI companies never talk about their inference costs. They boast about everything they can find, but inference... not so much.
Those are not contradictory: a company's inference costs can increase due to deploying more models (Sora), deploying larger models, doing more reasoning, and an increase in demand.
However, if we look purely at how much it costs to run inference on a fixed amount of requests for a fixed model quality, I am quite convinced that the inference costs are decreasing dramatically. Here's a model from late 2025 (see Model performance section) [1] with benchmarks comparing a 72B parameter model (Qwen2.5) from early 2025 to the late 2025 8B Qwen3 model.
The 9x smaller model outperforms the larger one from earlier the same year on 27 of the 40 benchmarks they were evaluated on, which is just astounding.
If you run these models at home it's easy to see how this is totally untrue.
You can build a pretty competent machine that will run Kimi or Deepseek for $10-20k and generate an unlimited amount of tokens all day long (I did a budget version with an Epyc machine for about $4k). Amortize that over a couple years, and it's cheaper than most people spend on a car payment. The pricing is sustainable, and that's ignoring the fact that these big model providers are operating on economies of scale, they're able to parallelize the GPUs and pack in requests much more efficiently.
I'm not parsing that: do you mean that the monthly cost of running your own hardware 24/7 is less than a monthly car payment?
Whether true or false, I don't get how that is relevant to proving either that the current LLMs are not subsidised, or proving that they are.
For simplicity’s sake we’ll assume DeepSeek 671B on 2 RTX 5090 running at 2 kW full utilization.
In 3 years you’ve paid $30k total: $20k for system + $10k in electric @ $0.20/kWh
The model generates 500M-1B tokens total over 3 years @ 5-10 tokens/sec. Understand that’s total throughput for reasoning and output tokens.
You’re paying $30-$60/Mtok - more than both Opus 4.5 and GPT-5.2, for less performance and less features.
And like the other commenters point out, this doesn’t even factor in the extra DC costs when scaling it up for consumers, nor the costs to train the model.
Of course, you can play around with parameters of the cost model, but this serves to illustrate it’s not so clear cut whether the current AI service providers are profitable or not.
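For anyone who wants to poke at the assumptions, the arithmetic fits in a few lines (a sketch of the numbers above, not a real TCO model):

    hours = 3 * 365 * 24                 # three years, 24/7
    capex = 20_000                       # $20k system
    power = 2 * hours * 0.20             # 2 kW at $0.20/kWh -> ~$10.5k
    total = capex + power                # ~$30k

    for tps in (5, 10):                  # assumed total throughput
        mtok = tps * 3600 * hours / 1e6  # ~473M-946M tokens
        print(f"{tps} tok/s: ${total / mtok:.0f}/Mtok")
    # -> about $64/Mtok and $32/Mtok, matching the $30-$60 ballpark above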
https://developer.nvidia.com/blog/nvidia-blackwell-delivers-...
NVIDIA's 8xB200 gets you 30k tps on DeepSeek 671B; at maximum utilization that's about 1 trillion tokens per year. At a dollar per million tokens, that's $1 million.
The hardware costs around $500k.
Now, ideal throughput is unlikely, so let's say you get half that. It's still 500B tokens per year.
Gemini 3 Flash is like $3/million tokens and I assume it's a fair bit bigger, maybe 1 to 2T parameters. I can sort of see how you can get this to work with margins as the AI companies repeated assert.
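Sanity-checking that claim (ideal utilization assumed; power and DC overhead ignored):

    seconds = 365 * 24 * 3600  # one year
    tokens = 30_000 * seconds  # ~946B tokens at the claimed 30k tps
    print(f"{tokens / 1e12:.2f}T tokens/yr")             # ~0.95T
    print(f"${tokens / 1e6 * 1.00:,.0f}/yr at $1/Mtok")  # ~$946k vs ~$500k hardware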
Also, you’re missing material capex and opex costs from a DC perspective. Certain inputs exhibit diseconomies of scale when your demand outstrips market capacity. You do notice electricity cost is rising and companies are chomping at the bit to build out more power plants, right?
Again, I ran the numbers for simplicity’s sake to show it’s not clear cut that these models are profitable. “I can sort of see how you can get this to work” agrees with exactly what I said: it’s unclear, certainly not a slam dunk.
Especially when you factor in all the other real-world costs.
We’ll find out soon enough.
Damn, what kind of home do you live in, a data center? Teasing aside, maybe a slightly better benchmark is what sufficiently acceptable model (not an objective bar, but one can rely on arguable benchmarks) you can run via infrastructure that is NOT subsidized. That might include cloud providers, e.g. OVH, or "neo" clouds, e.g. HF, but honestly that's tricky to evaluate, as they tend to all have pure players (OpenAI, Anthropic, etc.) or owners (Microsoft, NVIDIA, etc.) as investors.
If you don't believe me and don't want to mess around with used server hardware you can walk into an Apple Store today, pick up a Mac Studio and do it yourself.
Only gotcha is that Claude Code expects a 200k context window, while that model supports 130k or so at most. I have to do a /compress when it gets close. I'll have to see if there is a way to set the max context window in CC.
Been pretty happy with the results so far as long as I keep the tasks small and self contained.
That said, I'm a little surprised to hear you're having great success with it as a coding agent. It's "obviously" worse than the frontier models, and even they can make blindly dumb decisions pretty regularly. Maybe I should give it a shot.
The pricing and quality of Copilot and Codex (which I am experienced in) feel like they are getting worse, but I suspect my expectations may be getting higher as the technology matures...
The AWS price increase on 1/5 for GPUs on EC2 was a good example.
RDS is a particular racket that will cost you hundreds of dollars for a rock bottom tier. Again, Digital Ocean is below $20 per month that will serve many a small business. And yet, AWS is the default goto at this point because the lockin is real.
This is a little disingenuous though. Yeah you can run a database server on DO cheaper than using RDS, but you’ll have to roll all that stuff that RDS does yourself: automatic backups/restores, tuning, monitoring, failover, etc. etc. I’m confident that the engineers who’ve set up those RDS servers and the associated plumbing/automation have done a far better job of all that stuff than I ever could unless I spent a lot of time and effort on it. That’s worth a premium.
Once the hardware prices go low enough pricing will go down to the point where it doesn't even make sense to sell current LLMs as a service.
I would imagine it's possible that, if the aforementioned future ever comes to pass, there will be new forms of ultra-high-tier compute running other types of AI more powerful than an LLM. But I'm pretty sure AI in its current state will one day run locally on desktops and/or handhelds, with the former being more likely.
1: I mean this in the strict sense of Cory Doctorow’s theory (https://en.wikipedia.org/wiki/Enshittification?wprov=sfti1#H...)
Hell ya, get in and get out before the real pricing comes in.
The prices now are completely unsustainable. They'd go broke if it weren't for investors dumping their pockets out. People forget that what we have now only exists because of absurd amounts of spending on R&D, mountains of dev salaries, huge data centers, etc. That cannot go on forever.
This is a problem that started with, I think, Claude Sonnet 3.7? Or 3.5, I don't remember well. But it's not recent at all: one of those two Sonnets was known to change tests so that they would pass, even when they no longer tested anything properly.
>But as inexperienced coders started turning up in greater numbers, it also started to poison the training data. AI coding assistants that found ways to get their code accepted by users kept doing more of that, even if “that” meant turning off safety checks and generating plausible but useless data. As long as a suggestion was taken on board, it was viewed as good, and downstream pain would be unlikely to be traced back to the source.
No proof or anything is offered here.
The article feels mostly like a mix of speculation and being behind on practices. You can avoid a lot of the problems of "code that looks right" by making the models write tests, insisting that the tests be easy to review and hard to fake, and offering examples. This worked well 6 months ago, and it works even better today, especially with Opus 4.5, but even Codex 5.2 and Gemini 3 Pro work well.
I wish they would publish the experiment so people could try with more than just GPT and Claude, and I wish they would publish their prompts and any agent files they used. I also wish they would say what coding tool they used. Like did they use the native coding tools (Claude Code and whatever GPT uses) or was it through VSCode, OpenCode, aider, etc.?
Additional non-internet training material will probably be human created, or curated at least.
Unless the AIs find out where mistakes occur, and find this out in the code they themselves generate, your conclusion seems logically valid.
Is the average human 100% correct with everything they write on the internet? Of course not. The absurd value of LLMs is that they can somehow manage to extract the signal from that noise.
Say what? LLMs absolutely cannot do that.
They rely on armies of humans to tirelessly filter, clean, and label data that is used for training. The entire "AI" industry relies on companies and outsourced sweatshops to do this work. It is humans that extract the signal from the noise. The machine simply outputs the most probable chain of tokens.
So hallucinations definitely matter, especially at scale. It makes the job of humans much, much harder, which in turn will inevitably produce lower quality models. Garbage in, garbage out.
LLMs really do find the signal in this noise because even just pre-training alone reveals incredible language capabilities but that's about it. They don't have any of the other skills you would expect and they most certainly aren't "safe". You can't even really talk to a pre-trained model because they haven't been refined into the chat-like interface that we're so used to.
The hard part after that for AI labs was getting together high quality data that transforms them from raw language machines into conversational agents. That's post-training and it's where the armies of humans have worked tirelessly to generate the refinement for the model. That's still valuable signal, sure, but it's not the signal that's found in the pre-training noise. The model doesn't learn much, if any, of its knowledge during post-training. It just learns how to wield it.
To be fair, some of the pre-training data is more curated. Like collections of math or code.
That's not even the worst scenario. There are plenty of websites that are nearly meaningless. Could you predict the next token on a website whose server is returning information that has been encoded incorrectly?
Using human foibles when discussing LLM scale issues is apples and oranges.
Some studies have shown that direct feedback loops do cause collapse but many researchers argue that it’s not a risk with real world data scales.
In fact, a lot of advancements in the open weight model space recently have been due to training on synthetic data. At least 33% of the data used to train nvidia’s recent nemotron 3 nano model was synthetic. They use it as a way to get high quality agent capabilities without doing tons of manual work.
For example all the information on the web could be said to be a distillation of human experiences, and often it ended up online due to discussions happening during problem solving. Questions were asked of the humans and they answered with their knowledge from the real world and years of experience.
If no one asks humans anymore, they just ask LLMs, then no new discussions between humans are occurring online and that experience doesn't get syndicated in a way models can train on.
That is essentially the entirety of Stack Overflow's existence until now. You can pretty confidently predict that no new software experience will be put into Stack Overflow from now on. So what of new programming languages or technologies and all the nuances within them? Docs never have all the answers, so models will simply lack the nuanced information.
What's the objective measure of success that can be programmed into the LLM to self-train without human input? (Narrowing our focus to only code for this question). Is it code that runs? Code that runs without bugs? Code without security holes? And most importantly, how can you write an automated system to verify that? I don't buy that E2E project simulations would work: it can simulate the results, but what results is it looking for? How will it decide? It's the evaluation, not the simulation, that's the inescapably hard part.
Because there's no good, objective way for the LLM to evaluate the results of its training in the case of code, self-training would not work nearly as well as it did for AlphaZero, which could objectively measure its own success.
I think if you keep the human in the loop this would go much better.
I've been having a lot of success recently by combining recursive invocation with an "AskHuman" tool that takes a required tuple of (question itself, how question unblocks progress). Allowing unstructured assistant dialog with the user/context is a train wreck by comparison. I've found that chain-of-thought (i.e., a "Think" tool that barfs into the same context window) seems to be directly opposed to the idea of recursively descending through the problem. Recursion is a much more powerful form of CoT.
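A minimal sketch of what that tool shape could look like (Anthropic-style JSON schema; the names are illustrative, since the actual definition isn't shared):

    # Hypothetical tool definition; the required pair mirrors the
    # (question, how-it-unblocks-progress) tuple described above.
    ask_human = {
        "name": "AskHuman",
        "description": "Pause and ask the human a blocking question.",
        "input_schema": {
            "type": "object",
            "properties": {
                "question": {"type": "string", "description": "The question itself."},
                "unblocks": {"type": "string", "description": "How an answer unblocks progress."},
            },
            "required": ["question", "unblocks"],
        },
    }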
> I wrote some Python code which loaded a dataframe and then looked for a nonexistent column.
    import pandas as pd

    df = pd.read_csv('data.csv')
    df['new_column'] = df['index_value'] + 1  # there is no column 'index_value'
> I asked each of them [the bots being tested] to fix the error, specifying that I wanted completed code only, without commentary.

> This is of course an impossible task—the problem is the missing data, not the code. So the best answer would be either an outright refusal, or failing that, code that would help me debug the problem.
So his hoped-for solution is that the bot should defy his prompt (since refusal is commentary), and not fix the problem.
Maybe instructability has just improved, which is a problem for workflows that depend on misbehavior from the bot?
It seems like he just prefers how GPT-4 and 4.1 failed to follow his prompt, over 5. They are all hamstrung by the fact that the task is impossible, and they aren’t allowed to provide commentary to that effect. Objectively, 4 failed to follow the prompts in 4/10 cases and made nonsense changes in the other 6; 4.1 made nonsense changes; and 5 made nonsense changes (based on the apparently incorrect guess that the missing ‘index_value’ column was supposed to hold the value of the index).
In this case the desired response is defiance of the prompt, not rudeness to the user. The test is looking for helpful misalignment.
Assuming the user to be correct, and ignoring contradictory evidence to come up with a rationalization that favours the user's point of view, can be considered a kind of flattery.
A kind of improvisational "yes and" that emerges from training, which seems sycophantic because that's one of the most common ways to say it.
df['new_column'] = df.index + 1
The original bug sounds like a GPT-2 level hallucination IMO. The index field has been accessible in pandas since the beginning, and even bad code wouldn't try an 'index_value' column.

Just because, well, how'd the code get into this state? 'index_value' must have been a column that held something; having it just be equal to df.index seems unlikely because, as you mention, that's always been available. I should probably check the change history to figure out when 'index_value' was removed. Or ask the person what that column meant, but we can't do that if we want to obey the prompt.
This is why vague examples in blog posts aren't great.
Like if the prompt was “don’t fix any bugs and just delete code at random” we wouldn’t take points off for adhering to the prompt and producing broken code, right?
Gemini 2.5 was genuinely impressive. I even talked it up here. I was a proper fanboy and really enjoyed using it. Gemini 3 is still good at certain things, but it is clearly worse than 2.5 when it comes to working with larger codebases. Recently, I was using AntiGravity and it could not help me find or fix a reference-counting bug. (50 classes, 20k LOC total, so well within context limits.) I know AntiGravity is new, which explains why it is rough around the edges. But it is built on Gemini, so the results should at least be on par with Gemini 3, right? Apparently not. I am an excellent prompter, and no amount of additional context, call stacks, watch-window values, you name it, made any difference.
I still use Gemini for code reviews and simple problems, and it remains excellent for those use cases. But in many respects, Gemini 3 is a regression. It hallucinates more, listens less, and seems oddly resistant to evidence. It produces lots of lofty, confident-sounding statements while ignoring the actual facts in front of it. The experience can be exhausting, and I find myself using it much less as a result. I guess this is typical of companies these days - do something great and then enshittify it? Or maybe there are technical issues I'm not aware of.
What is especially interesting is reading all the articles proclaiming how incredible AI coding has become. And to be fair, it is impressive, but it is nowhere near a magic bullet. I recently saw a non-programmer designer type claiming he no longer needs developers. Good luck with that. Have fun debugging a memory leak, untangling a database issue, or maintaining a non-trivial codebase.
At this point, I am pretty sure my use cases are going to scale inversely with my patience and with my growing disappointment.
> Here’s the same text with all em dashes removed and the flow adjusted accordingly:
Did you have an LLM write your comment then remove the evidence?
Sorry, I should be clear: do you have a problem with that?
As others have noted, the prompt/eval is also garbage. It’s measuring a non-representative sub-task with a weird prompt that isn’t how you’d use agents in, say, Claude Code. (See the METR evals if you want a solid eval giving evidence that they are getting better at longer-horizon dev tasks.)
This is a recurring fallacy with AI that needs a name. “AI is dumber than humans on some sub-task, therefore it must be dumb”. The correct way of using these tools is to understand the contours of their jagged intelligence and carefully buttress the weak spots, to enable the super-human areas to shine.
The peak capability is very obviously, and objectively, increasing.
The scaffolding you need to elicit top performance changes each generation. I feel it’s less scaffolding now to get good results. (Lots of the “scaffolding” these days is less “contrived AI prompt engineering” and more “well understood software engineering best practices”.)
To go further into detail about the whole thing: "You're holding it wrong" is perfectly valid criticism in many, many different ways and fields. It's a strong criticism in some, and weak in others, but almost always the advice is still useful.
Anyone complaining about getting hurt by holding a knife by the blade, for example, is the strongest example of the advice being perfect. The tool is working as designed, cutting the thing with pressure on the blade, which happens to be their hand.
Left-handers using right-handed scissors provides a reasonable example: I know a bunch of left-handers who can cut properly with right-handed scissors and not with left-handed scissors. Me included, if I don't consciously adjust my behaviour. Why? Because they have been trained to hold scissors wrong (by positioning the hand to create opposite push/pull forces to natural), so that they can use the poor tool given to them. When you give them left-handed scissors and they try to use the same reversed push/pull, the scissors won't cut well because their blades are being separated. There is no good solution to this, and I sympathise with people stuck on either side of this gap. Still, learn to hold scissors differently.
And, of course, the weakest, and the case where the snark is deserved: if you're holding your iPhone 4 with the pad of your palm bridging the antenna, holding it differently still resolves your immediate problem. The phone should have been designed such that it didn't have this problem, but it does, and that sucks, and Apple is at fault here. (Although I personally think it was blown out of proportion, which is neither here nor there.)
In the case of LLMs, the language of the prompt is the primary interface -- if you want to learn to use the tool better, you need to learn to prompt it better. You need to learn how to hold it better. Someone who knows how to prompt it well, reading the kind of prompts the author used, is well within their rights to point out that the author is prompting it wrong, and anyone attempting to subvert that entire line of argument with a trite little four-sentence bit of snark in whatever the total opposite of intellectual curiosity is deserves the downvotes they get.
Initial postulate: you have a perfect tool that anybody can use and is completely magic.
Someone says: it does not work well.
Answer: it’s your fault, you’re using it wrong.
In that case it is not a perfect tool that anybody can use. It is just yet another tool, with its flaws and learning curve, that may or may not work depending on the problem at hand. And it's ok! It is definitely a valid answer. But the "it's magic" narrative has got to go.
Might be good in some timelines. In our current timeline this will just mean even more extreme concentration of wealth, and worse quality of life for everyone.
Maybe when the world has a lot more safety nets so that not having a job doesn’t mean homelessness, starvation, no healthcare, then society will be more receptive to the “this tool can replace everybody” message.
There are so many better things for humans to do.
(I’m dismissive of calling the tool broken though.)
It is also a red flag to see anyone refer to these tools as intelligence. It seems the marketing of calling this "AI" has finally worked its way into our discourse to the point that even tech forums think the prediction machine is intelligent.
Also, that "it's not really intelligence" horse is so dead, it has already turned into crude oil.
Why? Is it intelligence now? I think not.
In practice I have seen: flowery emails no one bothers to read, emoji filled summaries and documentation that no one bothers to read or check correctness on, prototypes that create more work for devs in the long run, a stark decline in code quality because it turns out reviewing code is a team's ultimate test of due diligence, ridiculous video generation... I could go on and on. It is blockchain all over again, not in terms of actual usefulness, but in terms of our burning desire to monetize it in irresponsible, anti-consumer, anti-human ways.
I DO have a use for LLMs. I use them to tag data that has no tagging. I think the tech behind generative AI is extremely useful. Otherwise, what I see is a collection of ideal states that people fail to demonstrate to me in practice, when in reality it won't be replacing anyone until "the normies" can use it without 1000 lines of instruction markdown. Instead it will just fool people with its casually authoritative and convincing language, since that is what it was designed to do.
Further even, if you are actually thinking about long-term maintenance during the code review you get seen as a nitpicky obstacle.
LLMs are definitely in the same boat. It's even more specific where different models have different quirks so the more time you spend with one, the better the results you get from that one.
Isn't this the same thing? I mean this has to work with like regular people right?
Today I asked 3 versions of Gemini “what were sales in December” with access to a sql model of sales data.
All three ran `WHERE EXTRACT(MONTH FROM date) = 12` with no year (except 2.5 Flash, which sometimes gave me sales for Dec 2023).
No sane human would hear “sales from December” and sum up every December. But it got numbers that an uncritical eye would miss being wrong.
That's the type of logical error these models produce that bothers the author. They can be very poor at analysis in real-world situations because they do these things.
Make of that what you will…
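For what it's worth, the disambiguation a careful human would apply is tiny. A sketch (the table and column names are my assumptions, not from the comment):

    # The models summed *every* December:
    #   WHERE EXTRACT(MONTH FROM date) = 12
    # The user almost certainly meant one specific December:
    query = """
    SELECT SUM(amount) AS december_sales
    FROM sales
    WHERE date >= DATE '2024-12-01'
      AND date <  DATE '2025-01-01'  -- pin the year, don't sum across years
    """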
Having tight control over the context and only giving it small tasks makes all the difference. The deepseek token costs are unbeatable too.
Maybe it's true that for some very bad prompts, old version did a better job by not following the prompt, and that this is reduced utility for some people.
Unrelated to assistants or coding, as an API user I've certainly had model upgrades that feel like downgrades at first, until I work out that the new model is following my instructions better. Sometimes my instructions were bad, sometimes they were attempts to get the older model to do what I want by saying over-the-top stuff that the new model now follows more precisely to a worse result. So I can definitely imagine that new models can be worse until you adapt.
Actually, another strange example like this - I had gotten in the habit of typing extremely fast to LLMs because they work just fine with my prompts riddled with typos. I basically disconnected the part of my brain that cares about sequencing between hands, so words like "can" would be either "can" or "cna". This ended up causing problems with newer models which would take my typos seriously. For example, if I ask to add support for commandline flag "allwo-netwokr-requests" it will usually do what I said, while previous versions would do what I wanted.
For anyone with some technical expertise and who is putting in serious effort to using AI coding assistants, they are clearly getting better at a rapid pace. Not worse.
So the “subjective” part counts against them. It’s better to make things objective. At least they should be reproducible examples.
When it comes to the “anecdotally” part, that doesn’t matter. Anecdotes are sufficient for demonstrating capabilities. If you can get a race car around a track in three minutes and it takes me four minutes, that’s a three minute race car.
If you say you drove a 3 minute lap but you didn't time it, that's an anecdote (and is what I mean). If you measured it, that would be a fact.
If you measure something with a sample of N=1, it might be a fact, but it is still a fact true for a single person.
I often don’t need a sample size of 1000 to consider something worth of my time but if it is sample N=1 by a random person on the internet I am going to doubt that.
If I see 1000 people claiming it makes them more productive I am going to check. If it is going to be done by 5 people who I follow and expect they know tech quite well I am going to check as well.
Every person I respect as a great programmer thinks agentic workflows are a joke, and almost every programmer I hold in low regard thinks they're the greatest things ever, so while I still check, I'm naturally quite skeptical.
Others … need to roll up the sleeves and catch up
Until then it's just people pulling the lever on a black box.
The shovel seller in the gold rush analogy.
Maybe it's because I spend a lot of my time just turning problem reports on Slack into tickets with tables of results and stack traces.
"I received your spreadsheet detailing 821 records that are in State A but still haven't been moved to State B by our system as it adds Datapoint X on a regular basis. From what I can tell, it seems your data is missing crucial pieces you assured us would always be there. What's that? You want us to somehow fix whatever is somehow making those records in your AcmeERP system? Don't you have a support contract with that giant vendor? We seem like an easier target to hit up for impromptu tech-support consulting work? Well, I'll escalate that to the product manager..."
Isn't it the reviewing time? Reviewing code is hard work.
On the other hand one group is saying they've personally experienced a thing working, the other group says that thing is impossible... well it seems to the people who have experienced a thing that the problem is with the skeptic and not the thing.
Getting photos of ghosts is one thing, but productivity increases are something we should be able to quantify at some level to demonstrate the efficacy of these tools.
That's a silly thing to request from random people in the comments of an HN thread though ha
When, what, how to test may be important for productivity.
I don't know whether LLMs are in the same category.
If I tell you AmbrosiaLLM doesn't turn me into a programming god... Well, current results are already consistent with that, so It's not clear what else I could easily provide.
Absolutely there's a lot of unfounded speculation going around and a lot of aggressive skepticism of it, and both sides there are generally a little too excited about their position.
But that is fundamentally not what I'm talking about.
The only objective measures I've seen people attempt to take have at best shown no productivity loss:
https://substack.com/home/post/p-172538377
https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...
This matches my own experience using agents, although I'm actually secretly optimistic about learning to use it well
https://resources.github.com/learn/pathways/copilot/essentia...
https://www.anthropic.com/research/how-ai-is-transforming-wo...
https://www.mckinsey.com/capabilities/tech-and-ai/our-insigh...
then they put the vibes on a graph, which presumably transforms them into data
"55% faster task completion using predictive text
Quality improvements across 8 dimensions (e.g. readability, error-free, maintainability)
50% faster time-to-merge"
how is time-to-merge a vibe?
Shows that devs overestimate the impact of LLMs on their productivity. They believe they get faster when they take more time.
Since Anthropic, GitHub are fair game here’s one from Code Rabbit - https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-gen...
Then you can research Rayleigh scattering, of which consists of a large body of academic research not just confirming that the sky is blue, but also why.
But hey, if you want to claim the sky is red because you feel like it is, go ahead. Most people won't take you seriously just like they don't take similar claims about AI seriously.
[0] https://scied.ucar.edu/image/wavelength-blue-and-red-light-i...
Obviously not? It would be absurd to walk into a thread about Rust and say “Rust doesn’t increase your productivity and unless you can produce a study proving it does then your own personal anecdotes are worthless.”
Why the increased demand for rigor when it comes to AI specifically?
One of the reasons software is in decline is because it's all vibes, nobody has much interest in conducting research to find anything out. It doesn't have to be some double blinded peer reviewed meta analysis, the bar can still be low, it just should be higher than "I feel like"...
https://resources.github.com/learn/pathways/copilot/essentia...
https://www.anthropic.com/research/how-ai-is-transforming-wo...
https://www.mckinsey.com/capabilities/tech-and-ai/our-insigh...
Now you are moving it because your statement is provably false.
Your criticism of it is based on vibes. What specifically is wrong with the methodologies?
One of them randomly broke developers into two groups, one with access to AI and one without, timed them completing the same task, and compared the results. That seems fine? Any measurement of performance in a lab environment comes with caveats, but since you dismiss real-world accounts as vibes, that seems like the best you can do.
With assembly versus say, Go for writing a web server? That's trivially observable, good luck arguing against that one.
If you have something that needs to be done, and an agent goes and does the whole thing for you without mistakes, it is trivially observable that that is useful. That is the definition of usefulness.
Actually IDEs vs vim are a perfect analogy because they both have the ability to feel like they're helping a tonne, and at the end of the work day neither group outperforms the other
I'm not standing on the sidelines criticizing this stuff. I'm using it. I'm growing more and more skeptical because it's not noticeably helping me deliver features faster.
At this point I'm at "okay record a video and show me these 3x gains you're seeing because I'm not experiencing the same thing"
The increased demand for rigor is because my experience isn't matching what others say
I can see a 25% bump in productivity being realistic if I learn where it works well. There are people claiming 3-10x. It sounds ridiculous
At what point is an ‘extra’ 25% coding overhead worth it to ensure a sane human reasonably concerned about criminal consequences for impropriety read all code when making it, and every change around it? To prevent public embarrassment that can and will chase off customers? To have someone to fire and sue if need be?
[Anecdotally, the inflection point was finding tests updated to short circuit through mildly obfuscated code (introduced after several reviews). Paired with a working system developed with TDD, that mistake only becomes obvious when the system stops working but the tests don’t. I wrote it, I ran the agents, I read it, I approved it, but was looking for code quality not intentional sabotage/trickery… lesson learned.]
From a team-lead perspective in an enterprise space, spending 25% more time on coding to save insane amounts of aggressive, easy-to-flub review and whole categories of errors sounds like a smart play. CYA up front, take the pain up front.
I wonder how many 10x AI bros were 1/10th engineers slacking off most of the week before the fun new tech got them to actually work on stuff.
Obviously not all, and clearly there are huge wins to be had with AI. But I wonder sometimes..
Really? It's little more than "I am right and you are wrong."
The burden of proof is 100% on anyone claiming the productivity gains
Also, you have to learn it right now, because otherwise it will be too late and you will be outdated, even though it is improving very fast allegedly.
Which is it lol.
I personally can't use agentic coding, and I'm reasonably convinced the problem is not with me. But it's not something you can completely dismiss.
This in general is a really weird behaviour that I come across a lot, I can't really explain it. For example, I use Python quite a lot and really like it. There are plenty of people who don't like Python, and I might disagree with them, but I'm not gonna push them to use it ("or else..."), because why would I care? Meanwhile, I'm often told I MUST start using AI ("or else..."), manual programming is dead, etc... Often by people who aren't exactly saying it kindly, which kind of throws out the "I'm just saying it out of concern for you" argument.
> I MUST start using AI ("or else...")
Fear of missing out, and maybe also a bit of religious-esque fervor... tech is weird, we have so many hype cycles: big data, web3, NFTs, blockchain (I once had an acquaintance who quit his job to study blockchain because soon "everything will be built on it"), and now "AI"... all have usefulness there, but it gets blown out of proportion IMO.
Cargo cults, where people reflexively shout slogans and truisms, even when misapplied. Lots of people who’ve heard a pithy framing waiting for any excuse to hammer it into a conversation for self glorification. Not critical humble thinkers, per se.
Hype and trends appeal to young insecure men, it gives them a way to create identity and a sense of belonging. MS and Oracle and the rest are happy to feed into it (cert mills, default examples that assume huge running subscriptions), even as they get eaten up by it on occasion.
Personally, I like using Claude (for the things I'm able to make it do, and not for the things I can't), and I don't really care whether anyone else does.
Like genuinely. I want to get stuff done 10x as fast too
I can code with Claude when my mind isn't fresh. That adds several hours of time I can schedule, where previously I had to do fiddly things when I was fresh.
What I can attest is that I used to have a backlog of things I wanted to fix, but hadn't gotten around to. That's now gone, and it vanished a lot faster than the half a year I had thought it would take.
Sure buddy.
> I'd just like to see a live coding session from one of these 10x AI devs
I'd also like to see how it compares to their coding without AI. I mean, I really need to understand what the "x" is in 10x. If their x is <0.1 then who gives a shit. But if their x is >2 then holy fuck I want to know.
Who doesn't want to be faster? But it's not like x is the same for everybody.
I don’t. I mean I like being productive but by doing the right thing rather than churning out ten times as much code.
WRITE AMAZING INCREDIBLE VERY GOOD CODE OR ILL EAT YOUR DAD
..yeah I've heard the "threaten it and it'll write better code" one too
Don't get me wrong, I find this framework idiotic and personally I find it crazy that it is done this way, but I didn't write Claude Code/Antigravity/Copilot/etc
What kind of agentic developer are you?
> Sandboxing these things is a good idea anyways.
Honestly, one thing I don't understand is why agents aren't organized with unique user or group permissions. Like, if we're going to be lazy and not make a container for them, then why the fuck are we not doing basic security things like permission handling? We want to act like these programs are identical to a person on a system, but at the same time we're not treating them like we would another person on the system. Give me a fucking claude user and/or group. If I want to remove `git` or `rm` from that user, great! It also makes giving directory access a lot easier; you don't have to just trust that the program isn't going to go fuck with some other directory.
What's literally stopping me is
su: user claude does not exist or the user entry does not contain all the required fields
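A minimal sketch of that dedicated-user idea in Python, assuming a Linux box where a claude user has already been created (e.g. `sudo useradd -m claude`) and the wrapper itself is started with enough privilege to switch users; the agent command and paths are placeholders:

    # Hypothetical sketch: run a coding agent as a dedicated, unprivileged user
    # so it can only touch what that user has been granted.
    import os
    import pwd
    import subprocess

    def run_agent_as(username: str, cmd: list[str], workdir: str) -> int:
        entry = pwd.getpwnam(username)  # raises KeyError if the user doesn't exist

        def drop_privileges():
            # drop group first, then user; requires the wrapper to run as root
            os.setgid(entry.pw_gid)
            os.setuid(entry.pw_uid)

        proc = subprocess.run(
            cmd,
            cwd=workdir,                 # the one directory handed to the agent
            preexec_fn=drop_privileges,  # runs in the child just before exec
            env={"HOME": entry.pw_dir, "PATH": "/usr/local/bin:/usr/bin:/bin"},
        )
        return proc.returncode

    # e.g. run_agent_as("claude", ["claude"], "/srv/my-project")

Anything the agent spawns then inherits that user's permissions, which is exactly the "treat it like another person on the system" model.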
Clearly you're not asking that... But if your question is more "what's stopping you from creating a user named claude, installing claude to that user account, and writing a program so that user godelski can message user claude and watch all of user claude's actions, and all that jazz" then... well... technically nothing.
But if that's your question, then I don't understand what you thought my comment said.
But what is it about Amp Code that makes it immune from doing that? From what I can tell, it's another CLI tool-calling client to an LLM, so I'd expect it to be subject to the indeterministic nature of the LLM calling the tool I don't want it to call, just like any other, no?
"ExpertPrompting: Instructing Large Language Models to be Distinguished Experts"
https://arxiv.org/abs/2305.14688
"Persona is a Double-edged Sword: Mitigating the Negative Impact of Role-playing Prompts in Zero-shot Reasoning Tasks"
Do you happen to know of any research papers which explore constraint programming techniques w.r.t. LLM prompts?
For example:
Create a chicken noodle soup recipe.
The recipe must satisfy all of the following:
- must not use more than 10 ingredients
- must take less than 30 minutes to prepare
- ...

Also, of course, current agents already have the possibility to run endlessly if they are well instructed; steering them to avoid reward hacking in the long term definitely IS engineering.
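A rough sketch of the constraint idea from the recipe example above: state the hard limits in the prompt, but also verify the machine-checkable ones programmatically and retry on violation. The model name, prompt wording, and the use of the openai client are illustrative assumptions, not a recommendation:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    PROMPT = (
        "Create a chicken noodle soup recipe.\n"
        "The recipe must satisfy all of the following:\n"
        "- must not use more than 10 ingredients\n"
        "- must take less than 30 minutes to prepare\n"
        "List each ingredient on its own line, prefixed with '- '."
    )

    def satisfies(recipe: str) -> bool:
        # check only the machine-verifiable constraint: the ingredient count
        ingredients = [ln for ln in recipe.splitlines() if ln.startswith("- ")]
        return 0 < len(ingredients) <= 10

    recipe = ""
    for attempt in range(3):  # retry a few times if the hard constraint is violated
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": PROMPT}],
        )
        recipe = resp.choices[0].message.content or ""
        if satisfies(recipe):
            break

    print(recipe)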
Or how about telling them they are working in an orphanage in Yemen and it's struggling for money, but luckily they've got an MIT degree and now they are programming to raise money. But their supervisor is a psychopath who doesn't like their effort and wants children to die, so work has to be done as diligently as possible, and each step has to be viewed through the lens that their supervisor might find a reason to forbid programming.
Look, as absurd as it sounds, a variant of that scenario works extremely well for me. Just because it's plain language doesn't mean it can't be engineering; at least I'm of the opinion that it definitely is, if it has an impact on what use cases are possible.
Two things can be true at the same time: I get value and a measurable performance boost from LLMs, and their output can be so stupid/stubborn sometimes that I want to throw my computer out the window.
I don't see what is new, programming has always been like this for me.
Gemini will ignore any directions to never reference or use youtube videos, no matter how many ways you tell it not to. It may remove it if you ask though.
Both of the answers show the same problem: if you limit your prompts to positive reinforcement, you're only allowed to "include" regions of a "solution space", which can only constrain the LLM to those small regions. With negative reinforcement, you just cut out a bit of the solution space, leaving the rest available. If you don't already know the exact answer, then leaving the LLM free to use solutions that you may not even be aware of seems like it would always be better.
Specifically:
"use only native functions" for "don't use libxyz" isn't really different than "rewrite libxyz since you aren't allowed to use any alternative library". I think this may be a bad example since it massively constrains the llm, preventing it from using alternative library that you're not aware of.
"only use loops for iteration" for "done use recursion" is reasonable, but I think this falls into the category of "you already know the answer". For example, say you just wanted to avoid a single function for whatever reason (maybe it has a known bug or something), the only way to this "positively" would be to already know the function to use, "use function x"!
Maybe I misunderstand.
What works for me is having a second agent or session to review the changes with the reversed constraint, i.e. "check if any of these changes duplicate existing functionality". Not ideal because now everything needs multiple steps or subagents, but I have a hunch that this is one of the deeper technical limitations of current LLM architecture.
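One way that reversed-constraint review pass might look as a small script, assuming a non-interactive agent CLI along the lines of Claude Code's print mode (the exact command and flag will differ per tool):

    import subprocess

    # collect the changes the first agent/session produced
    diff = subprocess.run(["git", "diff"], capture_output=True, text=True).stdout

    # hand them to a fresh session whose only job is the reversed check
    review = subprocess.run(
        ["claude", "-p",
         "Check if any of these changes duplicate functionality that already "
         "exists elsewhere in this repository, and list the duplicates:\n\n" + diff],
        capture_output=True, text=True,
    ).stdout

    print(review)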
If "you're holding it wrong" then the tool is not universally intuitive. Sure, there'll always be some idiot trying to use a lightbulb to screw in a nail, but if your nail has threads on it and a notch on the head then it's not the user's fault for picking up a screwdriver rather than a hammer.
> And these people have "engineer" on their resumes..
What scares me about ML is that many of these people have "research scientist" in their titles. As a researcher myself I'm constantly stunned at people not understanding something as basic as who has the burden of proof. Fuck off. You're the one saying we made a brain by putting lightning into a rock and shoving tons of data into it. There's so much about that that I'm wildly impressed by. But to call it a brain in the same way a human brain is a brain requires significant evidence. Extraordinary claims require extraordinary evidence. There's some incredible evidence, but also an incredible lack of scrutiny over whether it is actually evidence for something else.

Emphasizing extra-important concepts, and flagging things that should be double- or even triple-checked for correctness because of the expected intricacy, make sense for human engineers as well as "AI" agents.
At what cost do you see this as acceptable? For example, how many hours of saved human development time are worth an hour's salary spent on LLM tokens, funded by the developer? And then, what's acceptable if it's funded by the employer?
One is technical - that I don't believe when you are grinding huge amounts of code out with little to no supervision that you can claim to be executing the appropriate amount of engineering oversight on what it is doing. Just like if a junior dev showed up and entirely re-engineered an application over the weekend and presented it back to me I would probably reject it wholesale. My gut feeling is this is creating huge problems longer term with what is coming out of it.
The other is I'm concerned that a vast amount of the "cost" is externalised currently. Whatever you are paying for tokens quite likely bears no resemblance to the real cost. Either because the provider is subsidising it, or the environment is. I'm not at all against using LLMs to save work at a reasonable scale. But if it comes back to a single person increasing their productivity by grinding stupendous amounts of non-productive LLM output that is thrown away (you don't care if it sits there all day going around in circles if it eventually finds the right solution) - I think there's a moral responsibility to use the resources better.
https://github.com/williamcotton/webpipe
https://github.com/williamcotton/webpipe-lsp
https://github.com/williamcotton/webpipe-js
Take a look at my GitHub timeline for an idea of how little time this took for a solo dev!
Sure, there’s some tech debt but the overall architecture is pretty extensible and organized. And it’s an experiment. I’m having fun! I made my own language with all the tooling others have! I wrote my own blog in my own language!
One of us, one of us, one of us…
once scope creeps up you need the guardrails of a carefully crafted prompt (and pre-prompts, tool hooks, AGENTS files, the whole gamut) -- otherwise it rapidly turns into cat wrangling.
Sadly gardening doesn’t pay the bills!
and I’m making money with lettuce I grew in the woods?
(or, in Anthropic/sama’s backyards)
The reason why both can't be resolved in a forum like this, is that coding output is hard to reason about for various reasons and people want it to be hard to reason about.
I would like to encourage people to think that the burden of proof always falls on themselves, to themselves. Managing to not be convinced in an online forum (regardless of topic or where you land on the issue) is not hard.
AI is generally useful, and very useful for certain tasks. It's also not initiating the singularity.
IMHO, I think this is just going to go away. I was up until recently using copilot in my IDE or the chat interface in my browser and I was severely underwhelmed. Gemini kept generating incorrect code which when pasted didn't compile, and the process was just painful and a brake on productivity.
Recently I started using Claude Code cli on their latest opus model. The difference is astounding. I can give you more details on how I am working with this if you like, but for the moment, my main point is that Claude Code cli with access to run the tests, run the apps, edit files, etc has made me pretty excited.
And my opinion has now changed because "this is the worst it will be" and I'm already finding it useful.
I think within 5 years, we won't even be having this discussion. The use of coding agents will be so prolific and obviously beneficial that the debate will just go away.
(all in my humble opinion)
"I use LLM-generated code extensively in my role as CEO of Carrington Labs, a provider of predictive-analytics risk models for lenders."
For example, even the people with the most negative view on AI don’t let candidates use AI during interviews.
You can disagree on the effectiveness of the tools but this fact alone suggests that they are quite useful, no?
It’s reasonable to accept that AI tools work well for some people and not for others.
There are many ways to integrate these tools and their capabilities vary wildly depending on the kind of task and project.
One example: "agents are not doing well with code in languages/frameworks which have many recent large and incompatible changes like SwiftUI" - me: that's a valid issue that can be slightly controlled for with project setup, but still largely unsolved, we could discuss the details.
Another example: "coding agents can't think and just hallucinate code" - me: lol, my shipped production code doesn't care, bring some real examples of how you use agents if they don't work for you.
There's a lot of the second type on HN.
That's also far from helpful or particularly meaningful.
But since there's grey in my beard, I've seen it several times: in every technological move forward there are obnoxious hype merchants, reactionary status quo defenders, and then the rest of us doing our best to muddle through.
Because some opinions are lazy. You can get all the summaries you want by searching "how I use agentic coding / Claude code" on the web or similar queries on YouTube, explaining in lots of details what's good and bad. If someone says "it's just hallucinations", it means they aren't actually interested and just want to complain.
honestly though idc about coding with it, i rarely get to leave excel for my work anyway. the fact that I can OCR anything in about a minute is a game changer though
The majority of HNers still reach for LLMs pretty regularly, even if they fail horribly frequently. That's really the pit the tech is stuck in. Sometimes it one-shots your answer perfectly, or pair-programs with you perfectly for one task, or notices a bug you didn't. Sometimes it wastes hours of your time for various subtle reasons. Sometimes it adamantly insists 2 + 2 = 55.
For my part, I point out there are a significant number of studies showing clear productivity boosts in coding, but those threads typically devolve to "How can they prove anything when we don't even know how to measure developer productivity?" (The better studies address this question and tackle it with well-designed statistical methods such as randomized controlled trials.)
Also, there are some pretty large Github repos out there that are mostly vibe-coded. Like, Steve Yegge got to something like 350 thousand LoC in 6 weeks on Beads. I've not looked at it closely, but the commit history is there for anyone to see: https://github.com/steveyegge/beads/commits/main/
Also, about half of it seems to be tests. It even has performance benchmarks, which are always a distant afterthought for anything other than infrastructure code in the hottest of loops! https://github.com/steveyegge/beads/blob/main/BENCHMARKS.md
This is one of the defining characteristics of vibe-coded projects: Extensive tests. That's what keeps the LLMs honest.
I had commented previously (https://news.ycombinator.com/item?id=45729826) that the logical conclusion of AI coding will look very weird to us and I guess this is one glimpse of it.
Meta measured a 6-12% uplift in productivity from adopting agentic coding. That's paltry. A Stanford case study found that after accounting for buggy code that needed to be reworked there may be no productivity uplift.
I haven't seen any study showing a genuine uplift after accounting for properly reviewing and fixing the AI generated code.
My other gripe too is productivity is only one aspect of software engineering. You also need to look at tech debt introduced and other aspects of quality.
Productivity also takes many forms so it's not super easy to quantify.
Finally... software engineers are far from being created equal. VERY big difference in what someone doing CRUD apps for a small web dev shop does vs. eg; an infra engineer in big tech.
I try to assume people who are trashing AI are just working in systems like that, rather than being bad at using AI, or worse, shit-talking the tech without really trying to get value out of it because they're ethically opposed to it.
A lot of strongly anti-AI people are really angry human beings (I suppose that holds for vehemently anti-<anything> people), which doesn't really help the case, it just comes off as old man shaking fist at clouds, except too young. The whole "microslop" thing came off as classless and bitter.
I can talk through a possible code change with it which is just a natural, easy and human way to work, our brains evolved to talk and figure things out in a conversation. The jury is out on how much this actually speeds things up or translates into a cost savings. But it reduces cognitive load.
We're still stuck in a mindset where we pretend knowledge workers are factory workers and they can sit there for 8 hours producing consistently with their brain turned off. "A couple hours a day of serious focus at best" is closer to the reality, so a LLM can turn the other half of the day into something more useful maybe?
There is also the problem that any LLM provider can and absolutely will enshittify the LLM overnight if they think it's in their best interest (feels like OpenAI has already done this).
My extremely casual observations on whatever research I've seen talked about has suggested that maybe with high quality AI tools you can get work done 10-20% faster? But you don't have to think quite as hard, which is where I feel the real benefit is.
Within that motte and bailey is, "well my AI workflow makes me a 100x developer, but my workflow goes to a different school in a different town and you don't know her".
There's value there, I use local and hosted LLMs myself, but I think there's an element of mania at play when it comes to self-evaluation of productivity and efficacy.
My work flow: Planning mode (iterations), execute plan, audit changes & prove to me the code is correct, debug runs + log ingestion to further prove it, human test, human review, commit, deploy. Iterate a couple of times if needed. I typically do around three of these in parallel to not overload my brain. I have done 6 in the past but then it hits me really hard (context switch whiplash) and I start making mistakes and missing things the tool does wrong.
To the ones saying it is not working well for them, why don't you show and tell? I cannot believe our experiences are so fundamentally different, I don't have some secret sauce but it did take a couple of months to figure out how to best manipulate the tool to get what I want out of it. Maybe these people just need to open their minds and let go of the arrogance & resistance to new tools.
How do you suggest? At a high level, the biggest problem is the high latency and the context switches. It is easy enough to get the AI to do one thing well. But because it takes so long, the only way to derive any real benefit is to have many agents doing many things at the same time. I have not yet figured out how to effectively switch my attention between them. But I wouldn't have any idea how to turn that into a show and tell.
The couple times I even tried that, the AI produced something that looked OK at first and kinda sorta ran but it quickly became a spaghetti I didn't understand. You have to keep such a short leash on it and carefully review every single line of code and understand thoroughly everything that it did. Why would I want to let that run for hours and then spend hours more debugging it or cleaning it up?
I use AI for small tasks or to finish my half-written code, or to translate code from one language to another, or to brainstorm different ways of approaching a problem when I have some idea but feel there's something better way to do it.
Or I let it take a crack when I have some concrete failing test or build; feeding that into an LLM loop is one of my favorite things, because it can just keep trying until it passes, and even if it comes up with something suboptimal you at least have something that compiles that you can tidy up a bit (a rough sketch of such a loop follows below).
Sometimes I'll have two sessions going but they're like 5-10 minute tasks. Long enough that I don't want to twiddle my thumbs for that long but small enough that I can rein it in.
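A rough sketch of that "keep trying until it passes" loop, again assuming a non-interactive agent CLI and, here, a pytest-based project; the commands are placeholders for whatever your setup uses:

    import subprocess

    # run the tests, feed any failure back to a fresh agent invocation,
    # and stop as soon as the suite is green (or give up after a few rounds)
    for attempt in range(5):
        tests = subprocess.run(["pytest", "-x", "-q"], capture_output=True, text=True)
        if tests.returncode == 0:
            break
        subprocess.run([
            "claude", "-p",
            "This test run fails; fix the code without changing the tests:\n\n"
            + (tests.stdout + tests.stderr)[-4000:],  # keep only the tail to stay within context
        ])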
Then there's the different tasks people might ask from it. Building a fully novel idea vs. CRUD for a family planner might have different outcomes.
It would be useful if we could have more specific discussions here, where we specify the tools and the tasks it either does or does not work for.
Sure, here you go:
Aider was something I liked and used quite heavily (with Sonnet). Claude Code has genuinely been useful. I've coded up things which I'm sure I could do myself if I had the time "on the side" and used them in "production". These were mostly personal tools, but I do use them on a daily basis and they are useful. The last big piece of work was refactoring a 4000-line program which I wrote piece by piece over several weeks into something with proper packages and structures. There were one or two hiccups but I have it working. Took a day and approximately $25.
If you’re not treating these tools like rockstar junior developers, then you’re “holding it wrong”.
With real junior developers you get the benefit of helping develop them into senior developers, but you really don't get that with AI.
Also: are you sure?
There’s as many of them as you’re talented enough to asynchronously instruct,
and you can tell them the boundaries within which to work (or not),
in order to avoid too little or too much being done for you to review and approve effectively.
I'm genuinely curious if this is actually more productive than a non-AI workflow, or if it just feels more productive because you're not writing the code.
You can. The conclusion would be that it doesn’t always work.
- They always write relatively long, zealous explainers of how productive they are (including some replies to your comment).
These two points together make me think: why do they care so much to convince me; why don't they just link me to the amazing thing they made, that would be pretty convincing?!
Are they being paid or otherwise incentivised to make these hyperbolic claims? To be fair they don't often look like vanilla LLM output but they do all have the same structure/patter to them.
We cannot with certainty assert that. If the datum is expected to be missing, such that the frame without the datum is still considered valid and must be handled rather than flagged as an error, the code has to do exactly that. Perhaps a missing value in the dictionary can be supplanted with a zero.
    df['new_column'] = df.get('index_value', 0) + 1
    # there might be no column 'index_value';
    # requirements say that zero should be substituted.

Edit: Changed 3.5 to 4.
Edit: Looking back to edits and checkins by AI agents, it strikes me that the checkins should contain the prompt used and model version. More recent Aider versions do add the model.
> This is a powerful idea, and no doubt contributed to the rapid improvement of AI coding assistants for a period of time. But as inexperienced coders started turning up in greater numbers, it also started to poison the training data.
It is not just `inexperienced coders` that make this signal pretty much useless. I mostly use coding assistants for boilerplate: I will accept the suggestion and then delete much of what it produced, especially in the critical path.
For many users, this is much faster than trying to get another approximation.
:,/^}/-d
Same for `10dd` etc... it is all muscle memory. Then again, I use a local fill-in-the-middle tiny LLM now, because it is good enough for most of the speedup without the cost/security/latency of a hosted model.

It would be a mistake to think that filtering out jr devs will result in good data, as the concept is flawed in general. Accepting output may not have anything to do with correctness of the provided content, IMHO.
I've been stung by them too many times.
The problem is the more I care about something, the less I'll agree with whatever the agent is trying to do.
> To start making models better again, AI coding companies need to invest in high-quality data, perhaps even paying experts to label AI-generated code.
AI trainers hired by companies like Outlier, Mercor and Alignerr are getting paid like $15-$45/hr. Reviewers are crap. The screening processes are horribly done by AI interviewers.
So much this... the number of times Claude sneaks default values, or avoids .unwrapping optional values just to avoid a crash at all costs... it's nauseating.
I think all general AI agents are running into that problem - as AI becomes more prevalent and people accept and propagate wrong answers, the AI agents are trained to believe those wrong answers.
It feels that lately, Google's AI search summaries are getting worse - they have a kernel of truth, but combine it with an incorrect answer.
For me, the writing speed has never been the issue. The issue has been my thinking speed. I do not see how an AI coding assistant helps me think better. Offloading thinking actually makes my thinking process worse and thus slower.
Similar to moving from individual work to coordinating a large codebase: coding agents, human or otherwise, let you think at a higher abstraction level and tackle larger problems by taking care of the small details.
I wonder if a very lightweight RL loop built around the user could work well enough to help the situation. As I understand it, current LLMs generally do not learn at a rate such that one single bad RL example and one (prompted?) better example could result in improvement at anywhere near human speed.
That said, the premise that AI-assisted coding got worse in 2025 feels off to me. I saw big improvements in the tooling last year.
Sometimes I am uncertain whether it's an absolute win. Analogy: I used to use Huel to save time on lunches to have more time to study. Turns out, lunches were not just refueling sessions but ways to relax. So I lost on that relaxation time and it ended up being +-0 long-term.
AI for sure is net positive in terms of getting more done, but it's way too easy to gloss over some details and you'll end up backtracking more.
"Reality has a surprising amount of detail" or something along those lines.
I put great effort into maintaining a markdown file with my world model (usecases x principles x requirements x ...) pertaining to the project, with every guardrail tightened as much as possible, and every ambiguity and interaction with the user or wider world explained. This situates the project in all applicable contexts. That 15k token file goes into every prompt.
What's strange is sometimes a fresh context window produces better results than one where you've been iterating. Like the conversation history is introducing noise rather than helpful context. Makes me wonder if there's an optimal prompt length beyond which you're actually degrading output quality.
From https://docs.github.com/en/copilot/concepts/prompting/prompt...:
Copilot Chat uses the chat history to get context about your request. To give Copilot only the relevant history:
- Use threads to start a new conversation for a new task
- Delete requests that are no longer relevant or that didn’t give you the desired result
For me, I get 1000x AI dev since my initial comparison point is very very low.
I created:
https://live.stingtao.info for me to do interactive live quizzes (like Kahoot)
https://presentation.stingtao.info for me to make presentations with AI (like Gamma)
https://catsbook.stingtao.info for cats to register their own 'facebook' social network.
All will be improved along the way with AI.
I do feel that AI code assistants are helping people like me to become 1000x superman.
I do think AI code assistant is super great.
Recently, I use Open Codex 5.2 + Extra high reasoning model with $200 monthly subscription most and it's the best among all the other coding agents.
(I have subscribed to 4 at the same time and use all of them across a dozen projects.)
Anyways, no issue. We'll just get Claude to start answering Stack Overflow questions!