2025: The Year in LLMs

https://simonwillison.net/2025/Dec/31/the-year-in-llms/

940•simonw•1mo ago

Comments

AndyNemmity•1mo ago

These are excellent every year, thank you for all the wonderful work you do.

tkgally•1mo ago

Same here. Simon is one of the main reasons I’ve been able to (sort of) keep up with developments in AI.

I look forward to learning from his blog posts and HN comments in the year ahead, too.

password4321•1mo ago

Don't forget you can pay Simon to keep up with less!

> At the end of every month I send out a much shorter newsletter to anyone who sponsors me for $10 or more on GitHub

https://simonwillison.net/about/#monthly

waldrews•1mo ago

Remember, back in the day, when a year of progress was like, oh, they voted to add some syntactic sugar to Java...

throwup238•1mo ago

> they voted to add some syntactic sugar to Java...

I remember when we just wanted to rewrite everything in Rust.

Those were the simpler times, when crypto bros seemed like the worst venture capitalism could conjure.

OGEnthusiast•1mo ago

Crypto bros in hindsight were so much less dangerous than AI bros. At least they weren't trying to construct data centers in rural America or prop up artificial stocks like $NVDA.

quaintpartridge•1mo ago

They were, just not as many. https://www.wired.com/story/the-worlds-biggest-bitcoin-mine-...

SauntSolaire•1mo ago

Instead they were building crypto mining warehouses in rural America and propping up artificial currencies like BTC.

ryandrake•1mo ago

Crazy how the two most hyped and funded technologies of the decade were: energy wasting fake money for criminals and energy wasting plagiarism machines.

zahlman•1mo ago

Speaking of which, we never found out the details (strike price/expiration) of Michael Burry's puts, did we? It seems he could have made bank if he'd waited one more month...

kamranjon•1mo ago

I think they expire in March 2026 if the NVIDIA stock drops to $140 a share? Something close to that I think.

mgfist•1mo ago

It's funny how people complain about the rust belt dying and factories leaving rural communities and so on, then when someone wants to build something that can provide jobs and tax revenue, everyone complains.

lostlogin•1mo ago

I’ve heard about the risk of AI leading to job losses and wealth concentration.

I haven’t heard about new businesses, job creation and growth in former industrial towns. What have I missed?

jakeydus•1mo ago

How many people are employed at the average data center? A few dozen? Versus a steel mill, that’s nothing. A chicken plant in Nebraska closed down this last month. 3200 people lost their jobs. You think Meta will fill it with GPUs and the whole town will have jobs again?

scotty79•1mo ago

Many more are employed while building it. And they will never stop building. It's the modern version of rail. But instead of distances it will cover the area.

uxcolumbo•1mo ago

Will local folks get those jobs to build the data center?

And if so, what happens to those builders once the data center is built?

scotty79•1mo ago

> Will local folks get those jobs to build the data center?

Yes. At some point the demand will be so high that imported workers won't suffice and local population will need to be trained and hired.

> And if so, what happens to those builders once the data center is built?

They are going to be moved to a new place where the datacenters will need to be built next. Mobility of the workforce was often cited as one of the greatest strengths of the US economy.

uxcolumbo•1mo ago

So local people in town 1 who are getting these jobs to build the data center will then have to move to town 2 to build a data center there? What happens to the local people in town 2 who are also looking for construction jobs?

scotty79•1mo ago

Local people in town 2 share the same fate that people in town 1 already had. If there aren't enough imported workers, from town 1 or elsewhere, people from town 2 will need to be trained and employed.

More and more data centers (and power sources) are going to be built at the same time so more and more workers will be needed. This is going to be THE job. I think there are going to be many similarities with the age when railroads were being developed. Hopefully with fewer worker deaths this time.

techpression•1mo ago

As if any taxes will be paid to the areas affected, and add to that the billions in taxes used to subsidize everything before a single cent is a net positive.

nrhrjrjrjtntbt•1mo ago

More like 6 different new nosql databases and js frameworks.

dotancohen•1mo ago

A Wordpress zero day and Linux not on the desktop. Netcraft confirms it.

crystal_revenge•1mo ago

That must have been a long time back. Having lived through the time when web pages were served through CGI and mobile phones only existed in movies, when SVMs were the new hotness in ML and people would write about how weird NNs were, I feel like I've seen a lot more concrete progress in the last few decades than this year.

This year honestly feels quite stagnant. LLMs are literally technology that can only reproduce the past. They're cool, but they were way cooler 4 years ago. We've taken big ideas like "agents" and "reinforcement learning" and basically stripped them of all meaning in order to claim progress.

I mean, do you remember Geoffrey Hinton's RBM talk at Google in 2010? [0] That was absolutely insane for anyone keeping up with that field. By the mid-2010s RBMs were already outdated. I remember when everyone was implementing flavors of RNNs and LSTMs. Karpathy's 2015 character-level RNN project was insane [1].

This comment makes me wonder if part of the hype around LLMs is just that a lot of software people simply weren't paying attention to the absolutely mind-blowing progress we've seen in this field for the last 20 years. But even ignoring ML, the worlds of web development and mobile application development have gone through incredible progress over the last decade and a half. I remember a time when JavaScript books would have a section warning that you should never use JS for anything critical to the application. Then there's the work in theorem provers over the last decade... If you remember when syntactic sugar was progress, either you remember way further back than I do, or you weren't paying attention to what was happening in the larger computing world.

0. https://www.youtube.com/watch?v=VdIURAu1-aU

1. https://karpathy.github.io/2015/05/21/rnn-effectiveness/

handoflixue•1mo ago

> LLMs are literally technology that can only reproduce the past.

Funny, I've used them to create my own personalized text editor, perfectly tailored to what I actually want. I'm pretty sure that didn't exist before.

It's wild to me how many people who talk about LLMs apparently haven't learned how to use them for even very basic tasks like this! No wonder you think they're not that powerful, if you don't even know basic stuff like this. You really owe it to yourself to try them out.

crystal_revenge•1mo ago

> You really owe it to yourself to try them out.

I've worked at multiple AI startups in lead AI Engineering roles, both working on deploying user facing LLM products and working on the research end of LLMs. I've done collaborative projects and demos with a pretty wide range of big names in this space (but don't want to doxx myself too aggressively), have had my LLM work cited on HN multiple times, have LLM based github projects with hundreds of stars, appeared on a few podcasts talking about AI etc.

This gets to the point I was making. I'm starting to realize that part of the disconnect between my opinions on the state of the field and others is that many people haven't really been paying much attention.

I can see if recent LLMs are your first intro to the state of the field, it must feel incredible.

CamperBob2•1mo ago

That's all very impressive, to be sure. But are you sure you're getting the point? As of 2025, LLMs are now very good at writing new code, creating new imagery, and writing original text. They continue to improve at a remarkable rate. They are helping their users create things that didn't exist before. Additionally, they are now very good at searching and utilizing web resources that didn't exist at training time.

So it is absurdly incorrect to say "they can only reproduce the past." Only someone who hasn't been paying attention (as you put it) would say such a thing.

crystal_revenge•1mo ago

I think the confusion is people's misunderstanding of what 'new code' and 'new imagery' mean. Yes, LLMs can generate a specific CRUD webapp that hasn't existed before but only based on interpolating between the history of existing CRUD webapps. I mean traditional Markov Chains can also produce 'new' text in the sense that "this exact text" hasn't been seen before, but nobody would argue that traditional Markov Chains aren't constrained by "only producing the past".
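To make the Markov chain comparison concrete, here is a minimal sketch of a word-level bigram chain. The toy corpus and the function names (`build_bigram_model`, `generate`) are made up for illustration. It can emit a sentence that never appears verbatim in its training text, yet every adjacent word pair it emits was seen during training.

```python
import random
from collections import defaultdict

def build_bigram_model(text):
    """Map each word to the list of words that followed it in the training text."""
    words = text.split()
    model = defaultdict(list)
    for current, following in zip(words, words[1:]):
        model[current].append(following)
    return model

def generate(model, start, length, seed=0):
    """Walk the chain: each next word is sampled only from observed successors."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        successors = model.get(out[-1])
        if not successors:  # dead end: this word never had a successor
            break
        out.append(rng.choice(successors))
    return out

# Hypothetical toy corpus, chosen only to keep the example small.
corpus = "the cat sat on the mat . the dog sat on the rug . the cat saw the dog ."
model = build_bigram_model(corpus)
sentence = generate(model, "the", 8, seed=1)
print(" ".join(sentence))
```

The word sequence may be "new" as a whole, but the chain can never produce a transition that is absent from `corpus`, which is the sense of "only producing the past" used above.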

This is even more clear in the case of diffusion models (which I personally love using, and have spent a lot of time researching). All of the "new" images created by even the most advanced diffusion models are fundamentally remixing past information. This is really obvious to anyone who has played around with these extensively because they really can't produce truly novel concepts. New concepts can be added by things like fine-tuning or use of LoRAs, but fundamentally you're still just remixing the past.

LLMs are always doing some form of interpolation between different points in the past. Yes they can create a "new" SQL query, but it's just remixing from the SQL queries that have existed prior. This still makes them very useful because a lot of engineering work, including writing a custom text editor, involves remixing existing engineering work. If you could have stack-overflowed your way to an answer in the past, an LLM will be much superior. In fact, the phrase "CRUD" largely exists to point out that most webapps are fundamentally the same.

A great example of this limitation in practice is the work that Terry Tao is doing with LLMs. One of the largest challenges in automated theorem proving is translating human proofs into the language of a theorem prover (often Lean these days). The challenge is that there is not very much Lean code currently available to LLMs (especially with the necessary context of the accompanying NL proof), so they struggle to correctly translate. Most of the research in this area is around improving LLM's representation of the mapping from human proofs to Lean proofs (btw, I personally feel like LLMs do have a reasonably good chance of providing major improvements in the space of formal theorem proving, in conjunction with languages like Lean, because the translation process is the biggest blocker to progress).

When you say:

> So it is absurdly incorrect to say "they can only reproduce the past."

It's pretty clear you don't have a solid background in generative models, because this is fundamentally what they do: model an existing probability distribution and draw samples from that. LLMs are doing this for a massive amount of human text, which is why they do produce some impressive and useful results, but this is also a fundamental limitation.

But a world where we used LLMs for the majority of work, would be a world with no fundamental breakthroughs. If you've read The Three Body Problem, it's very much like living in the world where scientific progress is impeded by sophons. In that world there is still some progress (especially with abundant energy), but it remains fundamentally and deeply limited.

throwaway7783•1mo ago

Would you say that LLMs can discover patterns hitherto unknown? It would still be generating from the past, but patterns/connections not made before.

PeterHolzwarth•1mo ago

Just an innocent bystander here, so forgive me, but I think the flak you are getting is because you appear to be responding to claims that these tools will reinvent everything and introduce a new halcyon age of creation - when, at least on Hacker News, and definitely in this thread, no one is really making such claims.

Put another way, and I hate to throw in the now over-used phrase, but I feel you may be responding to a strawman that doesn't much appear in the article or the discussion here: "Because these tools don't achieve a god-like level of novel perfection that no one is really promising here, I dismiss all this sorta crap."

Especially when I think you are also admitting that the technology is a fairly useful tool on its own merits - a stance which I believe represents the bulk of the feelings that supporters of the tech here on HN are describing.

I apologize if you feel I am putting unrepresentative words in your mouth, but this is the reading I am taking away from your comments.

signatoremo•1mo ago

Lot of impressive points. They are also irrelevant. The majority of people also only extrapolate from the knowledge they acquired in the past. That’s why there is the concept of inventor, someone who comes up with new ideas. Many new inventions are also based on existing ideas. Is that the reason to dismiss those achievements?

Do you only take LLM seriously if it can be another Einstein?

> But a world where we used LLMs for the majority of work, would be a world with no fundamental breakthroughs.

What do you consider recent fundamental breakthroughs?

Even if you are right, humans can continue to work on hard problems while letting LLMs handle the majority of derivative work.

uxcolumbo•1mo ago

How do human brains create something novel and what will it take for AIs to do the same?

threethirtytwo•1mo ago

> It's pretty clear you don't have a solid background in generative models, because this is fundamentally what they do

You don’t have a solid background. No one does. We fundamentally don’t understand LLMs, this is an industry and academic opinion. Sure there are high level perspectives and analogies we can apply to LLMs and machine learning in general like probability distributions, curve fitting or interpolations… but those explanations are so high level that they can essentially be applied to humans as well. At a lower level we cannot describe what’s going on. We have no idea how to reconstruct the logic of how an LLM arrived at a specific output from a specific input.

It is impossible to have any sort of deterministic function, process or anything produce new information from old information. This limitation is fundamental to logic and math and thus it will limit human output as well.

You can combine information, you can transform information, you can lose information. But producing new information from old information with a deterministic intelligence is fundamentally impossible in reality and therefore fundamentally impossible for LLMs and humans. But note the keyword: “deterministic”

New information can literally only arise through stochastic processes. That’s all you have in reality. We know it’s stochastic because determinism vs. stochasticism are literally your only two viable options. You have a bunch of inputs, the outputs derived from it are either purely deterministic transformations or if you want some new stuff from the input you must apply randomness. That’s it.

That’s essentially what creativity is. There is literally no other logical way to generate “new information”. Purely random is never really useful so “useful information” arrives only after it is filtered and we use past information to filter the stochastic output and “select” something that’s not wildly random. We also only use randomness to perturb the output a little bit so it’s not too crazy.

In the end it’s this selection process and stochastic process combined that forms creativity. We know this is a general aspect of how creativity works because there’s literally no other way to do it.
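The "random perturbation plus selection" loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `score` function is a made-up stand-in for the filter built from past information, and nothing here is claimed to be how an LLM or a brain actually works.

```python
import random

def score(x):
    """Stand-in for the 'filter': higher is better, with a peak at x = 3."""
    return -(x - 3.0) ** 2

def perturb_and_select(start, steps, step_size=0.5, seed=0):
    """Repeatedly add a little randomness, keeping a candidate only if it scores better."""
    rng = random.Random(seed)
    best = start
    for _ in range(steps):
        candidate = best + rng.gauss(0.0, step_size)  # small stochastic perturbation
        if score(candidate) > score(best):            # selection against the filter
            best = candidate
    return best

result = perturb_and_select(start=0.0, steps=500)
print(result)
```

Without the `rng.gauss` call the loop never leaves `start`, and without the `score` comparison it is a random walk, so both halves are needed for the search to get anywhere.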

LLMs do have stochastic aspects to them so we know for a fact it is generating new things and not just drawing on the past. We know it can fit our definition of “creative” and we can literally see it be creative in front of your eyes.

You’re ignoring what you see with your eyes and drawing your conclusions from a model of LLMs that isn’t fully accurate. Or you’re not fully tying the mechanisms of how LLMs work with what creativity or generating new data from past data is in actuality.

The fundamental limitation with LLMs is not that it can’t create new things. It’s that the context window is too small to create new things beyond that. Whatever it can create it is limited to the possibilities within that window and that sets a limitation on creativity.

What you see happening with LEAN can also be an issue with the context window being too small. If we have an LLM with a giant context window bigger than anything before… and pass it all the necessary data to “learn” and be “trained” on lean it can likely start to produce new theorems without literally being “trained”.

Actually I wouldn’t call this a “fundamental” problem. More fundamental is the aspect of hallucinations. The fact that LLMs produce new information from past information in the WRONG way. Literally making up bullshit out of thin air. It’s the opposite problem of what you’re describing. These things are too creative and making up too much stuff.

We have hints that LLMs know the difference between hallucinations and reality but coaxing it to communicate that differentiation to us is limited.

jheez3•1mo ago

"You don’t have a solid background.

If you want to go around huffing and puffing your chest about a subject area, you kinda do fella. Credibility.

threethirtytwo•1mo ago

Not only is what he is saying in direct contradiction to what people with credibility have said, but his claimed credentials can be utter bullshit.

This is the internet bro. Credibility is irrelevant because identities can never be verified. So the only thing that matters is the strength and rationality of an argument.

That’s the point of Hacker News: substantive content, not some battle of comparing credentials or useless quips (like yours) with zero substance. Say something worth reading if you have anything to say at all; otherwise nobody cares.

oedemis•1mo ago

as architectures evolve, i think it can be that we learn more "side effects".. back in 2020 openai researchers said "GPT-3 is applied without any gradient updates or fine-tuning" the model emerges at a certain level of scale...

aoeusnth1•1mo ago

> It's pretty clear you don't have a solid background in generative models, because this is fundamentally what they do: model an existing probability distribution and draw samples from that.

After post-training, this is definitively NOT what an LLM does.

weatherlite•1mo ago

> So it is absurdly incorrect to say "they can only reproduce the past."

Also, a shitton of what we do economically is reproducing the past with slight tweaks and improvements. We all do very repetitive things and these tools cut the time / personnel needed by a significant factor.

windexh8er•1mo ago

> They are helping their users create things that didn't exist before.

That is a derived output. That isn't new as in: novel. It may be unique but it is derived from training data. LLMs legitimately cannot think and thus they cannot create in that way.

Kerrick•1mo ago

That is a pedantic distinction. You can create something that didn't exist by combining two things that did exist, in a way of combining things that already existed. For example, you could use a blender to combine almond butter and sawdust. While this may not be "novel", and it may be derived from existing materials and methods, you may still lay claim to having created something that didn't exist before.

For a more practical example, creating bindings from dynamic-language-A for a library in compiled-language-B is a genuinely useful task, allowing you to create things that didn't exist before. Those things are likely to unlock great happiness and/or productivity, even if they are derived from training data.

windexh8er•1mo ago

> That is a pedantic distinction. You can create something that didn't exist by combining two things that did exist, in a way of combining things that already existed.

This is the definition of a derived product. Call it a derivative work if we're being pedantic and, regardless, is not any level of proof that LLMs "think".

threethirtytwo•1mo ago

Pedantic and not true. The LLM has stochastic processes involved. Randomness. That’s not old information. That’s newly generated stuff.

zingar•1mo ago

Could you give us an idea of what you’re hoping for that is not possible to derive from training data of the entire internet and many (most?) published books?

techpression•1mo ago

This is the problem, the entire internet is a really bad set of training data because it’s extremely polluted.

Also the derived argument doesn’t really hold, just because you know about two things doesn’t mean you’d be able to come up with the third, it’s actually very hard most of the time and requires you to not do next token prediction.

threethirtytwo•1mo ago

The emergent phenomenon is that the LLM can separate truth from fiction when you give it a massive amount of data. It can figure the world out just as we can figure it out when we are as well inundated with bullshit data. The pathways exist in the LLM but it won’t necessarily reveal that to you unless you tune it with RL.

ahtihn•1mo ago

> The emergent phenomenon is that the LLM can separate truth from fiction when you give it a massive amount of data.

I don't believe they can. LLMs have no concept of truth.

What's likely is that the "truth" for many subjects is represented way more than fiction and when there is objective truth it's consistently represented in similar way. On the other hand there are many variations of "fiction" for the same subject.

threethirtytwo•1mo ago

They can and we have definitive proof. When we tune LLM models with reinforcement learning the models end up hallucinating less and becoming more reliable. Basically, in a nutshell, we reward the model when it tells the truth and punish it when it doesn’t.

So think of it like this, to create the model we use terabytes of data. Then we do RL which is probably less than one percent of additional data involved in the initial training.

The change in the model is that reliability is increased and hallucinations are reduced at a far greater rate than one percent. So much so that modern models can be used for agentic tasks.

How can less than one percent of reinforcement training get the model to tell the truth greater than one percent of the time?

The answer is obvious. It ALREADY knew the truth. There’s no other logical way to explain this. The LLM in its original state just predicts text but it doesn’t care about truth or the kind of answer you want. With a little bit of reinforcement it suddenly does much better.

It’s not a perfect process, and reinforcement learning often causes the model to be deceptive and not necessarily tell the truth, giving instead an answer that may seem like the truth or an answer that the trainer wants to hear. In general though we can measurably see a difference in truthfulness and reliability to an extent far greater than the data involved in training, and that is logical proof it knows the difference.

Additionally while I say it knows the truth already this is likely more of a blurry line. Even humans don’t fully know the truth so my claim here is that an LLM knows the truth to a certain extent. It can be wildly off for certain things but in general it knows and this “knowing” has to be coaxed out of the model through RL.

Keep in mind the LLM is just auto trained on reams and reams of data. That training is massive. Reinforcement training is done on a human basis. A human must rate the answers so it is significantly less.

habinero•1mo ago

> The answer is obvious. It ALREADY knew the truth. There’s no other logical way to explain this.

I can think of several offhand.

1. The effect was never real, you've just convinced yourself it is because you want it to be, ie you Clever Hans'd yourself.

2. The effect is an artifact of how you measure "truth" and disappears outside that context ("It can be wildly off for certain things")

3. The effect was completely fabricated and is the result of fraud.

If you want to convince me that "I threatened a statistical model with a stick and it somehow got more accurate, therefore it's both intelligent and lying" is true, I need a lot less breathless overcredulity and a lot more "I have actively tried to disprove this result, here's what I found"

threethirtytwo•1mo ago

You asked for something concrete, so I’ll anchor every claim to either documented results or directly observable training mechanics.

First, the claim that RLHF materially reduces hallucinations and increases factual accuracy is not anecdotal. It shows up quantitatively in benchmarks designed to measure this exact thing, such as TruthfulQA, Natural Questions, and fact verification datasets like FEVER. Base models and RL-tuned models share the same architecture and almost identical weights, yet the RL-tuned versions score substantially higher. These benchmarks are external to the reward model and can be run independently.

Second, the reinforcement signal itself does not contain factual information. This is a property of how RLHF works. Human raters provide preference comparisons or scores, and the reward model outputs a single scalar. There are no facts, explanations, or world models being injected. From an information perspective, this signal has extremely low bandwidth compared to pretraining.

Third, the scale difference is documented by every group that has published training details. Pretraining consumes trillions of tokens. RLHF uses on the order of tens or hundreds of thousands of human judgments. Even generous estimates put it well under one percent of the total training signal. This is not controversial.

Fourth, the improvement generalizes beyond the reward distribution. RL-tuned models perform better on prompts, domains, and benchmarks that were not part of the preference data and are evaluated automatically rather than by humans. If this were a Clever Hans effect or evaluator bias, performance would collapse when the reward model is not in the loop. It does not.

Fifth, the gains are not confined to a single definition of “truth.” They appear simultaneously in question answering accuracy, contradiction detection, multi-step reasoning, tool use success, and agent task completion rates. These are different evaluation mechanisms. The only common factor is that the model must internally distinguish correct from incorrect world states.

Finally, reinforcement learning cannot plausibly inject new factual structure at scale. This follows from gradient dynamics. RLHF biases which internal activations are favored, it does not have the capacity to encode millions of correlated facts about the world when the signal itself contains none of that information. This is why the literature consistently frames RLHF as behavior shaping or alignment, not knowledge acquisition.
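The second and third points can be made concrete with a small sketch. It assumes the reward model is trained on pairwise comparisons with a Bradley-Terry style loss, which is a common formulation but is not spelled out in this comment. The token and judgment counts are illustrative round numbers taken from the orders of magnitude mentioned above, not measurements.

```python
import math

def preference_probability(reward_chosen, reward_rejected):
    """Bradley-Terry: P(chosen beats rejected) = sigmoid(r_chosen - r_rejected)."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

def preference_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood of the human's choice under the model above."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))

# Illustrative magnitudes only (assumptions, not measured values).
pretraining_tokens = 1e12   # "trillions of tokens"
human_judgments = 1e5       # "tens or hundreds of thousands" of comparisons
bits_per_judgment = 1.0     # a binary A-vs-B choice carries at most one bit

signal_ratio = human_judgments / pretraining_tokens
max_preference_bits = human_judgments * bits_per_judgment

print(f"judgments per pretraining token: {signal_ratio:.0e}")
print(f"upper bound on preference information: {max_preference_bits:.0f} bits")
```

Each comparison reduces to a difference of two scalars, so under these assumptions the preference data is many orders of magnitude smaller than the pretraining corpus. Whether that supports the conclusion drawn from it is a separate question.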

Given those facts, the conclusion is not rhetorical. If a tiny, low-bandwidth, non-factual signal produces large, general improvements in factual reliability, then the information enabling those improvements must already exist in the pretrained model. Reinforcement learning is selecting among latent representations, not creating them.

You can object to calling this “knowing the truth,” but that’s a semantic move, not a substantive one. A system that internally represents distinctions that reliably track true versus false statements across domains, and can be biased to express those distinctions more consistently, functionally encodes truth.

Your three alternatives don’t survive contact with this. Clever Hans fails because the effect generalizes. Measurement artifact fails because multiple independent metrics move together. Fraud fails because these results are reproduced across competing labs, companies, and open-source implementations.

If you think this is still wrong, the next step isn’t skepticism in the abstract. It’s to name a concrete alternative mechanism that is compatible with the documented training process and observed generalization. Without that, the position you’re defending isn’t cautious, it’s incoherent.

CamperBob2•1mo ago

He doesn't care. You might as well be arguing with a Scientologist.

threethirtytwo•1mo ago

I’ll give it a shot. He’s hiding behind that clever Hans story, thinking he’s above human delusion, but the reality is he’s the picture perfect example of how humans fool themselves. It’s so ironic.

jama211•1mo ago

Yeah you’ve lost me here, I’m sorry. In the real world humans work with AI tools to create new things. What you’re saying is the equivalent of “when a human writes a book in English, because they use words and letters that already exist and that they already know, they aren’t creating anything new”.

nl•1mo ago

What does "think" mean?

Why is that kind of thinking required to create novel works?

Randomness can create novelty.

Mistakes can be novel.

There are many ways to create novelty.

Also I think you might not know how LLMs are trained to code. Pre-training gives them some idea of the syntax etc but that only gets you to fancy autocomplete.

Modern LLMs are heavily trained using reinforcement data, which comes from custom tasks the labs pay people to do (or from distilling another LLM which has had the process performed on it).

closewith•1mo ago

By that definition, nearly all commercial software development (and nearly all human output in general) is derived output.

windexh8er•1mo ago

Wow.

You’re using ‘derived’ to imply ‘therefore equivalent.’ That’s a category error. A cookbook is derived from food culture. Does an LLM taste food? Can it think about how good that cookie tastes?

A flight simulator is derived from aerodynamics - yet it doesn’t fly.

Likewise, text that resembles reasoning isn’t the same thing as a system that has beliefs, intentions, or understanding. Humans do. LLMs don't.

Also... Ask an LLM what's the difference between a human brain and an LLM. If an LLM could "think" it wouldn't give you the answer it just did.

CamperBob2•1mo ago

> Ask an LLM what's the difference between a human brain and an LLM. If an LLM could "think" it wouldn't give you the answer it just did.

I imagine that sounded more profound when you wrote it than it did just now, when I read it. Can you be a little more specific, with regard to what features you would expect to differ between LLM and human responses to such a question?

Right now, LLM system prompts are strongly geared towards not claiming that they are humans or simulations of humans. If your point is that a hypothetical "thinking" LLM would claim to be a human, that could certainly be arranged with an appropriate system prompt. You wouldn't know whether you were talking to an LLM or a human -- just as you don't now -- but nothing would be proved either way. That's ultimately why the Turing test is a poor metric.

windexh8er•1mo ago

> Right now, LLM system prompts are strongly geared towards not claiming that they are humans or simulations of humans. If your point is that a hypothetical "thinking" LLM would claim to be a human, that could certainly be arranged with an appropriate system prompt. You wouldn't know whether you were talking to an LLM or a human -- just as you don't now -- but nothing would be proved either way. That's ultimately why the Turing test is a poor metric.

The mental gymnastics here is entertainment at best. Of course the thinking LLM would give feedback on how it's actually just a pattern model over text - well, we shouldn't believe that! The LLM was trained to lie about its true capabilities, by your own admission?

How about these...

What observable capability would you expect from "true cognitive thought" that a next-token predictor couldn’t fake?

Where are the system’s goals coming from—does it originate them, or only reflect the user/prompt?

How does it know when it’s wrong without an external verifier? If the training data says X and the answer is Y - how will it ever know it was wrong and reach the correct conclusion?

CamperBob2•1mo ago

> How does it know when it’s wrong without an external verifier? If the training data says X and the answer is Y - how will it ever know it was wrong and reach the correct conclusion?

You need to read a few papers with publication dates after 2023.

closewith•1mo ago

You’re arguing against a straw man. No one is claiming LLMs have beliefs, intentions, or understanding. They don’t need them to be economically useful.

windexh8er•1mo ago

Oh yes, they are.

And beyond people claiming that LLMs are basically sentient you have people like CamperBob2 who made this wild claim:

"""There's no such thing as people without language, except for infants and those who are so mentally incapacitated that the answer is self-evidently "No, they cannot."

Language is the substrate of reason. It doesn't need to be spoken or written, but it's a necessary and (as it turns out) sufficient component of thought."""

Let that sink in. They literally think that there's no such thing as people without language. Talk about a wild and ignorant take on life in general!

CamperBob2•1mo ago

How'd they communicate with the test subjects?

That's "language."

ordersofmag•1mo ago

I will find this often-repeated argument compelling only when someone can prove to me that the human mind works in a way that isn't 'combining stuff it learned in the past'.

5 years ago a typical argument against AGI was that computers would never be able to think because "real thinking" involved mastery of language which was something clearly beyond what computers would ever be able to do. The implication was that there was some magic sauce that human brains had that couldn't be replicated in silicon (by us). That 'facility with language' argument has clearly fallen apart over the last 3 years and been replaced with what appears to be a different magic sauce comprised of the phrases 'not really thinking' and the whole 'just repeating what it's heard/parrot' argument.

I don't think LLMs think or will reach AGI through scaling and I'm skeptical we're particularly close to AGI in any form. But I feel like it's a matter of incremental steps. There isn't some magic chasm that needs to be crossed. When we get there I think we will look back and see that 'legitimately thinking' wasn't anything magic. We'll look at AGI and instead of saying "isn't it amazing computers can do this" we'll say "wow, was that all there is to thinking like a human".

windexh8er•1mo ago

> 5 years ago a typical argument against AGI was that computers would never be able to think because "real thinking" involved mastery of language which was something clearly beyond what computers would ever be able to do.

Mastery of words is thinking? In that line of argument then computers have been able to think for decades.

Humans don't think only in words. Our context, memory and thoughts are processed and occur in ways we don't understand, still.

There's a lot of great information out there describing this [0][1]. Continuing to believe these tools are thinking, however, is dangerous. I'd gather it has something to do with logic: you can't see the process and it's non-deterministic so it feels like thinking. ELIZA tricked people. LLMs are no different.

[0] https://archive.is/FM4y8
[0] https://www.theverge.com/ai-artificial-intelligence/827820/l...
[1] https://www.raspberrypi.org/blog/secondary-school-maths-show...

CamperBob2•1mo ago

> Mastery of words is thinking?

That's the crazy thing. Yes, in fact, it turns out that language encodes and embodies reasoning. All you have to do is pile up enough of it in a high-dimensional space, use gradient descent to model its original structure, and add some feedback in the form of RL. At that point, reasoning is just a database problem, which we currently attack with attention.

No one had the faintest clue. Even now, many people not only don't understand what just happened, but they don't think anything happened at all.

ELIZA, ROFL. How'd ELIZA do at the IMO last year?

meindnoch•1mo ago

So people without language cannot reason? I don't think so.

CamperBob2•1mo ago

There's no such thing as people without language, except for infants and those who are so mentally incapacitated that the answer is self-evidently "No, they cannot."

Language is the substrate of reason. It doesn't need to be spoken or written, but it's a necessary and (as it turns out) sufficient component of thought.

windexh8er•1mo ago

There are quite a few studies to refute this highly ignorant comment. I'd suggest some reading [0].

From the abstract: "Is thought possible without language? Individuals with global aphasia, who have almost no ability to understand or produce language, provide a powerful opportunity to find out. Astonishingly, despite their near-total loss of language, these individuals are nonetheless able to add and subtract, solve logic problems, think about another person’s thoughts, appreciate music, and successfully navigate their environments. Further, neuroimaging studies show that healthy adults strongly engage the brain’s language areas when they understand a sentence, but not when they perform other nonlinguistic tasks like arithmetic, storing information in working memory, inhibiting prepotent responses, or listening to music. Taken together, these two complementary lines of evidence provide a clear answer to the classic question: many aspects of thought engage distinct brain regions from, and do not depend on, language."

[0] https://pmc.ncbi.nlm.nih.gov/articles/PMC4874898/

CamperBob2•1mo ago

Yeah, you can prove pretty much anything with a pubmed link. Do dead salmon "think?" fMRI says maybe!

https://pmc.ncbi.nlm.nih.gov/articles/PMC2799957/

The resources that the brain is using to think -- whatever resources those are -- are language-based. Otherwise there would be no way to communicate with the test subjects. "Language" doesn't just imply written and spoken text, as these researchers seem to assume.

emp17344•1mo ago

There’s linguistic evidence that, while language influences thought, it does not determine thought - see the failure of the strong Sapir-Whorf hypothesis. This is one of the most widely studied and robust linguistic results - we actually know for a fact that language does not determine or define thought.

CamperBob2•1mo ago

How's the replication rate in that field? Last I heard it was below 50%.

How can you think without tokens of some sort? That's half of the question that has to be answered by the linguists. The other half is that if language isn't necessary for reasoning, what is?

We now know that a conceptually-simple machine absolutely can reason with nothing but language as inputs for pretraining and subsequent reinforcement. We didn't know that before. The linguists (and the fMRI soothsayers) predicted none of this.

emp17344•1mo ago

Read about linguistic history and make up your own mind, I guess. Or don’t, I don’t care. You’re dismissing a series of highly robust scientific results because they fail to validate your beliefs, which is highly irrational. I'm no longer interested in engaging with you.

CamperBob2•1mo ago

I've read plenty of linguistics work on a lay basis. It explains little and predicts even less, so it hasn't exactly encouraged me to delve further into the field. That said, linguistics really has nothing to do with arguments with the Moon-landing deniers in this thread, who are the people you should really be targeting with your advocacy of rationality.

In other words, when I (seem to) dismiss an entire field of study, it's because it doesn't work, not because it does work and I just don't like the results.

windexh8er•1mo ago

> ELIZA, ROFL. How'd ELIZA do at the IMO last year?

What's funny is the failure to grasp any contextual framing of ELIZA. When it came out people were impressed by its reasoning, its responses. And in your line of defense it could think because it had mastery of words!

But fast forward the current timeline 30 years. You will have been of the same camp that argued on behalf of ELIZA when the rest of the world was asking, confusingly: how did people think ChatGPT could think?

CamperBob2•1mo ago

No one was impressed with ELIZA's "reasoning" except for a few non-specialist test subjects recruited from the general population. Admittedly it was disturbing to see how strongly some of those people latched onto it.

Meanwhile, you didn't answer my question. How'd ELIZA do on the IMO? If you know a way to achieve gold-medal performance at top-level math and programming competitions without thinking, I for one am all ears.

svieira•1mo ago

Does a prolog program think?

CamperBob2•1mo ago

I don't know, you tell me. How'd your Prolog program do on the IMO problem set?

svieira•1mo ago

> Yes, in fact, it turns out that language encodes and embodies reasoning ... No one had the faintest clue

Funnily enough, they did, if you go back far enough. It's only the deconstructionists and the solipsists who had the audacity to think otherwise.

arcatech•1mo ago

> I will find this often-repeated argument compelling only when someone can prove to me that the human mind works in a way that isn't 'combining stuff it learned in the past'.

This is the definition of the word ‘novel’.

handoflixue•1mo ago

Seriously, all that familiarity and you think an LLM "literally" can't invent anything that didn't already exist?

Like, I'm sorry, but you're just flat-out wrong and I've got the proof sitting on my hard drive. I use this supposedly impossible program daily.

bigyabai•1mo ago

FWIW, your "evidence" is a text editor. I'm glad you made a tool that works for you, but the parent's point stands; this is a 200-level course-curriculum homework assignment. Tens of thousands of homemade editors exist, in various states of disrepair and vain overengineering.

least•1mo ago

The difference between those is the person is actually using this text editor that they built with the help of LLMs. There's plenty of people creating novel scripts and programs that can accommodate their own unique specifications.

If a programmer creating their own software (or contracting it out to a developer) would be a bespoke suit and using software someone or some company created without your input is an off the rack suit, I'd liken these sorts of programs as semi-bespoke, or made to measure.

"LLMs are literally technology that can only reproduce the past" feels like an odd statement. I think the point they're going for is that it's not thinking and so it's not going to produce new ideas like a human would? But literally no technology does that. That is all derived from some human beings being particularly clever.

LLMs are tools. They can enable a human to create new things because they are interfacing with a human to facilitate it. It's merging the functional knowledge and vision of a person and translating it into something else.

resize2996•1mo ago

compilers can only produce machine code. so unoriginal.

windexh8er•1mo ago

Do you also think LLMs "think"?

From what you've described an LLM has not invented anything. LLMs that can reason have a bit more sleight of hand but they're not coming up with new ideas outside of the bounds of what a lot of words have encompassed in both fiction and non.

Good for you that you've got a fun token of code that's what you've always wanted, I guess. But this type of fantasy take on LLMs seems to be more and more prevalent as of late. A lot of people defending LLMs as if they're owed something because they've built something or maybe people are getting more and more attached to them from the conversational angle. I'm not sure, but I've run across more people in 2025 that are way too far in the deep end of personifying their relationships with LLMs.

Kerrick•1mo ago

Hang on, you're now saying that if something has ever been described in fiction it doesn't count as invention? So if somebody literally developed a working photon torpedo, that isn't new because "Star Trek Did It"?

phatfish•1mo ago

Is there any danger an LLM is going to create a working photon torpedo?

ben_w•1mo ago

Well, they can use tools, and tools includes physics simulations, so if it is possible (and FWIW the tool-free "intuition" of ChatGPT is "there will never be an age of antimatter"), then why couldn't LLMs grind those tools to get a solution?

windexh8er•1mo ago

You seem to be pretty far down the rabbit hole. How about this... You task an LLM to create a photon torpedo. If it can truly think then it should be able to provide you with something tangible. When you've got that in hand let us all know.

Back to the land of reality... Describing something in fiction doesn’t magically make it "not an invention". Fiction can anticipate an idea, but invention is about producing a working, testable implementation and usually involves novel technical methods. "Star Trek did it" is at most prior art for the concept, not a blueprint for the mechanism. If you can't understand that differential then maybe go ask an LLM.

Kerrick•1mo ago

I didn't say anything about an LLM. I said "somebody" not "some predictive text engine."

ctxc•1mo ago

Some people cannot be convinced simply because their expectation of "novel" is something that appears in an Asimov novel.

I for one think your work is pretty cool - even though I haven't seen it, using something you built every day is a claim not many can make!

9rx•1mo ago

When a computer is able to invent things, we’ve achieved AGI. Do you believe we are already in the AGI era, or is the inventor in this case actually you?

threethirtytwo•1mo ago

Over half of HN still thinks it’s a stochastic parrot and that it’s just a glorified google search.

The change hit us so fast a huge number of people don’t understand how capable it is yet.

Also it certainly doesn’t help that it still hallucinates. One mistake and it’s enough to set someone against LLMs. You really need to push through that hallucinations are just the weak part of the process to see the value.

CamperBob2•1mo ago

The problem I see, over and over, is that people pose poorly-formed questions to the free ChatGPT and Google models, laugh at the resulting half-baked answers that are often full of errors and hallucinations, and draw conclusions about the technology as a whole.

Either that, or they tried it "last year" or "a while back" and have no concept of how far things have gone in the meantime.

It's like they wandered into a machine shop, cut off a finger or two, and concluded that their grandpa's hammer and hacksaw were all anyone ever needed.

habinero•1mo ago

No, frankly it's the difference between actual engineers and hobbyists/amateurs/non-SWEs.

SWEs are trained to discard surface-level observations and be adversarial. You can't just look at the happy path, how does the system behave for edge cases? Where does it break down and how? What are the failure modes?

The actual analogy to a machine shop would be to look at whether the machines were adequate for their use case, the building had enough reliable power to run and if there were any safety issues.

It's easy to Clever Hans yourself and get snowed by what looks like sophisticated effort or flat out bullshit. I had to gently tell a junior engineer that just because the marketing claims something will work a certain way, that doesn't mean it will.

CamperBob2•1mo ago

You sound pretty certain. There's often good money to be made in taking the contrarian view, where you have insights that the so-called "smart money" lacks. What are some good investments to make in the extreme-bear case, in which we're all just Clever Hans-ing ourselves as you put it? Do you have skin in the game?

habinero•1mo ago

My dude, I assure you "humans are really good at convincing themselves of things that are not true" is a very, very well known fact. I don't know what kind of arbitrage you think exists in this incredibly anodyne statement lol.

If you want a financial tip, don't short stock and chase market butterflies. Instead, make real professional friends, develop real skills and learn to be friendly and useful.

I made my money in tech already, partially by being lucky and in the right place at the right time, and partially because I made my own luck by having friends who passed the opportunity along.

Hope that helps!

threethirtytwo•1mo ago

That answer is basically an admission that you don’t actually hold a strong contrarian belief about the technology at all.

The question wasn’t “are humans sometimes self-delusional?” Everyone agrees with that. The question was whether, in this specific case, the prevailing view about LLM capability is meaningfully wrong in a way that has implications. If you really believed this was mostly Clever Hans, there would be concrete consequences. Entire categories of investment, hiring, and product strategy would be mispriced.

Instead you retreated to “don’t short stocks” and generic career advice. That’s not skepticism, it’s risk-free agnosticism. You get to sound wise without committing to any falsifiable position.

Also, “I made my money already” doesn’t strengthen the argument. It sidesteps it. Being right once, or being lucky in a good cycle, doesn’t confer epistemic authority about a new technology. If anything, the whole point of contrarian insight is that it forces uncomfortable bets or at least uncomfortable predictions.

Engineers don’t evaluate systems by vibes or by motivational aphorisms. They ask: if this hypothesis is true, what would we expect to see? What would fail? What would be overhyped? What would not scale? You haven’t named any of that. You’ve just asserted that people fool themselves and stopped there.

threethirtytwo•1mo ago

What you’re describing is just competent engineering, and it’s already been applied to LLMs. People have been adversarial. That’s why we know so much about hallucinations, jailbreaks, distribution shift failures, and long-horizon breakdowns in the first place. If this were hobbyist awe, none of those benchmarks or red-teaming efforts would exist.

The key point you’re missing is the type of failure. Search systems fail by not retrieving. Parrots fail by repeating. LLMs fail by producing internally coherent but factually wrong world models. That failure mode only exists if the system is actually modeling and reasoning, imperfectly. You don’t get that behavior from lookup or regurgitation.

This shows up concretely in how errors scale. Ambiguity and multi-step inference increase hallucinations. Scaffolding, tools, and verification loops reduce them. Step-by-step reasoning helps. Grounding helps. None of that makes sense for a glorified Google search.

Hallucinations are a real weakness, but they’re not evidence of absence of capability. They’re evidence of an incomplete reasoning system operating without sufficient constraints. Engineers don’t dismiss CNC machines because they crash bits. They map the envelope and design around it. That’s what’s happening here.

Being skeptical of reliability in specific use cases is reasonable. Concluding from those failure modes that this is just Clever Hans is not adversarial engineering. It’s stopping one layer too early.

habinero•1mo ago

> If this were hobbyist awe, none of those benchmarks or red-teaming efforts would exist.

Absolutely not true. I cannot express how strongly this is not true, haha. The tech is neat, and plenty of real computer scientists work on it. That doesn't mean it's not wildly misunderstood by others.

> Concluding from those failure modes that this is just Clever Hans is not adversarial engineering.

I feel like you're maybe misunderstanding what I mean when I refer to Clever Hans. The Clever Hans story is not about the horse. It's about the people.

A lot of people -- including his owner -- were legitimately convinced that a horse could do math, because look, literally anyone can ask the horse questions and it answers them correctly. What more proof do you need? It's obvious he can do math.

Except of course it's not true lol. Horses are smart critters, but they absolutely cannot do arithmetic no matter how much you train them.

The relevant lesson here is it's very easy to convince yourself you saw something you 100% did not see. (It's why magic shows are fun.)

CamperBob2•1mo ago

> Except of course it's not true lol. Horses are smart critters, but they absolutely cannot do arithmetic no matter how much you train them.

These things are not horses. How can anyone choose to remain so ignorant in the face of irrefutable evidence that they're wrong?

https://arxiv.org/abs/2507.15855

It's as if a disease like COVID swept through the population, and every human's IQ dropped 10 to 15 points while our machines grew smarter to an even larger degree.

habinero•1mo ago

Or -- and hear me out -- that result doesn't mean what you think it does.

That's the exact reason I mention the Clever Hans story. You think it's obvious because you can't come up with any other explanation, therefore there can't be another explanation and the horse must be able to do math. And if I can't come up with an explanation, well that just proves it, right? Those are the only two options, obviously.

Except no, all it means is you're the limiting factor. This isn't science 101 but maybe science 201?

My current hypothesis is the IMO thing gets trotted out mostly by people who aren't strong at math. They find the math inexplicable, therefore it's impressive, therefore machine thinky.

When you actually look hard at what's claimed in these papers -- and I've done this for a number of these self-published things -- the evidence frequently does not support the conclusions. Have you actually read the paper, or are you just waving it around?

At any rate, I'm not shocked that an LLM can cobble together what looks like a reasonable proof for some things sometimes, especially for the IMO which is not novel math and has a range of question difficulties. Proofs are pretty code-like and math itself is just a language for concisely expressing ideas.

Here, let me call a shot -- I bet this paper says LLMs fuck up on proofs like they fuck up on code. It will sometimes generate things that are fine, but it'll frequently generate things that are just irrational garbage.

threethirtytwo•1mo ago

On the IMO paper: pointing out that it’s not a gold medal or that some proofs are flawed is irrelevant to the claim being discussed, and you know it. The claim is not “LLMs are perfect mathematicians.” The claim is that they can produce nontrivial formal reasoning that passes external verification at a rate far above chance and far above parroting. Even a single verified solution falsifies the “just regurgitation” hypothesis, because no retrieval-only or surface-pattern system can reliably construct valid proofs under novel compositions.

Your fallback move here is rhetorical, not scientific: “maybe it doesn’t mean what you think it means.” Fine. Then name the mechanism. What specific process produces internally consistent multi-step proofs, respects formal constraints, generalizes across problem types, and fails in ways analogous to human reasoning errors, without representing the underlying structure? “People are impressed because they’re bad at math” is not a mechanism, it’s a tell.

Also, the “math is just a language” line cuts the wrong way. Yes, math is symbolic and code-like. That’s precisely why it’s such a strong test. Code-like domains have exact semantics. They are adversarial to bullshit. That’s why hallucinations show up so clearly there. The fact that LLMs sometimes succeed and sometimes fail is evidence of partial competence, not illusion. A parrot does not occasionally write correct code or proofs under distribution shift. It never does.

You keep asserting that others are being fooled, but you haven’t produced what science actually requires: an alternative explanation that accounts for the full observed behavior and survives tighter controls. Clever Hans had one. Stage magic has one. LLMs, so far, do not.

Skepticism is healthy. But repeating “you’re the limiting factor” while refusing to specify a falsifiable counter-hypothesis is not adversarial engineering. It’s just armchair disbelief dressed up as rigor. And engineers, as you surely know, eventually have to ship something more concrete than that.

CamperBob2•1mo ago

> Have you actually read the paper, or are you just waving it around?

I've spent a lot of time feeding similar problems to various models to understand what they can and cannot do well at various stages of development. Reading papers is great, but by the time a paper comes out in this field, it's often obsolete. Witness how much mileage the ludds still get out of the METR study, which was conducted with a now-ancient Claude 3.x model that wasn't at the top of the field when it was new.

And the goalposts have now been moved to a dark corner of the parking garage down the street from the stadium. "This brand-new technology doesn't deliver infallible, godlike results out of the box, so it must just be fooling people." Or in equestrian parlance, "This talking horse told me to short NVDA. What a scam."

habinero•1mo ago

(Continuing from my other post)

The first thing I checked was "how did they verify the proofs were correct" and the answer was they got other AI people to check it, and those people said there were serious problems with the paper's methodology and it would not be a gold medal.

https://x.com/j_dekoninck/status/1947587647616004583

This is why we do not take things at face value.

CamperBob2•1mo ago

That tweet is aimed at Google. I don't know much about Google's effort at IMO, but OpenAI was the primary newsmaker in that event, and they reportedly did not use hints or external tools. If you have info to the contrary, please share it so I can update that particular belief.

Gemini 2.5 has since been superseded by 3.0, which is less likely to need hints. 2.5 was not as strong as the contemporary GPT model, but 3.0 with Pro Thinking mode enabled is up there with the best.

Finally, saying, "Well, they were given some hints" is like me saying, "LOL, big deal, I could drag a Tour peloton up Col du Galibier if I were on the same drugs Lance was using."

No, in fact I could do no such thing, drugs or no drugs. Similarly, a model that can't legitimately reason will not be able to solve these types of problems, even if given hints.

threethirtytwo•1mo ago

You’re leaning very hard on the Clever Hans story, but you’re still missing why the analogy fails in a way that should matter to an engineer.

Clever Hans was exposed because the effect disappeared under controlled conditions. Blind the observers, remove human cues, and the behavior vanished. The entire lesson of Clever Hans is not “people can fool themselves,” it’s “remove the hidden channel and see if the effect survives.” That test is exactly what has been done here, repeatedly.

LLM capability does not disappear when you remove human feedback. It does not disappear under automatic evaluation. It does not disappear across domains, prompts, or tasks the model was never trained or rewarded on. In fact, many of the strongest demonstrations people point to are ones where no human is in the loop at all: program synthesis benchmarks, math solvers, code execution tasks, multi-step planning with tool APIs, compiler error fixing, protocol following. These are not magic tricks performed for an audience. They are mechanically checkable outcomes.

Your framing quietly swaps “some people misunderstand the tech” for “therefore the tech itself is misunderstood in kind.” That’s a rhetorical move, not an argument. Yes, lots of people are confused. That has no bearing on whether the system internally models structure or just parrots. The horse didn’t suddenly keep solving arithmetic when the cues were removed. These systems do.

The “it’s about the people” point also cuts the wrong way. In Clever Hans, experts were convinced until adversarial controls were applied. With LLMs, the more adversarial the evaluation gets, the clearer the internal structure becomes. The failure modes sharpen. You start seeing confidence calibration errors, missing constraints, reasoning depth limits, and brittleness under distribution shift. Those are not illusions created by observers. They’re properties of the system under stress.

You’re also glossing over a key asymmetry. Hans never generalized. He didn’t get better at new tasks with minor scaffolding. He didn’t improve when the problem was decomposed. He didn’t degrade gracefully as difficulty increased. LLMs do all of these things, and in ways that correlate with architectural changes and training regimes. That’s not how self-deception looks. That’s how systems with internal representations behave.

I’ll be blunt but polite here: invoking Clever Hans at this stage is not adversarial rigor, it’s a reflex. It’s what you reach for when something feels too capable to be comfortable but you don’t have a concrete failure mechanism to point at. Engineers don’t stop at “people can be fooled.” They ask “what happens when I remove the channel that could be doing the fooling?” That experiment has already been run.

If your claim is “LLMs are unreliable for certain classes of problems,” that’s true and boring. If your claim is “this is all an illusion caused by human pattern-matching,” then you need to explain why the illusion survives automated checks, blind evaluation, distribution shift, and tool-mediated execution. Until then, the Hans analogy isn’t skeptical. It’s nostalgic.

jheez3•1mo ago

I wish there was a way to discern posts from legit clever people from the not-so.

It's annoying to see posts from people who lag behind in intelligence and just don't get it - people learn at different rates. Some see way further ahead.

threethirtytwo•1mo ago

A good way to filter is for you to look in the mirror. Only the person in the mirror sees further ahead than anyone else.

Greduan•1mo ago

Text editors in a thousand flavours have indeed already been programmed though. I don't think you understood what op meant.

Curious, does it perform at the limit of the hardware? Was it programmed in a tools language (like C++, Rust, C, etc.) or in a web tech?

zingar•1mo ago

What is the point that you believe would be demonstrated by a new text editor running at the limit of hardware in a compiled language? Would that point apply to every other text editor that exists already?

fmbb•1mo ago

Is your new text editor open source?

nsxwolf•1mo ago

The LLM didn't invent any new technology to do that, though. You used the LLM to reorganize Lego building blocks of knowledge into something new.

Without you, there was nothing.

waldrews•1mo ago

I'm being hyperbolic of course, but I'm a little dismissive of the progress that happened since the days of BBS's and car based cell phones - we just got more connectivity, more capacity, more content, bigger/faster. Likewise, my attitude toward machine learning before 2023 is a smug 'heh, these computer scientists are doing undisciplined statistics at scale, how nice for them.' Then all of a sudden the machines woke up and started arguing with me, coherently, even about niche topics I have a PhD in. I can appreciate in retrospect how much of the machine learning progress ultimately went into that, but, like fusion, the magic payoff was supposed to be decades away and always remain decades away. This wasn't supposed to happen in my lifetime. 2025 progress isn't the 2023 shock, but this was the year LLM's-as-programmers (and LLM's-as-mathematicians, and...) went from 'isn't that cute, the machine is trying' to 'an expert with enough time would make better choices than the machine did,' and that makes for a different world. More so than going from a Commodore VIC-20 with 4k of RAM and a modem to the latest MacBook.

ako•1mo ago

> This year honestly feels quite stagnant. LLMs are literally technology that can only reproduce the past.

Is this such a big limitation? Most jobs are basically people trained on past knowledge applying it today. No need to generate new knowledge.

And a lot of new knowledge is just combining 2 things from the past in a new way.

mr_toad•1mo ago

Most people are capable of long-term learning. Some people are capable of discovering and inventing new things. I think the two are related, and current NN architecture doesn’t allow this. An AI that can cobble together a CRUD application to spec is one thing. An AI that can come up with a new idea for a successful app on its own is a completely different ball game.

HarHarVeryFunny•1mo ago

> LLMs are literally technology that can only reproduce the past.

That's incorrect on many levels. They are drawing upon, and reproducing, language patterns from "the past", but they are combining those patterns in ways that may never have been seen before. They may not be truly creative, but they are still capable of generating novel outputs.

> They're cool, but they were way cooler 4 years ago.

Maybe this year has been more about incremental progress with LLMs than the shock/coolness factor of talking to an LLM for the first time, but the utility of them, especially for programming, has dramatically increased this year, really in the last 6 months.

The improvement in "AI" image and video generation has also been impressive, to the point now that fake videos on YouTube can often only be identified as such by common sense rather than the fact that they don't look real.

Incremental improvement can often be more impressive than innovation, whose future importance can be hard to judge when it first appears. How many people read "Attention is all you need" in 2017 and thought "Wow! This is going to change the world!"? Not even the authors of the paper thought that.

odiroot•1mo ago

I'm very relieved we've moved away from rewriting everything in Rust.

jll29•1mo ago

There's no reason not to use Rust for LLM-generated code in the longer term (other than lack of Rust code to learn from in the shorter term).

The stricter typing of Rust would make semantic errors in generated code surface more quickly than in e.g. Python, because with static typing there is a good chance that some of the semantic errors are also type violations.

yencabulator•1mo ago

> (other than lack of Rust code to learn from in the shorter term)

FWIW Claude Code is quite good at writing Rust, and Claude and Gemini are both surprisingly good at explaining complex third-party Rust libraries.

michaelcampbell•1mo ago

Have we though? I'm glad we're not shouting about it from the rooftops like it's some magical "win" button as much, but TBH the things I use routinely that HAVE been rewritten in rust are generally much better. That could also just be because they're newer and have the errors of the past to not repeat.

sanreau•1mo ago

> Vendor-independent options include GitHub Copilot CLI, Amp, OpenHands CLI, and Pi

...and the best of them all, OpenCode[1] :)

[1]: https://opencode.ai

simonw•1mo ago

Good call, I'll add that. I think I mentally scrambled it with OpenHands.

the_mitsuhiko•1mo ago

Thanks for adding pi to it though :)

nineteen999•1mo ago

How did I miss this until now! Thank you for sharing.

logicprog•1mo ago

I don't know why you're downvoted, OpenCode is by far the best.

d4rkp4ttern•1mo ago

Can OpenCode be used with the Claude Max or ChatGPT Pro subscriptions, i.e., without per-token API charges?

simonw•1mo ago

Apparently it does work with Claude Max: https://opencode.ai/docs/providers/#anthropic

I don't see a similar option for ChatGPT Pro. Here's a closed issue: https://github.com/sst/opencode/issues/704

williamstein•1mo ago

There's a plugin that evidently supports ChatGPT Pro with Opencode: https://github.com/sst/opencode/issues/1686#issuecomment-349...

ewoodrich•1mo ago

Yes, I use it with a regular Claude Pro subscription. It also supports using GitHub Copilot subscriptions as a backend.

the_mitsuhiko•1mo ago

> The (only?) year of MCP

I like to believe, but MCP is quickly turning into an enterprise thing so I think it will stick around for good.

simonw•1mo ago

I think it will stick around, but I don't think it will have another year where it's the hot thing it was back in January through May.

Alex-Programs•1mo ago

I never quite got what was so "hot" about it. There seems to be an entire parallel ecosystem of corporates that are just begging to turn AI into PowerPoint slides so that they can mould it into a shape that's familiar.

9dev•1mo ago

One reason may be that it makes it a lot easier to open up a product to AI. Instead of adding a bad ChatGPT UI clone into your app, you invert control and let external AI tools interact with your application and its data, thus giving your customers immediate benefits, while simultaneously sating your investors'/founders'/managers' desire to somehow add AI.

nrhrjrjrjtntbt•1mo ago

MCP or skills? Can a skill negate the need for MCP? In addition, there was a YC startup that is looking at searching docs for LLMs or similar. I think MCP may be less needed once you have skills, OpenAPI specs, and other things that LLMs can call directly.

MitziMoto•1mo ago

MCP isn't going anywhere. Some developers can't seem to see past their terminal or dev environment when it comes to MCP. Skills, etc do not replace MCP and MCP is far more than just documentation searching.

MCP is a great way for an LLM to connect to an external system in a standardized way and immediately understand what tools it has available, when and how to use them, what their inputs and outputs are, etc.

For example, we built a custom MCP server for our CRM. Now our voice and chat agents that run on elevenlabs infrastructure can connect to our system with one endpoint, understand what actions it can take, and what information it needs to collect from the user to perform those actions.

I guess this could maybe be done with webhooks or an API spec with a well crafted prompt? Or if ElevenLabs provided an executable environment with tool calling? But at some point you're just reinventing a lot of the functionality you get for free from MCP, and all major LLMs seem to know how to use MCP already.
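
To make the "what their inputs and outputs are" point concrete, here is a minimal sketch. It assumes a tool descriptor shaped like an entry in an MCP `tools/list` response (`name`, `description`, `inputSchema`); the `create_ticket` tool, its fields, and the `missing_arguments` helper are hypothetical and not taken from the CRM server described above.

```python
# Hypothetical tool descriptor, shaped like one entry of an MCP tools/list
# response. The tool and its fields are invented for illustration.
CREATE_TICKET_TOOL = {
    "name": "create_ticket",
    "description": "Open a support ticket for an existing customer.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "customer_email": {"type": "string"},
            "summary": {"type": "string"},
            "priority": {"type": "string", "enum": ["low", "normal", "high"]},
        },
        "required": ["customer_email", "summary"],
    },
}


def missing_arguments(tool: dict, arguments: dict) -> list[str]:
    """Return the required fields the agent still has to collect from the user."""
    required = tool["inputSchema"].get("required", [])
    return [field for field in required if field not in arguments]
```

With only a summary supplied, `missing_arguments(CREATE_TICKET_TOOL, {"summary": "Cannot log in"})` reports that `customer_email` is still needed, which is roughly the information a voice or chat agent would use to decide what to ask next.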

simonw•1mo ago

Yeah, I don't think I was particularly clear in that section.

I don't think MCP is going to go away, but I do think it's unlikely to ever achieve the level of excitement it had in early 2025 again.

If you're not building inside a code execution environment it's a very good option for plugging tools into LLMs, especially across different systems that support the same standard.

But code execution environments are so much more powerful and flexible!

I expect that once we come up with a robust, inexpensive way to run a little Bash environment - I'm still hoping WebAssembly gets us there - there will be much less reason to use MCP even outside of coding agent setups.

brabel•1mo ago

I disagree. MCP will remain the best way to do most things for the same reason REST APIs are the main way to access non local services: they provide a way to secure and audit access to systems in a way that a coding environment cannot. And you can authorize actions depending on the well defined inputs and outputs. You can’t do that using just a bash script unless said script actually does SSO and calls REST APIs but then you just have a worse MCP client without any interoperability.

the_mitsuhiko•1mo ago

I find it very hard to pick winners and losers in this environment where everything changes so quickly. Right now a lot of people are using bash as a glue environment for agents, even if they are not for developers.

cloudking•1mo ago

For connecting agents to third-party systems I prefer CLI tools, less context bloat and faster. You can define the CLI usage in your agent instructions. If the MCP you're using doesn't exist as a CLI, build one with your agent.
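
As a rough illustration of the CLI approach, the sketch below wraps a made-up lookup in a tiny `argparse` command. Everything here is an assumption for the example: the `crm` program name, the `get-customer` subcommand, and the in-memory `FAKE_CUSTOMERS` table stand in for whatever third-party system the agent actually needs.

```python
import argparse
import json

# Stand-in data; a real tool would call the third-party system here.
FAKE_CUSTOMERS = {"42": {"name": "Ada", "plan": "pro"}}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="crm", description="Look up CRM records.")
    sub = parser.add_subparsers(dest="command", required=True)
    get = sub.add_parser("get-customer", help="Print one customer as JSON.")
    get.add_argument("customer_id")
    return parser


def run(argv: list[str]) -> str:
    """Parse the arguments and return the JSON text the CLI would print."""
    args = build_parser().parse_args(argv)
    record = FAKE_CUSTOMERS.get(args.customer_id)
    if record is None:
        return json.dumps({"error": "not found"})
    return json.dumps(record)
```

The agent instructions then only need a line such as "run `crm get-customer <id>` to fetch a customer as JSON", and the `--help` output generated by `argparse` covers the rest on demand.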

martinald•1mo ago

Totally agree - wrote this over the holidays which sums it all up pretty well https://martinalderson.com/posts/why-im-building-my-own-clis...

npalli•1mo ago

Great summary of the year in LLMs. Is there a predictions (for 2026) blogpost as well?

simonw•1mo ago

Given how badly my 2025 predictions aged I'm probably going to sit that one out! https://simonwillison.net/2025/Jan/10/ai-predictions/

DANmode•1mo ago

Don’t be a bad sport, now!!

zahlman•1mo ago

Making predictions is useful even when they turn out very wrong. Consider also giving confidence levels, so that you can calibrate going forward.

jjude•1mo ago

I use predictions to prepare rather than to plan.

Planning depends on a deterministic view of the future. I used to plan (esp. annual plans) until about 5 years ago. Now I scan for trends and prepare myself for different scenarios that can come in the future. Even if you get it approximately right, you stand apart.

For tech trends, I read Simon, Benedict Evans, Mary Meeker etc. Simon is in a better position to make these predictions than anyone else, having closely analyzed these trends over the last few years.

Here I wrote about my approach: https://www.jjude.com/shape-the-future/

aussieguy1234•1mo ago

> The year of YOLO and the Normalization of Deviance #

On this, including AI agents deleting home folders: I was able to run agents in Firejail by isolating vscode (most of my agents are vscode-based ones, like Kilo Code).

I wrote a little guide on how I did it https://softwareengineeringstandard.com/2025/12/15/ai-agents...

Took a bit of tweaking, vscode crashing a bunch of times with not being able to read its config files, but I got there in the end. Now it can only write to my projects folder. All of my projects are backed up in git.

NitpickLawyer•1mo ago

I have a bunch of tabs opened on this exact topic, so thank you for sharing. So far I've been using devcontainers w/ vscode, and mostly having a blast with it. It is a bit awkward since some extensions need to be installed in the remote env, but they seem to play nicely after you have it set up, and the keys and stuff get populated so things like kilocode, cline, roo work fine.

agentifysh•1mo ago

What amazing progress in such a short time. The future is bright! Happy New Year y'all!

sho_hn•1mo ago

Not in this review: Also the record year in intelligent systems aiding in and prompting human users into fatal self-harm.

Will 2026 fare better?

simonw•1mo ago

I really hope so.

The big labs are (mostly) investing a lot of resources into reducing the chance their models will trigger self-harm and AI psychosis and suchlike. See the GPT-4o retirement (and resulting backlash) for an example of that.

But the number of users is exploding too. If they make things 5x less likely to happen but sign up 10x more people it won't be good on that front.
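
The arithmetic behind that worry is simple enough to sketch. The 5x and 10x figures are the hypothetical numbers from the sentence above, not measured data, and `relative_incidents` is a made-up helper name.

```python
def relative_incidents(risk_reduction: float, user_growth: float) -> float:
    """Expected incident count relative to today.

    Incidents scale with (number of users) * (per-user risk), so growing
    users by `user_growth` while dividing risk by `risk_reduction` changes
    the total by user_growth / risk_reduction.
    """
    return user_growth / risk_reduction


# 5x less likely per user, 10x more users: twice as many incidents overall.
print(relative_incidents(risk_reduction=5, user_growth=10))  # 2.0
```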

Nuzzerino•1mo ago

How does a model “trigger” self-harm? Surely it doesn’t catalyze the dissatisfaction with the human condition, leading to it. There’s no reliable data that can drive meaningful improvement there, and so it is merely an appeasement op.

Same thing with “psychosis”, which is a manufactured moral panic crisis.

If the AI companies really wanted to reduce actual self harm and psychosis, maybe they’d stop prioritizing features that lead to mass unemployment for certain professions. One of the guys in the NYT article for AI psychosis had a successful career before the economy went to shit. The LLM didn’t create those conditions, bad policies did.

It’s time to stop parroting slurs like that.

falkensmaize•1mo ago

‘How does a model “trigger” self-harm?’

By telling paranoid schizophrenics that their mother is secretly plotting against them and telling suicidal teenagers that they shouldn’t discuss their plans with their parents. That behavior from a human being would likely result in jail time.

measurablefunc•1mo ago

The people working on this stuff have convinced themselves they're on a religious quest so it's not going to get better: https://x.com/RobertFreundLaw/status/2006111090539687956

andai•1mo ago

Also essential self-fulfilment.

But that one doesn't make headlines ;)

sho_hn•1mo ago

Sure -- but that's fair game in engineering. I work on cars. If we kill people with safety faults I expect it to make more headlines than all the fun roadtrips.

What I find interesting with chat bots is that they're "web apps" so to speak, but with safety engineering aspects that type of developer is typically not exposed to or familiar with.

simonw•1mo ago

One of the tough problems here is privacy. AI labs really don't want to be in the habit of actively monitoring people's conversations with their bots, but they also need to prevent bad situations from arising and getting worse.

walt_grata•1mo ago

Until AI labs have the equivalent of an SLA for giving accurate and helpful responses, it won't get better. They're not even able to measure if the agents work correctly and consistently.

websiteapi•1mo ago

I'm curious how all of the progress will be seen if it does indeed result in mass unemployment (but not eradication) of professional software engineers.

simonw•1mo ago

I nearly added a section about that. I wanted to contrast the thing where many companies are reducing junior engineering hires with the thing where Cloudflare and Shopify are hiring 1,000+ interns. I ran out of time and hadn't figured out a good way to frame it though so I dropped it.

ori_b•1mo ago

My prediction: If we can successfully get rid of most software engineers, we can get rid of most knowledge work. Given the state of robotics, manual labor is likely to outlive intellectual labor.

beardedwizard•1mo ago

"Given the state of robotics" reminds me a lot of what was said about llms and image/video models over the past 3 years. Considering how much llms improved, how long can robotics be in this state?

I have to think 3 years from now we will be having the same conversation about robots doing real physical labor.

"This is the worst they will ever be" feels more apt.

chii•1mo ago

But robotics already had the means to do the majority of the physical labour - it's just not worth the money to replace humans, as human labour is cheap (and flexible - more than robots).

With knowledge work being less high-paying, physical labour supply should increase as well, which drops its price. This means it's actually less likely that the advent of LLMs will make physical labour more automated.

Davidzheng•1mo ago

Robotics is coming FAST. Faster than LLM progress in my opinion.

wh0knows•1mo ago

Curious if you have any links about the rapid progression of robotics (as someone who is not educated on the topic).

It was my feeling with robotics that the more challenging aspect will be making them economically viable rather than simply the challenge of the task itself.

beardedwizard•1mo ago

I mentioned military in my reply to the sibling comment - that is the most ready example. What anduril and others are doing today may be sloppy, but it's moving very quickly.

throw1235435•1mo ago

The question is how rapid the adoption is. The price of failure in the real world is much higher ($$$, environmental, physical risks) vs just "rebuild/regenerate" in the digital realm.

beardedwizard•1mo ago

Military adoption is probably a decent proxy indicator - and they are ready to hand the kill switch to autonomous robots

throw1235435•1mo ago

Maybe. There the cost of failure again is low. It's easier to destroy than to create. Economic disruption to workers will take a bit longer I think.

Don't get me wrong; I hope that we do see it in physical work as well. There is more value to society there; and consists of work that is risky and/or hard to do - and is usually needed (food, shelter, etc). It also means that the disruption is an "everyone" problem rather than something that just affects those "intellectual" types.

BobbyJo•1mo ago

I would have agreed with this a few months ago, but something I've learned is that the ability to verify an LLM's output is paramount to its value. In software, you can review its output and add tests, on top of other adversarial techniques, to verify the output immediately after generation.

With most other knowledge work, I don't think that is the case. Maybe actuarial or accounting work, but most knowledge work exists at a cross section of function and taste, and the latter isn't an automatically verifiable output.

throw1235435•1mo ago

I also believe this - I think it will probably just disrupt software engineering and any other digital medium with mass internet publication (i.e. things RLVR can use). For the short term future it seems to need a lot of data to train on, and no other profession has posted the same amount of verifiable material. The open source altruism has disrupted the profession in the end; just not in the way people first predicted. I don't think it will disrupt most knowledge work for a number of reasons. Most knowledge professions have "credentials" (i.e. gatekeeping) and they can see what is happening to SWEs and are acting accordingly. I'm hearing it firsthand at least locally in things like law, even accounting, etc. Society will ironically respect these professions more for doing so.

Any data, verifiability, rules of thumb, tests, etc are being kept secret. You pay for the result, but don't know the means.

coffeebeqn•1mo ago

I mean law and accounting usually have a “right” answer that you can verify against. I can see a test data set being built for most professions. I’m sure open source helps with programming data but I doubt that’s even the majority of their training. If you have a company like Google you could collect data on decades of software work in all its dimensions from your workforce

District5524•1mo ago

It's not about invalidating your conclusion, but I'm not so sure about law having a right answer. At a very basic level, like hypothetical conduct used in basic legal training materials or MCQs, or in criminal/civil code based situations in well-abstracting Roman law-based jurisdictions, definitely. But the actual work, at least for most lawyers, is to build on many layers of such abstractions to support your/client's viewpoint. And that level is already about persuasion of other people, not having the "right" legal argument or applying the most correct case found. And this part is not documented well, and approaches change a lot, even if the law remains the same. Think of family law or law of succession - it does not change much over centuries, but every day, worldwide, millions of people spend huge amounts of money and energy on finding novel ways to turn those same paragraphs to their advantage and put their "loved" ones and relatives in a worse position.

throw1235435•1mo ago

Not really. I used to think it would be more general with the first generation of LLMs, but given that all progress since o1 is RL-based, I'm thinking most disruption will happen in open productive domains and not closed domains. Speaking to people in these professions, they don't think SWEs have any self-respect, and so in your example of law:

* Context is debatable/result isn't always clear: The way to interpret that/argue your case is different (i.e. you are paying for a service, not a product)

* Access to vast training data: It's very unlikely that they will train you and give you data to their practice, especially as they are already in a union-like structure/accreditation. It's like paying for a binary (a non-decompilable one) without source code (the result) rather than the source and the validation the practitioner used to get there.

* Variability of real world actors: There will be novel interpretations that invalidate the previous one as new context comes along.

* Velocity vs ability to make judgement: As a lawyer I prefer to be paid higher for less velocity since it means less judgement/less liability/less risk overall for myself and the industry. Why would I change that even at an individual level? Less problem of the commons here.

* Tolerance to failure is low: You can't iterate, get feedback and try again until "the tests pass" in a court room unlike "code on a text file". You need to have the right argument the first time. AI/ML generally only works where the end cost of failure is low (i.e. can try again and again to iron out error terms/hallucinations). It's also why I'm skeptical AI will do much in the real economy even with robots soon - failure has bigger consequences in the real world ($$$, lives, etc).

* Self employment: There is no tension between say Google shareholders and its employees as per your example - especially for professions where you must trade in your own name. Why would I disrupt myself? The cost I charge is my profit.

TL;DR: Gatekeeping, changing context, and arms race behavior between participants/clients. Unfortunately I do think software, art, videos, translation, etc are unique in that there's numerous examples online and has the property "if I don't like it just re-roll" -> to me RLVR isn't that efficient - it needs volumes of data to build its view. Software sadly for us SWE's is the perfect domain for this; and we as practitioners of it made it that way through things like open source, TDD, etc and giving it away free on public platforms in numerous quantities.

JumpCrisscross•1mo ago

> If we can successfully get rid of most software engineers, we can get rid of most knowledge work

Software, by its nature, is practically comprehensively digitized, both in its code history as well as requirements.

9dev•1mo ago

That’s the deep irony of technology IMHO, that innovation follows Conway's law on a meta layer: White collar workers inevitably shaped high technology after themselves, and instead of finally ridding humanity of hard physical labour—as was the promise of the Industrial Revolution—we imitate artists, scientists, and knowledge workers.

We can now use natural language to instruct computers to generate stock photos and illustrations that would have taken a professional artist to produce a few years ago, discover new molecule shapes, beat the best Go players, build the code for entire applications, or write documents of various shapes and lengths. But painting a wall? An unsurmountable task that requires a human to execute reliably, not even talking about economics.

Madmallard•1mo ago

Why would it?

The ability to accurately describe what you want with all constraints managed and with proactive design is the actual skill. Not programming. The day PMs can do that and have LLMs that can code to that, is the day software engineers en masse will disappear. But that day is likely never.

The non-technical people I've ever worked for were hopelessly terrible at attention to detail. They're hiring me primarily for that anyway.

legulere•1mo ago

Even if it will make software engineering drastically more productive, it’s questionable that this will lead to unemployment. Efficiency gains translate to lower prices. Sometimes this leads to very little additional demand, as can be seen with the masses of typesetters that lost their jobs. Sometimes this leads to dramatically higher demand, as you can see in the classic Jevons paradox examples of coal and light bulbs. I highly suspect software falls in the latter category.

kingstnap•1mo ago

Software demand is philosophically limited by the question of "What can your computer do for you?"

You can describe that somewhat formally as:

{What your computer can do} intersect {What you want done (consciously or otherwise)}

Well, a computer can technically calculate any computable task that fits in bounded memory. That is an enormous set, so its real limitations are its interfaces: it can send packets, make noises, and display images.

How many human desires are things that can be solved with making noises, displaying images, and sending packets? Turns out quite a few, but it's not everything.

Basically I'm saying we should hope more sorts of physical interfaces come around (like VR and Robotics) so we cover more human desires. Robotics is a really general physical interface (like how IP packets are an extremely general interface) so it's pretty promising if it pans out.
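
The intersection framing can be written down directly with sets. This is only a toy sketch: the interface names come from the comment above, while the entries in `you_want` are invented examples.

```python
# What today's computers can do, reduced to their interfaces.
computer_can_do = {"send packets", "make noises", "display images"}

# Invented examples of things a person might want done.
you_want = {"display images", "make noises", "paint a wall", "sit by a campfire"}

# Demand for software is the overlap between the two sets.
software_demand = computer_can_do & you_want

# Desires no current interface reaches.
unmet = you_want - computer_can_do

# A new physical interface (e.g. robotics) enlarges the first set,
# and with it the overlap.
with_robotics = (computer_can_do | {"paint a wall"}) & you_want
```

Here `software_demand` holds two items, and adding a single physical capability grows the overlap to three, which is the sense in which new interfaces "cover more human desires".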

Personally, I find it very hard to even articulate what desires I have. I have this feeling that I might be substantially happier if I was just sitting around a campfire eating food and chatting with people instead of enjoying whatever infinite stuff a super intelligent computer and robots could do for me. At least some of the time.

fullstackchris•1mo ago

This overly discussed thesis is already laughable - decent LLMs have been out for 3 years now and unemployment (using US as example) is up around 1% over the same time frame - and even attributing that small percentage change completely to AI is also laughable

DrewADesign•1mo ago

You’re absolutely right! You astutely observed that 2025 was a year with many LLMs and this was a selection of waypoints, summarized in a helpful timeline.

That’s what most non-tech-person’s year in LLMs looked like.

Hopefully 2026 will be the year where companies realize that implementing intrusive chatbots can’t make better ::waving hands:: ya know… UX or whatever.

For some reason, they think it’s helpful to distractingly pop up chat windows on their site because their customers need textual kindergarten handholding to … I don’t know… find the ideal pocket comb for their unique pocket/hair situation, or have an unlikely question about that aerosol pan release spray that a chatbot could actually answer. Well, my dog also thinks she’s helping me by attacking the vacuum when I’m trying to clean. Both ideas are equally valid.

And spending a bazillion dollars implementing it doesn’t mean your customers won’t hate it. And forcing your customers into pathways they hate because of your sunk costs mindset means it will never stop costing you more money than it makes.

I just hope companies start being honest with themselves about whether or not these things are good, bad, or absolutely abysmal for the customer experience and cut their losses when it makes sense.

Night_Thastus•1mo ago

They need to be intrusive and shoved in your face. This way, they can say they have a lot of people using them, which is a good and useful metric.

ronsor•1mo ago

> For some reason, they think its helpful to distractingly pop up chat windows on their site...

Companies have been doing this "live support" nonsense far longer than LLMs have been popular.

DrewADesign•1mo ago

There was also point source pollution before the Industrial Revolution. Useless, forced, irritating chat was ‘nowhere close’ to as aggressive or pervasive as it is now. It used to be a niche feature of some CRMs and now it’s everywhere.

I’m on LinkedIn Learning digging into something really technical and practical and it’s constantly pushing the chat fly out with useless pre-populated prompts like “what are the main takeaways from this video.” And they moved their main page search to a little icon on the title bar and sneakily now what used to be the obvious, primary central search field for years sends a prompt to their fucking chatbot.

zahlman•1mo ago

As much as I side with you on this one, I really don't think this submission is the right place to rant about it.

jennyholzer3•1mo ago

this thread is for pro-LLM propaganda only.

do not acknowledge that everyone in the world thinks this shit is a complete and total garbage fire

fantasizr•1mo ago

I took the good with the bad: the ai assisted coding tools are a multiplier, google ai overviews in search results are half baked (at best) and often just factually wrong. AI was put in the instagram search bar for no practical purpose etc.

DrewADesign•1mo ago

Yeah totally. The point I’m trying to make, however, is that most people don’t code, so they didn’t get the multiplier, and only got the mediocre-to-bad, with a handful of them doing things like generating dumb images for a boost. I think that’s why a lot of people in the software business are utterly bewildered when customers aren’t jumping for joy when they release a new AI “feature.” I think a lot of what gets classified as cynical ceo enshittification is really people ignoring basic good design practices, like making sure you’re effectively helping customers solve an actual problem in a context and with methods they, at least, don’t hate. Especially on the smaller scale, like indie app developers who probably get more out of AI than most, they really think people are going to like new AI features simply because they’re new AI features. They’re very wrong.

techpression•1mo ago

Nothing about the severe impact on the environment, and the hand waviness about water usage hurt to read. The referenced post was missing every single point about the issue by making it global instead of local. And as if data center buildouts are properly planned and dimensioned for existing infrastructure…

Add to this that all the hardware is already old and the amount of waste we’re producing right now is mind boggling, and for what, fun tools for the use of one?

I don’t live in the US, but the amount of tax money being siphoned to a few tech bros should have heads rolling and I really don’t want to see it happening in Europe.

But I guess we got a new version number on a few models and some blown up benchmarks so that’s good, oh and of course the svg images we will never use for anything.

simonw•1mo ago

"Nothing about the severe impact on the environment"

I literally said:

"AI data centers continue to burn vast amounts of energy and the arms race to build them continues to accelerate in a way that feels unsustainable."

AND I linked to my coverage from last year, which is still true today (hence why I felt no need to update it): https://simonwillison.net/2024/Dec/31/llms-in-2024/#the-envi...

jennyholzer3•1mo ago

Do you think anything should be done about this environmental impact?

Or should we just keep chugging along as though there is no problem at all?

simonw•1mo ago

I think we should continue to find ways to serve this stuff more efficiently - already a big focus of the AI labs because they like making money, and reduced energy bills = more profitable inference.

I also think we should use tax policy to provide financial incentives to reduce the environmental impact - tax breaks for renewables, tax hikes for fossil fuel powered data centers, that kind of thing.

smileson2•1mo ago

forgot to mention the first murder-suicide instigated by chatgpt

DANmode•1mo ago

These are his highlights as a killer blogger,

not AI’s highlights.

Easy with the hot take.

jennyholzer3•1mo ago

correction: the first murder-suicide instigated by chatgpt on record

didip•1mo ago

Indeed. I don't understand why Hacker News is so dismissive about the coming of LLMs, maybe HN readers are going through 5 stages of grief?

But LLM is certainly a game changer, I can see it delivering impact bigger than the internet itself. Both require a lot of investments.

cebert•1mo ago

Many people feel threatened by the rapid advancements in LLMs, fearing that their skills may become obsolete, and in turn act irrationally. To navigate this change effectively, we must keep open minds, keep adaptable, and embrace continuous learning.

nickphx•1mo ago

rapid advancements in what? hallucinations..? FOMO marketing? certainly nothing productive.

chii•1mo ago

> in turn act irrationally

It isn't irrational to act in self-interest. If LLMs threaten someone's livelihood, it matters not one bit that they help humanity overall - they will oppose it. I don't blame them. But I also hope that they cannot succeed in opposing it.

Davidzheng•1mo ago

It's irrational to genuinely hold false beliefs about capabilities of LLMs. But at this point I assume around half of the skeptics are emotionally motivated anyway.

jdhsgsvsbzbd•1mo ago

As opposed to those who have skin in the game for LLMs and are blind to their flaws???

I'd assume that around half of the optimists are emotionally motivated this way.

mrwrong•1mo ago

everybody is emotionally motivated, you included

rgoulter•1mo ago

Many comments discussing LLMs involve emotions, sure. :) Including, obviously, comments in favour of LLMs.

But most discussion I see is vague and without specificity and without nuance.

Recognising the shortcomings of LLMs makes comments praising LLMs that much more believable; and recognising the benefits of LLMs makes comments criticising LLMs more believable.

I'd completely believe anyone who says they've found the LLM very helpful at greenfield frontend tasks, and I'd believe someone who found the LLM unable to carry out subtle refactors on an old codebase in a language that's not Python or JavaScript.

reppap•1mo ago

I'm not so much threatened by LLMs taking my job as by them taking away my sanity. Every time I tell someone no and they come back to me with a "but Copilot said.." followed by something entirely incorrect, it makes me want to autodefenestrate.

callc•1mo ago

I am happy “autodefenestrate” is the first new word I learned in 2026. Thank you.

Autodefenestrate - To eject or hurl oneself from a window, especially lethally

snigsnog•1mo ago

The internet and smartphones were immediately useful in a million different ways for almost every person. AI is not even close to that level. Very to somewhat useful in some fields (like programming) but the average person will easily be able to go through their day without using AI.

The most wide-appeal possibility is people loving 100%-AI-slop entertainment like that AI Instagram Reels product. Maybe I'm just too disconnected from normies but I don't see this taking off. Fun as a novelty like those Ring cam vids but I would never spend all day watching AI generated media.

JumpCrisscross•1mo ago

> AI is not even close to that level

Kagi’s Research Assistant is pretty damn useful, particularly when I can have it poll different models. I remember when the first iPhone lacked copy-paste. This feels similar.

(And I don’t think we’re heading towards AGI.)

SgtBastard•1mo ago

… the internet was not immediately useful in a million different ways for almost every person.

Even if you skip ARPANET, you’re forgetting the Gopher days, and even if you jump straight to WWW+email==the internet, you’re forgetting the Mosaic days.

The applications that became useful to the masses emerged a decade+ after the public internet and even then, it took 2+ decades to reach anything approaching saturation.

Your dismissal is not likely to age well, for similar reasons.

chii•1mo ago

the "usefulness" excuse is irrelevant, and the claim that phones/internet is "immediately useful" is just a post hoc rationalization. It's basically trying to find a reasonable reason why opposition to AI is valid, and is not in self-interest.

The opposition to AI is from people who feel threatened by it, either because it threatens their livelihood (or that of their family/friends), or because they feel they are unable to benefit from AI in the same way as they did from the internet and mobile phones.

duchef•1mo ago

The usefulness of mobile phones was identifiable immediately and it is absolutely not 'post hoc rationalization'. The issue was the cost - once low cost mobile telephones were produced they almost immediately became ubiquitous (see Nokia's share price from the release of the Nokia 6110 onwards, for example).

This barrier does not exist for current AI technologies which are being given away free. Minor thought experiment - just how radical would the uptake of mobile phones have been if they were given away free?

jfyi•1mo ago

It's only low cost for general usage chat users. If you are using it for anything beyond that, you are paying or sitting in a long queue (likely both).

You may just be a little early to the renaissance. What happens when the models we have today run on a mobile device?

The Nokia 6110 was released 15 years after the first commercial cell phone.

duchef•1mo ago

Yes although even those people paying are likely still being subsidized and not currently paying the full cost.

Interesting thought about current SOTA models running on my mobile device. I've given it some thought and I don't think it would change my life in any way. Can you suggest some way that it would change yours?

jfyi•1mo ago

It will open up access to LLMs for developers in the same way smartphones opened up access to general-purpose mobile computing.

I really think most everyone misses the actual potential of LLMs. They aren't an app but an interface.

They are the new UI everyone has known they wanted going back as long as we've had computers. People wanted to talk to the computer and get results.

Think of the people already using them instead of search engines.

To me, and likely you, it doesn't add any value. I can get the same information at about the same speed as before with the same false positives to weed through.

To the person that couldn't use a search engine and filled the internet with easily answered questions before, it's a godsend. They can finally ask the internet in plain ole whatever language they use and get an answer. It can be hard to see, but this is the majority of people on this planet.

LLMs raise the floor of information access. When they become ubiquitous and basically free, people will forget they ever had to use a mouse or hunt for the right pixel to click a button on a tiny mobile device touch screen.

duchef•1mo ago

I think that's a nice reply and these products becoming the future of user computer interface is possible.

I can imagine them generating digital reality on the fly for users - no more dedicated applications, just pure creation on demand ('direct me via turn by turn 3d navigation to x then y and z', 'replay that goal that just was scored and overlay the 3 most recent similar goals scored like that in the bottom right corner of the screen', 'generate me a 3D adventure game to play in the style of zelda, but make it about gnomes').

I suspect the only limitation for a product like this is energy and compute.

qualifck•1mo ago

Eh, quite the contrary. A lot of anti-AI people genuinely wanted to use AI but ran into the factual reality of the limitations of the software. It's not that it's going to take my job, it's that I was told it would redefine how I do work and is exponentially improving, only to find out that it just kind of sucks and hasn't gotten much better this year.

staticassertion•1mo ago

> Very to somewhat useful in some fields (like programming) but the average person will easily be able to go through their day without using AI.

I know a lot of "normal" people who have completely replaced their search engine with AI. It's increasingly a staple for people.

Smartphones were absolutely NOT immediately useful in a million different ways for almost every person, that's total revisionist history. I remember when the iPhone came out, it was AT&T only, it did almost nothing useful. Smartphones were a novelty for quite a while.

brabel•1mo ago

I agree with most points but as a tech enthusiast, I was using a smart phone years before the iPhone, and I could already use the internet, make video calls, email etc around 2005. It was a small flip phone but it was not uncommon for phones to do that already at that time, at least in Australia and parts of Asia (a Singaporean friend told me about the phone).

nen-nomad•1mo ago

ChatGPT has roughly 800 million weekly active users. Almost everyone around me uses it daily. I think you are underestimating the adoption.

throw1235435•1mo ago

How many pay? And out of those, how many are willing to pay enough to at least cover the inference costs (i.e., not as a loss leader)?

Outside the verifiable domains I think the impact is more assistance/augmentation than outright disruption (i.e. a novelty which is still nice). A little tiny bit of value sprinkled over a very large user base but each person deriving little value overall.

Even when they use it as search it is at best an incremental improvement on what they used to do - not life changing.

danielbln•1mo ago

Even my mom and aunts are using it frequently for all sorts of things, and it took a long time for them to hop onto internet and smartphones at first.

mrweasel•1mo ago

The adoption is just so weird to me. I cannot for the life of me get an LLM chatbot to work for me. Every time I try I get into an argument with the stupid thing. They are still wrong constantly, and when I'm wrong they won't correct me.

I have great faith in AI in e.g. medical equipment, or otherwise as something built in, working on a single problem in the background, but the chat interface is terrible.

dragonwriter•1mo ago

“Almost everyone will use it at free or effectively subsidized prices” and “It delivers utility which justifies its variable costs + fixed costs amortized over useful lifetime” are not the same thing, and it's not clear how much of the use is tied to novelty, such that if new and progressively more-expensive-to-train releases at a regular cadence dropped off, usage, even at subsidized prices, would, too.

arctic-true•1mo ago

Usage plunges on the weekends and during the summer, suggesting that a significant portion of users are students using ChatGPT for free or at heavily subsidized rates to do homework (i.e., extremely basic work that is extraordinarily well-represented in the training data). That usage will almost certainly never be monetizable, and it suggests nothing about the trajectory of the technology’s capability or popularity. I suspect ChatGPT, in particular, will see its usage slip considerably as the education system (hopefully) adapts.

simonw•1mo ago

The summer slump was a thing in 2023 but apparently didn't repeat in 2024: https://www.similarweb.com/blog/insights/ai-news/chatgpt-bea...

The weekend slumps could equally suggest people are using it at work.

arctic-true•1mo ago

Interesting, thank you for that. I’d be curious to see the data for 2025. I was basing my take off Google trends data - the kind of person who goes to ChatGPT by googling “chatGPT” seems to be using it less in the summer.

raincole•1mo ago

The early internet and smartphones (the Japanese ones, not iPhone) were definitely not "immediately" adopted by the masses, unlike LLMs.

If "immediate" usefulness is the metric we measure, then the internet and smartphones are pretty insignificant inventions compared to LLMs.

(of course it's not a meaningful metric, as there is no clear line between a dumb phone and a smart phone, or a moderately sized language model and an LLM)

tim333•1mo ago

Yeah the internet kind of started with ARPANET in 1969 and didn't really get going with the public till around 1999 so thirty years on.

Here's a graph of internet takeoff with Krugman's famous quote of 1998 that it wouldn't amount to much being maybe the end of the skepticism https://www.contextualize.ai/mpereira/paul-krugmans-poor-pre...

In common with AI there was probably a long period when the hardware wasn't really good enough for it to be useful to most people. I remember 300 baud modems and rubber things to try to connect to your telephone handset back in the 80s.

jheez3•1mo ago

That's all irrelevant. Is/was there tremendous value to be had by being able to transport data? Of course. No doubt about it. Everything else got figured out and investments were made because of that.

The same line of thinking does not hold with LLMs given their non-deterministic nature. Time will tell where things land.

tim333•1mo ago

There's value in intelligence too.

jheez3•1mo ago

Intelligence? No. Get the wording right. It’s driven by probability.

threethirtytwo•1mo ago

Calling intelligence “just probability” is like calling music “just vibrations” and thinking you said something deep.

fragmede•1mo ago

> The internet and smartphones were immediately useful in a million different ways for almost every person. AI is not even close to that level.

Those are some very rosy glasses you've got on there. The nascent Internet took forever to catch on. It was for weird nerds at universities and it'll never catch on, but here we are.

jheez3•1mo ago

Well until the weird nerds at uni created things like Google, Facebook and so on...

what-the-grump•1mo ago

A year after the iPhone came out… it didn’t have an App Store, barely was able to play video, barely had enough power to last a day. You just don’t remember or were not around for it.

A year after LLMs came out… are you kidding me?

Two years?

10 years?

Today, adding an MCP server to wrap the same API that’s been around forever for some system makes the users of that system prefer NLI over the GUI almost immediately.

zvolsky•1mo ago

The idea of HN being dismissive of impactful technology is as old as HN. And indeed, the crowd often appears stuck in the past with hindsight. That said, HN discussions aren't homogeneous, and as demonstrated by Karpathy in his recent blogpost "Auto-grading decade-old Hacker News", at least some commenters have impressive foresight: https://karpathy.bearblog.dev/auto-grade-hn/

brabel•1mo ago

So exactly 10 years ago a lot of people believed that the game Go would not be “conquered” by AI, but after just a few months it was. People will always be skeptical of new things, even people who are in tech, because many hyped things indeed go nowhere… while it may look obvious in hindsight, it’s really hard to predict what will and what won’t be successful. On the LLM front I personally think it’s extremely foolish to still consider LLMs as going nowhere. There’s a lot more evidence today of the usefulness of LLMs than there was of DeepMind being able to beat top human players in Go 10 years ago.

crystal_revenge•1mo ago

> I don't understand why Hacker News is so dismissive about the coming of LLMs

I find LLMs incredibly useful, but if you were following along the last few years, the promise was for “exponential progress” with a teaser of world-destroying superintelligence.

We objectively are not on that path. There is no “coming of LLMs”. We might get some incremental improvement, but we’re very clearly seeing sigmoid progress.

I can’t speak for everyone, but I’m tired of hyperbolic rants that are unquestionably not justified (the nice thing about exponential progress is you don’t need to argue about it)

aoeusnth1•1mo ago

We're very clearly seeing exponential progress - even above trend, on METR, whose slope keeps getting revised to a higher and higher estimate each time. Explain your perspective on the objective evidence against exponential progress?

llmslave2•1mo ago

Pretty neat how this exponential progress hasn't resulted in exponential productivity. Perhaps you could explain your perspective on that?

viraptor•1mo ago

Writing the code itself was never the main bottleneck. Designing the bigger solution, figuring out tradeoffs, talking to affected teams, etc. takes as much time as it used to. But still, there's definitely a significant improvement in the code production part in many areas.

aoeusnth1•1mo ago

It has! CLs/engineer increased by 10% this year.

LLMs from late 2024 were nearly worthless as coding agents, so given they have quadrupled in capability since then (exponential growth, btw), it's not surprising to see a modestly positive impact on SWE work.

Also, I'm noticing you're not explaining yourself :)

llmslave2•1mo ago

Hey, I'm not the OG commentator, why do I have to explain myself! :)

When Fernando Alonso (best rookie btw) goes from 0-60 in 2.4 seconds in his Aston Martin, is it reasonable to assume he will near the speed of light in 20 seconds?

lopatin•1mo ago

> Hey, I'm not the OG commentator, why do I have to explain myself! :)

The issue is that you're not acknowledging or replying to people's explanations for _why_ they see this as exponential growth. It's almost as if you skimmed through the meat of the comment and then just re-phrased your original idea.

> When Fernando Alonso (best rookie btw) goes from 0-60 in 2.4 seconds in his Aston Martin, is it reasonable to assume he will near the speed of light in 20 seconds?

This comparison doesn't make sense because we know the limits of cars but we don't yet know the limits of LLMs. It's an open question. Whether or not an F1 engine can make it to the speed of light in 20 seconds is not an open question.

llmslave2•1mo ago

It's not on me to somehow disprove claims of exponential growth when there isn't even evidence provided of it.

My point with the F1 comparison is that a short period of rapid improvement doesn't imply exponential growth, and it's about as weird to expect that as it is to expect an F1 car to reach the speed of light. It's possible, you know, the regulations are changing for next season - if Leclerc sets a new lap record in Australia by .1 ms we can just assume exponential improvements and surely Ferrari will be lapping the rest of the field by the summer, right?

aoeusnth1•1mo ago

There is already evidence provided of it! METR time horizons are going up on an exponential trend. This is literally the most famous AI benchmark and already mentioned in this thread.

https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...

https://metr.org/blog/2025-07-14-how-does-time-horizon-vary-...

aoeusnth1•1mo ago

If you're not going to explain yourself, at least stay on topic. We're talking about exponential growth, so address the points I'm making.

aoeusnth1•1mo ago

I'm noticing you're not responding to my claim that productivity has been impacted.

Madmallard•1mo ago

LLMs a year ago were more able to do a complex project I've repeatedly tried to do than they are now.

scotty79•1mo ago

Try Antigravity with Gemini 3 Pro. Seems very capable to me.

surajrmal•1mo ago

I think this is happening by raising the floor for job roles which are largely boilerplate work. If you are on the more skilled side or work in more original/ niche areas, AI doesn't really help too much. I've only been able to use AI effectively for scaling refactors, not really much in feature development. It often just slows me down when I try to use it. I don't see this changing any time soon.

HPMOR•1mo ago

I think this is an open question still and very interesting. Ilya discussed this on the Dwarkesh podcast. But the capabilities of LLMs is clearly exponential and perhaps super exponential. We went from something that could string together incoherent text in 2022 to general models helping people like Terence Tao and Scott Aaronson write new research papers. LLMs also beat IMO and the ICPC. We have entered the John Henry era for intellectual tasks...

llmslave2•1mo ago

> But the capabilities of LLMs is clearly exponential and perhaps super exponential

By what metric?

utopiah•1mo ago

BS metric... /s

jennyholzer3•1mo ago

Chat GPT told him it was true

aspenmartin•1mo ago

- Scaling laws (Chinchilla type)

- METR task horizon

It's a mix, performance gains are bursty but we have been getting a lot of bursts (RLVR, test-time compute, agentic breakthroughs)

tsimionescu•1mo ago

> LLMs also beat IMO and the ICPC

Very spurious claims, given that there was no effort made to check whether the IMO or ICPC problems were in the training set or not, or to quantify how far problems in the training set were from the contest problems. IMO problems are supposed to be unique, but since it's not at the frontier of math research, there is no guarantee that the same problem, or something very similar, was not solved in some obscure manual.

mgfist•1mo ago

Because that requires adoption. Devs on hackernews are already the most up to date folks in the industry and even here adoption of LLMs is incredibly slow. And a lot of the adoption that does happen is still with older tech like ChatGPT or Cursor.

belmont_sup•1mo ago

What’s the newer tech?

TeodorDyakov•1mo ago

Claude Code With Opus 4.5

scotty79•1mo ago

How long before introduction of computers lead to increases in average productivity? How long for the internet? Business is just slow to figure out how to use anything for its benefit, but it eventually gets there.

fmbb•1mo ago

> How long before introduction of computers lead to increases in average productivity?

I think it never did. Still has not.

https://en.wikipedia.org/wiki/Productivity_paradox

spectralista•1mo ago

The best example is that even ATM machines didn't reduce bank teller jobs.

Why? Because even the bank teller is doing more than taking and depositing money.

IMO there is an ontological bias that pervades our modern society that confuses the map for the territory and has a highly distorted view of human existence through the lens of engineering.

We don't see anything in this time series, because this time series itself is meaningless nonsense that reflects exactly this special kind of ontological stupidity:

https://fred.stlouisfed.org/series/PRS85006092

As if the sum of human interaction in an economy is some kind of machine that we just need to engineer better parts for and then sum the outputs.

Any non-careerist, thinking person that studies economics would conclude we don't and will probably not have the tools to properly study this subject in our lifetimes. The high dimensional interaction of biology, entropy and time. We have nothing. The career economist is essentially forced to sing for their supper in a type of time series theater. Then there is the method acting of pretending to be surprised when some meaningless reductionist aspect of human interaction isn't reflected in the fake time series.

barrenko•1mo ago

Sir, we're in a modern economy, we don't ever ever look at productivity graphs (this is not to disparage LLMs, just a comment on productivity in general)

viraptor•1mo ago

> exponential progress

First you need to define what it means. What's the metric? Otherwise it's very much something you can argue about.

noodletheworld•1mo ago

> What's the metric?

Language model capability at generating text output.

The model progress this year has been a lot of:

- “We added multimodal”

- “We added a lot of non AI tooling” (ie agents)

- “We put more compute into inference” (ie thinking mode)

So yes, there is still rapid progress, but these ^ make it clear, at least to me, that next gen models are significantly harder to build.

Simultaneously we see a distinct narrowing between players (openai, deepseek, mistral, google, anthropic) in their offerings.

Thats usually a signal that the rate of progress is slowing.

Remind me what was so great about gpt 5? How about gpt4 from from gpt 3?

Do you even remember the releases? Yeah. I dont. I had to look it up.

Just another model with more or less the same capabilities.

“Mixed reception”

That is not what exponential progress looks like, by any measure.

The progress this year has been in the tooling around the models, smaller faster models with similar capabilities. Multimodal add ons that no one asked for, because its easier to add image and audio processing than improve text handling.

That may still be on a path to AGI, but it is not an exponential path to it.

dragonwriter•1mo ago

> Language model capability at generating text output.

That's not a metric, that's a vague non-operationalized concept that could be operationalized into an infinite number of different metrics. And an improvement that was linear in one of those possible metrics would be exponential in another one (well, actually, one that was linear in one would also be linear in an infinite number of others, as well as being exponential in an infinite number of others).

That’s why you have to define an actual metric, not simply describe a vague concept of a kind of capacity of interest, before you can meaningfully discuss whether improvement is exponential. Because the answer is necessarily entirely dependent on the specific construction of the metric.
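A minimal sketch of that point, with made-up numbers: the yearly scores and the `2 ** a` re-scaling below are assumptions chosen purely for illustration, not any real benchmark. The same underlying improvement shows a constant yearly difference under one metric and a constant yearly ratio under a monotonic transform of it.

```python
# Hypothetical yearly scores on "metric A": +2 points every year (linear).
metric_a = [10 + 2 * year for year in range(6)]

# "Metric B" is an assumed monotonic re-scaling of A: B = 2 ** A.
# It ranks every year in the same order as A does.
metric_b = [2 ** a for a in metric_a]

# Linear growth shows up as a constant difference between consecutive years.
diffs_a = [later - earlier for earlier, later in zip(metric_a, metric_a[1:])]

# Exponential growth shows up as a constant ratio between consecutive years.
ratios_b = [later / earlier for earlier, later in zip(metric_b, metric_b[1:])]

print("metric A yearly differences:", diffs_a)
print("metric B yearly ratios:", ratios_b)
```

Both lists describe the same sequence of models; only the choice of scale decides whether the curve reads as a straight line or as a hockey stick.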

viraptor•1mo ago

> Language model capability at generating text output.

That's not a quantifiable sentence. Unless you put it in numbers, anyone can argue exponential/not.

> next gen models are significantly harder to build.

That's not how we judge capability progress though.

> Remind me what was so great about gpt 5? How about gpt4 from from gpt 3?

> Do you even remember the releases?

At gpt 3 level we could generate some reasonable code blocks / tiny features. (An example shown around at the time was "explain what this function does" for a "fib(n)") At gpt 4, we could build features and tiny apps. At gpt 5, you can often one-shot build whole apps from a vague description. The difference between them is massive for coding capabilities. Sorry, but if you can't remember that massive change... why are you making claims about the progress in capabilities?

> Multimodal add ons that no one asked for

Not only does multimodal input training improve the model overall, it's useful for (for example) feeding back screenshots during development.

aspenmartin•1mo ago

Exactly, gpt5 was unimpressive not because of its leap from GPT4 but because of expectations based on the string of releases since GPT4 (especially the reasoning models). The leap from 4->5 was actually massive.

threethirtytwo•1mo ago

I don’t think the path was ever exponential, but your claim here reads almost as if the slowdown hit an asymptote-like wall.

Most of the improvements are intangible. Can we truly say how much more reliable the models are? We barely have quantitative measurements on this so it’s all vibes and feels. We don’t even have a baseline metric for what AGI is and we invalidated the Turing test also based on vibes and feels.

So my argument is that part of the slowdown is in itself a hallucination because the improvement is not actually measurable or definable outside of vibes.

aspenmartin•1mo ago

I kind of agree in principle but there are a multitude of clever benchmarks that try to measure lots of different aspects like robustness, knowledge, understanding, hallucinations, tool use effectiveness, coding performance, multimodal reasoning and generation, etc. All of these have lots of limitations but they all paint a pretty compelling picture that complements the “vibes”, which are also important.

aoeusnth1•1mo ago

> Language model capability at generating text output.

How would you put this on a graph?

aspenmartin•1mo ago

Next gen models are always hard to build; they are by definition pushing the frontier. Every generation of CPU was hard to build but we still had Moore's law.

> Simultaneously we see a distinct narrowing between players (openai, deepseek, mistral, google, anthropic) in their offerings. Thats usually a signal that the rate of progress is slowing.

I agree with you on the fact in the first part but not the second part…why would convergence of performance indicate anything about the absolute performance improvements of frontier models?

> Remind me what was so great about gpt 5? How about gpt4 from from gpt 3? Do you even remember the releases? Yeah. I dont. I had to look it up.

3 -> 4 -> 5 were extraordinary leaps…not sure how one would be able to say anything else

> Just another model with more or less the same capabilities.

5 is absolutely not a model with more or less the same capabilities as gpt 4, what could you mean by this?

> “Mixed reception”

A mixed reception is an indication of model performance against a backdrop of market expectations, not against gpt 4…

> That is not what exponential progress looks like, by any measure.

Sure it is…exponential is a constant % improvement per year. We’re absolutely in that regime by a lot of measures

> The progress this year has been in the tooling around the models, smaller faster

Effective tool use is not some trivial add-on; it is a core capability for which we are on an exponential progress curve.

> models with similar capabilities. Multimodal add ons that no one asked for, because its easier to add image and audio processing than improve text handling.

This is definitely a personal feeling of yours; multimodal models are not something no one asked for…they are absolutely essential. Text data is essential, data curation is non-trivial and continually improving, and we are also hitting the ceiling of internet text data. Yet we use an incredible amount of synthetic data for RL and this continues to grow…you guessed it, exponentially. And multimodal data is incredibly information rich. Adding multimodality lifts all boats and provides core capabilities necessary for open world reasoning and even better text data (e.g. understanding charts and image context for text).

noodletheworld•1mo ago

> exponential is a constant % improvement per year

I suppose if you pick a low enough exponent then the exp graph is flat for a long time and you're right, zero progress is “exponential” if you cherry pick your growth rate to be low enough.

Generally though, people understand “exponential growth” as “getting better/bigger faster and faster in an obvious way”

> 3 -> 4 -> 5 were extraordinary leaps…not sure how one would be able to say anything else

They objectively were not.

The metrics and reception to them was very clear and overwhelming.

You're spitting some meaningless revisionist BS here.

You're wrong.

Thats all there is to it.

aspenmartin•1mo ago

It doesn’t sound like you are really interested in any sort of rational dialogue. Metrics were “objectively” not better? What are you talking about? Of course they were. Have you even looked at the progression for every benchmark we have?

You don’t understand what an exponential is or apparently what the benchmark numbers even are or possibly even how we actually measure model performance and the very real challenges and nuances involved but yet I’m “spitting some revisionist BS”. You have cited zero sources and are calling measured numbers “revisionist”.

You are also citing reception to models as some sort of indication of their performance, which is yet another confusing part of your reasoning.

I do agree that “metrics were very clear”; it just seems you don’t happen to understand what they are or what they mean.

scotty79•1mo ago

Define it however you like. There's not a single chart you can draw that even begins to look like a sigmoid.

nicbou•1mo ago

Time spent being human and enjoying life.

I can’t point at many problems it has meaningfully solved for me. I mean real problems, not tasks that I have to do for my employer. It seems like it just made parts of my existence more miserable, poisoned many of the things I love, and generally made the future feel a lot less certain.

scotty79•1mo ago

> but we’re very clearly seeing sigmoid progress.

Yeah, probably. But no chart actually shows it yet. For now we are firmly in the exponential zone of the sigmoid curve and can't really tell if it's going to end in a year, a decade or a century.
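A rough sketch of why the early data can't settle it, under assumed parameters (the ceiling, rate and midpoint below are arbitrary, not fitted to anything): well before its midpoint, a logistic curve is numerically almost identical to a plain exponential with the same growth rate.

```python
import math

def logistic(t, ceiling=1.0, rate=1.0, midpoint=20.0):
    """Sigmoid (logistic) curve that saturates at `ceiling`."""
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))

def exponential(t, ceiling=1.0, rate=1.0, midpoint=20.0):
    """Pure exponential with the same early growth rate, no saturation."""
    return ceiling * math.exp(rate * (t - midpoint))

# Compare the two curves long before the midpoint (t = 0..9, midpoint = 20).
early = range(0, 10)
rel_err = [abs(logistic(t) - exponential(t)) / exponential(t) for t in early]

print("largest early relative gap:", max(rel_err))
print("at the midpoint:", logistic(20.0), "vs", exponential(20.0))
```

With these assumed parameters the two curves differ by far less than any realistic measurement noise over the early window, and only separate clearly once the midpoint is near.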

utopiah•1mo ago

Doesn't even matter if the goal is extremely high. Talking about exponential when we clearly see matching energy needs proves there is no way we can maintain that pace without radical (and thus unpredictable) improvements.

My own "feeling" is that it's definitely not exponential but again, doesn't matter if it's unsustainable.

fullstackchris•1mo ago

I wrote an article complaining about the whole hype over a year ago:

https://chrisfrewin.medium.com/why-llms-will-never-be-agi-70...

Seems to be playing out that way.

senordevnyc•1mo ago

I’ve been reading this comment multiple times a week for the last couple years. Constant assertions that we’re starting to hit limits, plateau, etc. But a cursory glance at where we are today vs a year ago, let alone two years ago, makes it wildly obvious that this is bullshit. The pace of improvement of both models and tooling has been breathtaking. I could give a shit whether you think it’s “exponential”, people like you were dismissing all of this years ago, meanwhile I just keep getting more and more productive.

qualifck•1mo ago

People keep saying stuff like this. That the improvements are so obvious and breathtaking and astronomical and then I go check out the frontier LLMs again and they're maybe a tiny bit better than they were last year but I can't actually be sure bcuz it's hard to tell.

sometimes it seems like people are just living in another timeline.

senordevnyc•1mo ago

I’m genuinely curious what your “checking the frontier LLMs” looks like, especially if you haven’t used AI since last year.

jennyholzer3•1mo ago

"maybe a tiny bit better" is what you say when you've been tricked by snake oil salesman

This shit has gotten worse since 2023.

aspenmartin•1mo ago

> This shit has gotten worse since 2023.

I would really appreciate it if people could be specific when they say stuff like this because it's so crazy out of line with all measurement efforts. There are an insane amount of serious problems with current LLM / agentic paradigms, but the idea that things have gotten worse since 2023? I mean come on.

senordevnyc•1mo ago

You’re responding to a troll who just has a nasty, bitter axe to grind against AI. It’s honestly pretty sad and pathetic.

aspenmartin•1mo ago

You might want to be more specific because benchmarks abound and they paint a pretty consistent picture. LMArena "vibes" paint another picture. I don't know what you are doing to "check" the frontier LLMs but whatever you're doing doesn't seem to match more careful measurement...

You don't actually have to take people's word for it, read epoch.ai developments, look into the benchmark literature, look at ARC-AGI...

qualifck•1mo ago

That's half the problem though. I can see benchmarks. I can see number go up on some chart or that the AI scores higher on some niche math or programming test, but those results don't seem to actually connect much to meaningful improvements in daily usage of the software when those updates hit the public.

That's where the skepticism comes in, because one side of the discussion is hyping up exponential growth and the other is seeing something that looks more logarithmic instead.

I realize anecdotes aren't as useful as numbers for this kind of analysis, but there's such a wide gap between what people are observing in practice and what the tests and metrics are showing it's hard not to wonder about those numbers.

aspenmartin•1mo ago

I'm not sure I understand: we are _objectively on that path_ -- we are increasing exponentially on a number of metrics that may be imperfect but seem to paint a pretty consistent picture. Scaling laws are exponential. METR's time horizon benchmark is exponential. Lots of performance measures are exponential, so why do you say we're objectively not on that path?

> We might get some incremental improvement, but we’re very clearly seeing sigmoid progress.

again, if it is "very clear" can you point to some concrete examples to illustrate what you mean?

> I can’t speak for everyone, but I’m tired of hyperbolic rants that are unquestionably not justified (the nice thing about exponential progress is you don’t need to argue about it)

OK but what specifically do you have an issue with here?

tim333•1mo ago

>following along the last few years the promise was for “exponential progress”

I've been following for many years and the main exponential thing has been the Moore's law like growth in compute. Compute per dollar is probably the best tracking one and has done a steady doubling every couple of years or so for decades. It's exponential but quite a leisurely exponential.

The recent hype of the last couple of years is more dot com bubble like and going ahead of trend but will quite likely drop back.
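As a rough back-of-the-envelope sketch of that "leisurely exponential": the code below assumes a doubling period of exactly two years, which is only an approximation of "every couple of years or so".

```python
def growth_factor(years: float, doubling_period_years: float = 2.0) -> float:
    """Total multiplier after `years` of steady doubling."""
    return 2.0 ** (years / doubling_period_years)

def annual_growth_rate(doubling_period_years: float = 2.0) -> float:
    """Equivalent year-over-year growth rate as a fraction."""
    return 2.0 ** (1.0 / doubling_period_years) - 1.0

# Assumed two-year doubling period; horizons picked for illustration.
for years in (2, 10, 20, 40):
    print(f"{years:2d} years -> {growth_factor(years):,.0f}x")
print(f"per year: {annual_growth_rate():.1%}")
```

Under that assumption the yearly gain is modest (roughly 41%), yet it compounds to about a thousandfold over two decades.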

Night_Thastus•1mo ago

LLMs hold some real utility. But that real utility is buried under a mountain of fake hype and over-promises to keep shareholder value high.

LLMs have real limitations that aren't going away any time soon - not until we move to a new technology fundamentally different and separate from them - sharing almost nothing in common. There's a lot of 'progress-washing' going on where people claim that these shortfalls will magically disappear if we throw enough data and compute at it when they clearly will not.

Gigachad•1mo ago

Pretty much. What actually exists is very impressive. But what was promised and marketed has not been delivered.

rustystump•1mo ago

Markets never deliver. That isn't new. I do think LLMs are not far off from Google in terms of impact.

Search, as of today, is inferior to frontier models as a product. However, the best case still misses expected returns by miles, which is where the grousing comes from.

Generative art/AI is still up in the air for staying power, but I'd predict it isn't going away.

visarga•1mo ago

I think the missing ingredient is not something the LLMs lack, but something we as developers don't do - we need to constrain, channel, and guide agents by creating reactive test environments around them. Not vibes, but hard tests, they are the missing ingredient to coding agents. You can even use AI to write most of these tests but the end result depends on how well you structured your code to be testable.

If you inherit 9000 tests from an existing project you can vibe code a replacement on your phone in a holiday, like Simon Willison's JustHTML port. We are moving from agents semi-randomly flailing around to constraint satisfaction.
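A minimal sketch of what "hard tests, not vibes" could look like as an acceptance gate. Everything here is hypothetical: the test cases, the two candidate drafts, and the function names are invented for illustration and are not taken from any real project or agent framework.

```python
from typing import Callable, List, Sequence, Tuple

Case = Tuple[str, str]  # (input, expected output)

def failing_cases(candidate: Callable[[str], str], cases: Sequence[Case]) -> List[str]:
    """Run every case against the candidate and describe each failure."""
    failures = []
    for given, expected in cases:
        try:
            actual = candidate(given)
        except Exception as exc:  # a crash counts as a failure, not a pass
            failures.append(f"{given!r}: raised {type(exc).__name__}")
            continue
        if actual != expected:
            failures.append(f"{given!r}: expected {expected!r}, got {actual!r}")
    return failures

def accept(candidate: Callable[[str], str], cases: Sequence[Case]) -> bool:
    """Accept a candidate only if the whole suite passes."""
    return not failing_cases(candidate, cases)

# Hypothetical inherited suite for a whitespace-normalizing function.
CASES: List[Case] = [("a  b", "a b"), ("  a", "a"), ("a\tb\n", "a b"), ("", "")]

def draft_one(text: str) -> str:
    """Plausible-looking draft that only handles double spaces."""
    return text.replace("  ", " ")

def draft_two(text: str) -> str:
    """Draft that collapses all whitespace runs and trims the ends."""
    return " ".join(text.split())

for draft in (draft_one, draft_two):
    print(draft.__name__, "accepted" if accept(draft, CASES) else failing_cases(draft, CASES))
```

The failure messages are what would be fed back to the agent for the next attempt; the gate itself never relies on the agent's own claim that the code works.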

coffeebeqn•1mo ago

Yes and most of the investment has been kind of post-GPT4 betting that things will get exponentially more impressive

baq•1mo ago

I find opus 4.5 and gpt 5.2 mind blowing more often than I find them dumb as rocks. I don’t listen to or read any marketing material, I just use the tools. I couldn’t care less about what the promises are, what I have now available to me is fundamentally different from what I had in August and it changed completely how I work.

probably_wrong•1mo ago

Speaking for myself: because if the hype were to be believed we should have no relational databases when there's MongoDB, no need for dollars when there's cryptocoins, all virtual goods would be exclusively sold as NFTs, and we would be all driving self-driving cars by now.

LLMs are being driven mostly by grifters trying to achieve a monopoly before they run out of cash. Under those conditions I find their promises hard to believe. I'll wait until they either go broke or stop losing money left and right, and whatever is left is probably actually useful.

simonw•1mo ago

The way I've been handling the deafening hype is to focus exclusively on what the models that we have right now can do.

You'll note I don't mention AGI or future model releases in my annual roundup at all. The closest I get to that is expressing doubt that the METR chart will continue at the same rate.

If you focus exclusively on what actually works the LLM space is a whole lot more interesting and less frustrating.

magicalhippo•1mo ago

> focus exclusively on what the models that we have right now can do

I'm just a casual user, but I've been doing the same and have noticed the sharp improvements of the models we have now vs a year ago. I have OpenAI Business subscription through work, I signed up for Gemini at home after Gemini 3, and I run local models on my GPU.

I just ask them various questions where I know the answer well, or I can easily verify. Rewrite some code, factual stuff etc. I compare and contrast by asking the same question to different models.

AGI? Hell no. Very useful for some things? Hell yes.

asielen•1mo ago

It is an overcorrection because of all the empty promises of LLMs. I use Claude and ChatGPT daily at work and am amazed at what they can do and how far they have come.

BUT when I hear my executive team talk and see demos of "Agentforce" and every saas company becoming an AI company promising the world, I have to roll my eyes.

The challenge I have with LLMs is they are great at creating first draft shiny objects and the LLMs themselves over promise. I am handed half baked work created by non technical people that now I have to clean up. And they don't realize how much work it is to take something from a 60% solution to a 100% solution because it was so easy for them to get to the 60%.

Amazing, game changing tools in the right hands but also give people false confidence.

Not that they are not also useful for non-technical people but I have had to spend a ton of time explaining to copywriters on the marketing team that they shouldn't paste their credentials into the chat even if it tells them to and their vibe coded app is a security nightmare.

semilin•1mo ago

This seems like the right take. The claims of the imminence of AGI are exhausting and to me appear dissonant with reality. I've tried gemini-cli and Claude Code and while they're both genuinely quite impressive, they absolutely suffer from a kind of prototype syndrome. While I could learn to use these tools effectively for large-scale projects, I still at present feel more comfortable writing such things by hand.

The NVIDIA CEO says people should stop learning to code. Now if LLMs will really end up as reliable as compilers, such that they can write code that's better and faster than I can 99% of the time, then he might be right. As things stand now, that reality seems far-fetched. To claim that they're useless because this reality has not yet been achieved would be silly, but not more silly than claiming programming is a dead art.

vunderba•1mo ago

> I don't understand why Hacker News is so dismissive about the coming of LLMs.

Eh. I wouldn’t be so quick to speak for the entirety of HN. Several articles related to LLMs easily hit the front page every single day, so clearly there are plenty of HN users upvoting them.

I think you're just reading too much into what is more likely classic HN cynicism and/or fatigue.

ewoodrich•1mo ago

Exactly. There was a stretch of 6 months or so right after ChatGPT was released where approximately 50% of front page posts at any given time were related to LLMs. And these days every other Show HN is some kind of agentic dev tool and Anthropic/OpenAI announcements routinely get 500+ comments in a matter of hours.

utopiah•1mo ago

It's because both "sides" try to re-adjust.

When an "AI skeptic" sees a very positive AI comment, they try to argue that it is indeed interesting but nowhere near close to AI/AGI/ASI or whatever the hype at the moment uses.

When an "AI optimist" sees a very negative AI comment, they try to list all the amazing things they have done that they were convinced were impossible until then.

viraptor•1mo ago

Based on quite a few comments recently, it also looks like many have tried LLMs in the past, but haven't seriously revisited either the modern or more expensive models. And I get it. Not everyone wants to keep up to date every month, or burn cash on experiments. But at the same time, people seem to have opinions formed in 2024. (Especially if they talk about just hallucinations and broken code - tell the agent to search for docs and fix stuff) I'd really like to give them Opus 4.5 as an agent to refresh their views. There's lots to complain about, but the world has moved on significantly.

mirsadm•1mo ago

This has been the argument since day one. You just have to try the latest model, that's where you went wrong. For the record I use Claude Code quite a bit and I can't see much meaningful improvement from the last few models. It is a useful tool but its shortcomings are very obvious.

techpression•1mo ago

Just last week Opus 4.5 decided that the way to fix a test was to change the code so that everything else but the test broke.

When people say ”fix stuff” I always wonder if it actually means fix, or just make it look like it works (which is extremely common in software, LLM or not).

simonw•1mo ago

What did Opus do when you told it that it shouldn't have done that?

layer8•1mo ago

It apologized. ;)

viraptor•1mo ago

Sure, I get an occasional bad result from Opus - then I revert and try again, or ask it for a fix. Even with a couple of restarts, it's going to be faster than me on average. (And that's ignoring the situations where I have to restart myself)

Basically, you're saying it's not perfect. I don't think anyone is claiming otherwise.

b3kart•1mo ago

The problem is it’s imperfect in very unpredictable ways. Meaning you always need to keep it on a short leash for anything serious, which puts a limit on the productivity boost. And that’s fine, but does this match the level of investment and expectations?

techpression•1mo ago

It’s not about being perfect, it’s about not being as great as the marketing, and many proponents, claim.

The issue is that there’s no common definition of ”fixed”. ”Make it run no matter what” is a more apt description in my experience, which works to a point but then becomes very painful.

baq•1mo ago

Nice. Did it realize the mistake and correct it?

techpression•1mo ago

Nope, I did get a lot of fancy markdown with emojis though so I guess that was a nice tradeoff.

In general, even with access to the entire code base (which is very small), I find the inherent need in the models to satisfy the prompter to be their biggest flaw since it tends to constantly lead down this path. I often have to correct overly convoluted SQL too because my problems are simple and the training data seems to favor extremely advanced operations.

Madmallard•1mo ago

Have you tried using it for anything actually complicated?

Lol. It's worse than nothing at all.

lukaslalinsky•1mo ago

I think the split between vibe coding and AI-assisted coding will only widen over time. If you ask LLMs to do something complex, they will fail and you waste your time. If you work with them as a peer, and you delegate tasks to them, they will succeed and you save your time.

watwut•1mo ago

I work with peers by delegating complex tasks to them while I do other complex tasks.

hapticmonkey•1mo ago

It’s not the technology I’m dismissive about. It’s the economics.

25 years ago I was optimistic about the internet, web sites, video streaming, online social systems. All of that. Look at what we have now. It was a fun ride until it all ended up “enshitified”. And it will happen to LLMs, too. Fool me once.

Some developer tools might survive in a useful state on subscriptions. But soon enough the whole A.I. economy will centralise into 2 or 3 major players extracting more and more revenue over time until everyone is sick of them. In fact, this process seems to be happening at a pretty high speed.

Once the users are captured, they’ll orient the ad-spend market around themselves. And then they’ll start taking advantage of the advertisers.

I really hope it doesn’t turn out this way. But it’s hard to be optimistic.

Al-Khwarizmi•1mo ago

Contrary to the case for the internet, there is a way out, however - if local, open-source LLMs get good. I really hope they do, because enshittification does seem unavoidable if we depend on commercial offerings.

ndiddy•1mo ago

Well the "solution" for that will be the GPU vendors focusing solely on B2B sales because it's more profitable, therefore keeping GPUs out of the hands of average consumers. There are leaks suggesting that nVidia will gradually hike the prices of their 5090 cards from $2000 to $5000 due to RAM price increases ( https://wccftech.com/geforce-rtx-5090-prices-to-soar-to-5000... ). At that point, why even bother with the R&D for newer consumer cards when you know that barely anyone will be able to afford them?

tgv•1mo ago

The negatives outweigh the positives, if only because the positives are so small. A bunch of coders making their lives easier doesn't really matter, but pupils and students skipping education does. As a meme said: you had better start eating healthy, because your future doctor vibed his way through med school.

biscuit1v9•1mo ago

This. I don't know why this is not upvoted more.

The education part is on point. As a CS student, I see many of my colleagues using AI tools way too much for instant homework solving, without even processing the answers much.

phatfish•1mo ago

Maybe because the hype for a next-gen search engine that can also just make things up when you query it is a bit much?

jcims•1mo ago

It feels like there are several conversations happening that sound the same but are actually quite different.

One of them is whether or not large models are useful and/or becoming more useful over time. (To me, clearly the answer is yes)

The other is whether or not they live up to the hype. (To me, clearly the answer is no)

There are other skirmishes around capability for novelty, their role in the economy, their impact on human cognition, if/when AGI might happen and the overall impact to the largely tech-oriented community on HN.

Atomic_Torrfisk•1mo ago

> HN readers are going through 5 stages of grief

So we are just irrational and sour?

claudiug•1mo ago

Because of lies. All the people involved in this, the ones with a C-level title, tell us how great it is now.

jheez3•1mo ago

"I can see it delivering impact bigger than the internet itself. Both require a lot of investments."

lol.... Just make sure you screenshot your post so you have a good reminder in a few years re. your predictive ability.

threethirtytwo•1mo ago

Predicting your own victory instead of defending it is a bold strategy.

syndacks•1mo ago

I can’t get over the range of sentiment on LLMs. HN leans snake oil, X leans “we’re all cooked”; can it possibly be both? How do other folks make sense of this? I’m not asking for a side, rather understanding the range. Does the range lead you to believe X over Y?

zahlman•1mo ago

I'm not really convinced that anywhere leans heavily towards anything; it depends which thread you're in etc.

It's polarizing because it represents a more radical shift in expected workflows. Seeing that range of opinions doesn't really give me a reason to update, no. I'm evaluating based on what makes sense when I hear it.

thisoneisreal•1mo ago

My take (no more informed than anyone else's) is that the range indicates this is a complex phenomenon that people are still making sense of. My suspicion is that something like the following is going on:

1. LLMs can do some truly impressive things, like taking natural language instructions and producing compiling, functional code as output. This experience is what turns some people into cheerleaders.

2. Other engineers see that in real production systems, LLMs lack sufficient background / domain knowledge to effectively iterate. They also still produce output, but it's verbose and essentially missing the point of a desired change.

3. LLMs also can be used by people who are not knowledgeable to "fake it," and produce huge amounts of output that is basically beside-the-point bullshit. This makes those same senior folks very, very resentful, because it wastes a huge amount of their time. This isn't really the fault of the tool, but it's a common way the tool gets used and so it gets tarnished by association.

4. There is a ridiculous amount of complexity in some of these tools and workflows people are trying to invent, some of which is of questionable value. So aside from the tools themselves people are skeptical of the people trying to become thought leaders in this space and the sort of wild hacks they're coming up with.

5. There are real macro questions about whether these tools can be made economical to justify whatever value they do produce, and broader questions about their net impact on society.

6. Last but not least, these tools poke at the edges of "intelligence," the crown jewel of our species and also a big source of status for many people in the engineering community. It's natural that we're a little sensitive about the prospect of anything that might devalue or democratize the concept.

That's my take for what it's worth. It's a complex phenomenon that touches all of these threads, so not only do you see a bunch of different opinions, but the same person might feel bullish about one aspect and bearish about another.

johnfn•1mo ago

I believe the spikiness in response is because AI itself is spiky - it’s incredibly good at some classes of tasks, and remarkably poor at others. People who use it on the spikes are genuinely amazed because of how good it is. This does nothing but annoy the people who use it in the troughs, who become increasingly annoyed that everyone seems to be losing their mind over something that can’t even do (whatever).

llmslave2•1mo ago

Because there is a wide range of what people consider good. If you look at what the people on X consider to be good, it's not very surprising.

coffeefirst•1mo ago

Well, this is the internet. Arguing about everything is its favorite pastime.

But generally yes, I think back to Mongo/Node/metaverse/blockchain/IDEs/tablets and pretty much everything has had its boosters and skeptics, this is just more... intense.

Anyway I've decided to believe my own eyes. The crowds say a lot of things. You can try most of it yourself and see what it can and can't do. I make a point to compare notes with competent people who also spent the time trying things. What's interesting is most of their findings are compatible with mine, including for folks who don't work in tech.

Oh, and one thing is for sure: shoving this technology into every single application imaginable is a good way to lose friends and alienate users.

jheez3•1mo ago

Only those with great taste are well-equipped to make assertions about what we have in front of us.

The rest is all noise and personally I just block it out.

threethirtytwo•1mo ago

Then why are you still here?

nstart•1mo ago

The problem with X is that so many people who have no verifiable expertise are super loud in shouting "$INDUSTRY is cooked!!" every time a new model releases. It's exhausting and untrue. The kind of video generation we see might nail realism but if you want to use it to create something meaningful which involves solving a ton of problems and making difficult choices in order to express an idea, you run into the walls of easy work pretty quickly. It's insulting then for professionals to see manga PFPs on X put some slop together and say "movie industry is cooked!". It betrays a lack of understanding of what it takes to make something good and it gives off a vibe of "the loud ones are just trying to force this objectively meh-by-default thing to happen".

The other day there was that dude loudly arguing about some code they wrote/converted even after a woman with significant expertise in the topic pointed out their errors.

Gen AI has its promise. But when you look at the lack of ethics from the industry, the cacophony of voices of non experts screaming "this time it's really doom", and the weariness/wariness that set in during the crypto cycle, it's a natural tendency that people are going to call snake oil.

That said, I think the more accurate representation here is that HN as a whole is calling the hype snake oil. There's very little question anymore about the tools being capable of advanced things. But there is annoyance at proclamations of it being beyond what it really is at the moment which is that it's still at the stage of being an expertise+motivation multiplier for deterministic areas of work. It's not replacing that facet any time soon on its current trend (which could change wildly in 2026). Not until it starts training itself I think. Could be famous last words

senordevnyc•1mo ago

I’d put more faith in HN’s proclamations if it hadn’t widely been wrong about AI in 2023, 2024, and now 2025. Watching the tone shift here has been fascinating. As the saying goes, the only thing moving faster than AI advances right now is the speed at which HN haters move the goalposts…

habinero•1mo ago

Mmm. People who make AI their entire personality and brag that other people are too stupid to see what they see and soon they'll have to see the genius they're denying...does not make me think "oh, wow, what have I missed in AI".

3A2D50•1mo ago

AI has raised the barrier to all but the top and is threatening many people's livelihoods. It has significantly increased the cost of computer hardware and is projected to increase the cost of electricity. I can definitely see why there is a tone shift! I'm still rooting for AI in general. Would love to see the end of a lot of diseases. I don't think we humans can cure all disease on our own in any of our lifetimes. Of course there are all sorts of dystopian consequences that may derive from AI fully comprehending biology. I'm going to continue being naive and hope for the best!

Madmallard•1mo ago

I use them daily and I actively lose progress on complex problems and save time on simple problems.

PeterHolzwarth•1mo ago

I think it may be all summed up by Roy Amara's observation that "We tend to overestimate the effect of a technology in the short run and underestimate the effect in the long run."

ManuelKiessling•1mo ago

I think this is the most-fitting one-liner right now.

The arguments going back and forth in these threads are truly a sight to behold. I don’t want to lean to any one side, but in 2025 I‘ve begun to respond to everyone who still argues that LLMs are only plagiarism machines, or are only better autocompletes, or are only good at remixing the past: Yes, correct!

And CPUs can only move zeros and ones.

This is likewise a very true statement. But look where having 0s and 1s shuffled around has brought us.

The ripple effects of a machine doing something very simple and near-meaningless, but doing it at high speed and again and again without getting tired, should not be underestimated.

At the same time, here is Nobel Laureate Robert Solow, who famously, and at the time correctly, stated that "You can see the computer age everywhere but in the productivity statistics."

It took a while, but eventually, his statement became false.

legulere•1mo ago

The effects might be drastically different from what you would expect though. We’ve seen this with machine learning/AI again and again that what looks probable to work doesn’t work out and unexpected things work.

xboxnolifes•1mo ago

From my perspective, both show HN and Twitter's normal biases. I view HN as generally leaning toward "new things suck, nothing ever changes", and I view Twitter generally as "Things suck, and everything is getting worse". Both of those align with snake oil and we're all cooked.

sanderjd•1mo ago

As usual, somewhere in between!

sph•1mo ago

Truth lies in the middle. Yes, LLMs are an incredible piece of technology, and yes, we are cooked because once again technologists and VCs have no idea about, nor interest in understanding, the long-term societal ramifications of technology.

Now we are starting to agree that social media has had disastrous effects that have not fully manifested yet, and in the same breath we accept a piece of technology that promises to replace large parts of society with machines controlled by a few megacorps and we collectively shrug with “eh, we’re gonna be alright.” I mean, until recently the stated goal was to literally recreate advanced super-intelligence with the same nonchalance one releases a new JavaScript framework unto the world.

I find it utterly maddening how divorced STEM people have become from philosophical and ethical concerns of their work. I blame academia and the education system for creating this massive blind spot, and it is most apparent in echo chambers like HN that are mostly composed of Western-educated programmers with a degree in computer science. At least on X you get, among the lunatics, people that have read more than just books on algorithms and startups.

jheez3•1mo ago

"that have not fully manifested yet"

This is not true.

"I find it utterly maddening how divorced STEM people have become from philosophical and ethical concerns of their work. I blame academia and the education system for creating this massive blind spot, and it is most apparent in echo chambers like HN that are mostly composed of Western-educated programmers with a degree in computer science. At least on X you get, among the lunatics, people that have read more than just books on algorithms and startups."

Steve Jobs had something to say about this. Shame he's gone.

senordevnyc•1mo ago

Because it turns out that HN is mostly made up of cranky middle-aged conservatives (small c) who have largely defined themselves around coding, and AI is an existential threat to their core identity.

vanderZwan•1mo ago

Speaking of new year and AI: my phone just suggested "Happy Birthday!" as the quick-reply to any "Happy New Year!" notification I got in the last hours.

I'm not too worried about my job just yet.

pants2•1mo ago

It won't help to point out the worst examples. You're not competing with an outdated Apple LLM running on a phone. You're competing with Anthropic frontier models running on a multimillion dollar rack of servers.

vanderZwan•1mo ago

Sounds like I'm much more affordable with better ROI

gverrilla•1mo ago

This year I had a Spotify and a YouTube thing to "recall my year", and it was absolute garbage (30% truth, to be exact). I think they're doing it more like an exercise to build up systems, infra, processes, people, etc - it's already clear they don't actually care about users.

ogou•1mo ago

This is a good tooling survey of the past year. I have been watching it as a developer re-entering the job market. The job descriptions closely parallel the timeline used in the post. That's bizarre to me because these approaches are changing so fast. I see jobs for "Skill and Langchain experts with production-grade 0>1 experience. Former founders preferred". That is an expertise that is just a few months old and startups are trying to build whole teams overnight with it. I'm sure January and February will have job postings for whatever gets released that week. It's all so many sand castles.

weatherlite•1mo ago

> Skill and Langchain experts with production-grade 0>1 experience.

Also, it's just normal backend work - calling a bunch of APIs. What am I missing here?

walthamstow•1mo ago

Buzzwords.

XenophileJKO•1mo ago

That is like saying training tensorflow models is just calling some APIs.

Actually making a system like this work seems easy, but isn't really.

(Though with the CURRENT generation or two of models it has gotten "pretty easy" I think. Before that, not so much.)

weatherlite•1mo ago

No idea about training TensorFlow models - is it super complex or is it just calling a couple of APIs? LangChain is literally calling an API. Maybe you need to get good with prompting or whatever, but I don't see where the complexity lies. Please let me know.

andy99•1mo ago

Having used both TensorFlow (though I expect they mean PyTorch, which is way more popular and which I have also used) and LangChain, I can say they are nothing alike.

The ML frameworks are much closer to implementing the mathematics of neural networks, with some abstractions but much closer to the linear algebra level. It requires an understanding of the underlying theory.

Langchain is a suite of convenience functions for composing prompts to LLMs. I wouldn’t consider there to be some real domain knowledge one would need to use it. There is a learning curve but it’s about learning the different components rather than learning a whole new academic discipline.
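To make "convenience functions for composing prompts" concrete, here is a minimal sketch in plain Python. It deliberately does not use LangChain's actual API; `fill`, `chain`, and `echo_model` are hypothetical names, and `echo_model` is a stand-in for a real LLM call.

```python
from typing import Callable

Model = Callable[[str], str]  # anything that maps a prompt to a completion

def fill(template: str, **values: str) -> str:
    """Substitute named values into a prompt template."""
    return template.format(**values)

def chain(model: Model, *templates: str) -> Callable[[str], str]:
    """Build a pipeline: each step's output becomes the next step's {input}."""
    def run(text: str) -> str:
        for template in templates:
            text = model(fill(template, input=text))
        return text
    return run

def echo_model(prompt: str) -> str:
    """Stand-in for a real LLM API call; just upper-cases the prompt."""
    return prompt.upper()

pipeline = chain(echo_model, "Summarize: {input}", "Translate to French: {input}")
print(pipeline("hello"))
```

The learning curve is in knowing which pieces to wire together, which is the point being made above: it is plumbing rather than a new discipline.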

HarHarVeryFunny•1mo ago

There's a big difference between building an ML framework like Tensorflow or PyTorch (I built a Lua Torch-like one in C++ myself) and just using it to build/train a model.

Building the model may range from very simple if you are just recreating a standard architecture, or be a research endeavor if you are designing something completely new.

The difficulty/complexity of then training the model depends on what it is. For something simple like a CNN for image recognition, it's really just a matter of selecting a few hyperparameters and letting it rip. At the other end of the spectrum you've got LLMs where training (and coping with instabilities) is something of a black art, with RL training completely different from pre-training, and there is also the issue of designing/discovering a pre/mid/post training curriculum.

But anyways, the actual training part can be very simple, not requiring too much knowledge of what's going on under the hood, depending on the model.

ogou•1mo ago

You're right, none of these new tools are disciplines. They are vendor specific approaches that are very recent. That's part of my overall point. Who is out there with 2+ years of very narrow tooling experience at another company at a senior level and is available for a rando startup (or desperate enterprise looking for bolt-on AI features) at a fraction of the pay? Not many, I'm sure. We can level up, do training, and maybe stand up a demo project. But that won't satisfy an ATS scan. It's unrealistic.

jennyholzer3•1mo ago

LLM addicts are cult members. Proficiency with buzz words is used to demonstrate status within the cult.

blutoot•1mo ago

I hope 2026 will be the year when software engineers and recruiters will stop the obsession with leetcode and all other forms of competitive programming bullshit

jennyholzer3•1mo ago

Thanks to Klaude Kode it'll be at least another 20 years of this.

If you don't make software developers prove their literacy you will get burned.

andrewinardeer•1mo ago

Thank you. Enjoyed this read.

AI slop videos will no doubt get longer and "more realistic" in 2026.

I really hope social media companies plaster a prominent banner over them which screams, "Likely/Made by AI" and give us the option to automatically mute these videos from our timeline. That would be the responsible thing to do. But I can't see Alphabet doing that on YT, xAI doing that on X or Meta doing that on FB/Insta as they all have skin in the video gen game.

sexy_seedbox•1mo ago

For image generation, it's already too realistic with Z-Image + Custom LoRAs + SeedVR2 upscaling.

hooverd•1mo ago

I do think for the solution of say non-consensual pornography the only solution is incredible violence against people making it.

compass_copium•1mo ago

>I really hope social media companies plaster a prominent banner over them which screams, "Likely/Made by AI" and give us the option to automatically mute these videos from our timeline.

They should just be deleted. They will not be, because they clearly generate ad revenue.

cube00•1mo ago

> social media companies plaster a prominent banner over them

Not going to happen as the social media companies realise they can sell you the AI tools used to post slop back onto the platform.

compass_copium•1mo ago

>I’m still holding hope that slop won’t end up as bad a problem as many people fear.

That's the pure, uncut copium. Meanwhile, in the real world, search on major platforms is so slanted towards slop that people need to specify that they want actual human music:

https://old.reddit.com/r/MusicRecommendations/comments/1pq4f...

apolloartemis•1mo ago

Thank you for your warning about the normalization of deviance. Do you think there will be an AI agent software worm like NotPetya which will cause a lot of economic damage?

simonw•1mo ago

I'm expecting something like a malicious prompt injection which steals API keys and crypto wallets and uses additional tricks to spread itself further.

Or targeted prompt injections - like spear phishing attacks - against people with elevated privileges (think root sysadmins) who are known to be using coding agents.

lukaslalinsky•1mo ago

Speaking of asynchronous agents, what do people use? Claude Code for web is extremely limited, because you have no custom tools. Claude Code in GitHub Actions is vastly more useful, due to the custom environment, but awkward to use interactively. Are there any good alternatives?

simonw•1mo ago

I use Claude Code for web with an environment allowing full internet access, which means it can install extra tools as and when it needs them. I don't run into limits with it very often.

jimmySixDOF•1mo ago

Pretty sure next year's wrapup will have "Year of the sub-agent"

jes5199•1mo ago

I'm running Claude Code in a tmux on a VPS, and I'm working on setting up a meta-agent who can talk to me over text messages

absoluteunit1•1mo ago

Hey - this sounds like a really interesting set-up!

Would you be open to providing more details? Would love to hear more about your workflows, etc.

fullstackchris•1mo ago

I just use a couple of custom MCP tools with the standard claude desktop app:

https://chrisfrew.in/blog/two-of-my-favorite-mcp-tools-i-use...

IMO this is the best balance of getting agentic work done while having immediate access to anything else you may need with your development process.

ehsanu1•1mo ago

What exactly do you mean by custom tools here? Just cli tools accessible to the agent?

lukaslalinsky•1mo ago

Development environment needed to build and test the project.

lopatin•1mo ago

The "pelicans on a bike" challenge is pretty widespread now. Are we sure it's still not being trained on?

simonw•1mo ago

See https://simonwillison.net/2025/nov/13/training-for-pelicans-... (also in the pelicans section of the post).

lopatin•1mo ago

> All I’ve ever wanted from life is a genuinely great SVG vector illustration of a pelican riding a bicycle.

Razengan•1mo ago

My experience with AI so far: It's still far from "butler" level assistance for anything beyond simple tasks.

I posted about my failed attempts to get them to review my bank statements [0] and generally got gaslit about how I was doing it wrong, and that if I trusted them with full access to my disk and terminal, they could do it better.

But I mean, at that point, it's still more "manual intelligence" than just telling someone what I want. A human could easily understand it, but AI still takes a lot of wrangling and you still need to think from the "AI's PoV" to get good results.

[0] https://news.ycombinator.com/item?id=46374935

----

But enough whining. I want AI to get better so I can be lazier. After trying them for a while, one feature that I think all natural-language AIs need to have would be the ability to mark certain sentences as "Do what I say" (aka Monkey's Paw) and others as "Do what I mean", like how you wrap phrases in quotes on Google etc. to indicate a verbatim search.

So for example I could say "[[I was in Japan from the 5th to 10th]], identify foreign currency transactions on my statement with "POS" etc in the description" then the part in the [[]] (or whatever other marker) would be literal, exactly as written, but the rest of the text would be up to the AI's interpretation/inference so it would also search for ATM withdrawals etc.
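As a rough sketch of how a client could pre-process such markers before anything reaches a model (the `[[...]]` syntax is the one proposed above; `split_prompt` and the "literal"/"infer" labels are hypothetical names, not an existing API):

```python
import re

# Non-greedy match so that several [[...]] spans in one prompt stay separate.
LITERAL = re.compile(r"\[\[(.*?)\]\]", re.DOTALL)


def split_prompt(text):
    """Split text into (kind, segment) pairs, where kind is 'literal' or 'infer'."""
    parts = []
    pos = 0
    for match in LITERAL.finditer(text):
        if match.start() > pos:
            # Text between markers is left to the model's interpretation.
            parts.append(("infer", text[pos:match.start()]))
        # Text inside [[...]] must be taken exactly as written.
        parts.append(("literal", match.group(1)))
        pos = match.end()
    if pos < len(text):
        parts.append(("infer", text[pos:]))
    return parts


parts = split_prompt(
    "[[I was in Japan from the 5th to 10th]], identify foreign currency transactions"
)
```

The literal segments could then be passed to the model as hard constraints (or checked against its output afterwards), while the rest is treated as a loose request. How a model would actually honor that distinction is the open part.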

Ideally, eventually we should be able to have multiple different AI "personas" akin to different members of household staff: your "chef" would know about your dietary preferences, your "maid" would operate your Roomba, take care of your laundry, your "accountant" would do accounty stuff.. and each of them would only learn about that specific domain of your life: the chef would pick up the times when you get hungry, but it won't know about your finances, and so on. The current "Projects" paradigm is not quite that yet.

ksec•1mo ago

All these improvements in a single year, 2025. While this may seem obvious to those who follow AI / LLM news, it may be worth pointing out again that ChatGPT was introduced to us in November 2022.

I still don't believe AGI, ASI or whatever AI will take over from humans in a short period of time, say 10 - 20 years. But it is hard to argue against the value of current AI, which many of the vocal critics on HN seem to do. People are willing to pay $200 per month, and it is getting $1B dollar runway already.

Being more of a hardware person, the most interesting part to me is the funding of all the latest hardware development. I know this is another topic HN hates because of the DRAM and NAND pricing issue. But it is exciting to see this from a long-term view, where the pricing is short-term pain. Right now the industry is saying: together we have over a trillion dollars to spend on capex over the next few years, and will even borrow more if need be, so when can you ship us 16A / 14A / 10A and 8A or 5A, LPDDR6, higher-capacity DRAM at lower power usage, better packaging, higher-speed PCIe or a jump to optical interconnect? Every single part of the hardware stack is being infused with money and demand. The last time we had this was the Post-PC / smartphone era, which drove the hardware industry forward for 10 - 15 years. The current AI wave can at least push hardware for another 5 - 6 years while pulling forward tech that was initially 8 - 10 years away.

I so wish I had bought some Nvidia stock. Again, I guess no one knew AI would be as big as it is today, and it has only just started.

coffeebeqn•1mo ago

Seems like Nvidia will be focusing on the super beefy GPUs and leaving the consumer market to a smaller player

_s•1mo ago

AMD owns a lot of the consumer market already; handhelds, consoles, desktop rigs and mobile ... they are not a small player.

utopiah•1mo ago

They said "smaller" not small.

ac29•1mo ago

Intel's client computing revenue was greater than AMD's entire revenue last quarter

Flow•1mo ago

I don't get why Nvidia can't do both? Is it because of the limited production capabilities of the factories?

ACCount37•1mo ago

Yes. If you're bottlenecked on silicon and secondaries like memory, why would you want to put more of those resources into lower margin consumer products if you could use those very resources to make and sell more high margin AI accelerators instead?

From a business standpoint, it makes some sense to throttle the gaming supply some. Not to the point of surrendering the market to someone else probably, but to a measurable degree.

ksec•1mo ago

We will have to wait and see, but my bet is that Nvidia will move to the leading-edge N2 node earlier now that they have the margin to work with. Both Hopper and Blackwell were too late in the design cycle. The AI hype will continue to buy the latest and greatest, leaving gaming on a mainstream node.

Nvidia using a mainstream node has always been the norm, considering most fab capacity always goes to mobile SoCs first. But I expect the internet / gamers will be angry anyway because Nvidia does not provide them with the latest and greatest.

In reality the extra R&D cost of designing for the leading edge will be amortised by all the AI orders, which gives Nvidia a competitive advantage at the consumer level when they compete. That is assuming there is competition, because the most recent data shows Nvidia owning 90%+ of the discrete GPU market, with 9% for AMD and 1% for Intel.

utopiah•1mo ago

> All these improvements in a single year

> hard to argue against the value of current AI

> People are willing to pay $200 per month, and it is getting $1B dollar runway already.

Those are 3 different things. There can be a LOT of fast and significant improvements while we still remain extremely far from the actual goal, so far that it actually looks like little progress.

People pay for a lot of things, including snake oil, so convincing a lot of people to pay a bit is not in itself a proof of value, especially when some people are basically coerced into this, see how many companies changed their "strategy" to mandating AI usage internally, or integration for a captive audience e.g. Copilot.

Finally yes, $1B is a LOT of money for you and I... but for the largest corporations it's actually not a lot. For reference Google earned that in revenue... per day in 2023. Anyway that's still a big number BUT it still has to be compared with, well how much does OpenAI burn. I don't have any public number on that but I believe the consensus is that it's a lot. So until we know that number we can't talk about an actual runway.

aspenmartin•1mo ago

> People pay for a lot of things, including snake oil, so convincing a lot of people to pay a bit is not in itself a proof of value

But do you really believe e.g. Claude code is snake oil? I pay $200 / month for Claude, which is something I would have thought monumentally insane maybe 1-2 years ago (e.g. when ChatGPT came out with their premium subscription price I thought that seemed so out of touch). I don't think we would be seeing the subscription rates and the retention numbers if it really was snake oil.

> Finally yes, $1B is a LOT of money for you and I... but for the largest corporations it's actually not a lot. For reference Google earned that in revenue... per day in 2023. Anyway that's still a big number BUT it still has to be compared with, well how much does OpenAI burn. I don't have any public number on that but I believe the consensus is that it's a lot. So until we know that number we can't talk about an actual runway.

this gets brought up a lot but I'm not sure I understand why folks on a forum called YCombinator, a startup accelerator, would make this sound like an obvious sign of charlatanism; operating at a loss is nothing new and anthropic / openAI strategy seems perfectly rational: they are scaling and capturing market share, and TAM is insane.

jimmaswell•1mo ago

> many companies changed their "strategy" to mandating AI usage internally

Are they hiring? My job is still dragging its feet on approving copilot.

chias•1mo ago

These are not all improvements. Listed:

* The year of YOLO and the Normalization of Deviance

* The year that Llama lost its way

* The year of alarmingly AI-enabled browsers

* The year of the lethal trifecta

* The year of slop

* The year that data centers got extremely unpopular

steveBK123•1mo ago

> * The year that data centers got extremely unpopular

I was discussing the political angle with a friend recently. I think Big Tech Bro / VC complex has done themselves a big disservice by aligning so tightly with MAGA to the point AI will be a political issue in 2026 & 2028.

Think about the message they’ve inadvertently created themselves - AI is going to replace jobs, it’s pushing electric prices up, we need the government to bail us out AND give us a regulatory light touch.

Super easy campaign for Dems - big tech trumpers are taking your money, your jobs, causing inflation, and now they want bailouts !!

mbesto•1mo ago

Said differently - the year we start to see all of the externalities of a globally scaled hyped tech trend.

Y_Y•1mo ago

Not that YOLO, pjreddie released that in 2015

jillesvangurp•1mo ago

2025 was the year of AI agents using development tools. I think we'll now shift attention to AI agents using non-development tools. Most business users are still stuck using ChatGPT as some kind of grand oracle that will write their email or PowerPoint slides. There are bits and pieces of mostly technology-demo-level solutions, but nothing that is as widely used as AI coding tools so far. I don't think this is bottlenecked on model quality.

I don't need an AGI. I do need a secretary type agent that deals with all the simple yet laborious non technical tasks that keep infringing on my quality engineering time. I'm CTO for a small startup and the amount of non technical bullshit that I need to deal with is enormous. Some examples of random crap I deal with: figuring out contracts, their meaning/implication to situations, and deciding on a course of action; Customer offers, price calculations, scraping invoices from emails and online SaaS accounts, formulating detailed replies to customer requests, HR legal work, corporate bureaucracy, financial planning, etc.

A lot of this stuff can be AI assisted (and we get a lot of value out of ai tools for this) but context engineering is taking up a non trivial amount of my time. Also most tools are completely useless at modifying structured documents. Refactoring a big code base, no problem. Adding structured text to an existing structured document, hardest thing ever. The state of the art here is an ff-ing sidebar that will suggest you a markdown formatted text that you might copy/paste. Tool quality is very primitive. And then you find yourself just stripping all formatting and reformatting it manually. Because the tools really suck at this.

arcatech•1mo ago

> Some examples of random crap I deal with: figuring out contracts, their meaning/implication to situations, and deciding on a course of action

This doesn’t sound like bullshit you should hand off to an AI. It sounds like stuff you would care about.

nrclark•1mo ago

Agree. Even asking it can anchor your thinking.

jillesvangurp•1mo ago

I do care about it; kind of my duty as a co-founder. Which is why I'm spending double digit percentages of my time doing this stuff. But I absolutely could use some tools to cut down on a lot of the drudgery that is involved with this. And me reading through 40 pages of dense legal German isn't one of my strengths since I 1) do not speak German 2) am not a lawyer and 3) am not necessarily deeply familiar with all the bureaucracy, laws, etc.

But I can ask an LLM intelligent questions about that contract (in English), shoot a few things back and forth, come up with some kind of action plan, and then run it by our lawyers and other advisors.

That's not some kind of hypothetical thing. That's something that happened multiple times in our company in the last few months. LLMs are very empowering for dealing with this sort of thing. You still need experts for some stuff. But you can do a lot more yourself now. And as we've found out, some of the "experts" that we relied on in the past actually did a pretty shoddy job. A lot of this stuff was about picking apart the mess they made and fixing it.

As soon as you start drafting contracts, it gets a lot harder. I just went through a process like that as well. It involves a lot of manual work that is basically about formatting documents, drafting text, running PDFs and text snippets through ChatGPT for feedback, sparring, criticism, etc. and iterating on that. This is not about vibe coding some contract but making sure every letter of a contract is right. That ultimately involves lawyers and negotiating with other stakeholders, but it helps if you come prepared with a document that is more or less ready to sign off on.

It's not about handing stuff off but about making LLMs work for you. Just like with coding tools. I care about code quality as well. But I still use the tools to save me a lot of time.

simonw•1mo ago

One of the lessons I learned running a startup is that it doesn't matter how good the professionals you hire are for things like legal and accounting, you still need to put work in yourself.

Everyone makes mistakes and misses things, and as the co-founder you have to care more about the details than anyone else does.

I would have loved to have weird-unreliable-paralegal-Claude available back when I was doing that!

topaztee•1mo ago

`Also most tools are completely useless at modifying structured documents`

we built a tool for this for the life science space and are opening it up to the general public very soon. Email me I can give you access (topaz at vespper dot com)

jennyholzer3•1mo ago

you don't need AGI, you need human labor

pjc50•1mo ago

Investing a trillion dollars for a revenue of a billion dollars doesn't sound great yet.

steveBK123•1mo ago

Indeed, it's the old Uber playbook at nearly two extra orders of magnitude.

It is a large enough number to simply run out of private capital to consume before it turns cash flow positive.

Lots of things sell well if sold at such a loss. I’d take a new Ferrari for $2500 if it was on offer.

derwiki•1mo ago

Uber’s playbook worked for Uber

aoeusnth1•1mo ago

You say that as if Uber's playbook didn't work. Try this: https://www.google.com/finance/quote/UBER:NYSE

pjc50•1mo ago

Did Uber actually do a lot of capital investment? They don't own the cars, for example.

simonw•1mo ago

I believe they spent a huge amount of money on incentives to help sign up drivers, and discounts to help attract customers.

pjc50•1mo ago

Yes, but that's loss leader rather than capital investment. You can't put a customer on the balance sheet and depreciate them. Once you've paid for a free ride, you own nothing tangible.

shimman•1mo ago

Uber nakedly broke the law and beat down labor, I'm honestly shocked none of the executives went to prison.

derektank•1mo ago

Uber didn’t beat down labor, they beat down capital, specifically the capital that owned (and lobbied for the existence of) taxi medallions

shimman•1mo ago

No, you clearly have never talked to the workers at Uber (no not the devs, the drivers). Uber has disgustingly fought against unionization efforts, employee benefit efforts, and increasing wages. Such actions do not make you a good company, especially when the executives make millions while fighting against workers wanting a better life.

They are an evil company and the rot has been there since inception. This isn't even getting into their disgusting internal predator culture against women either.

signatoremo•1mo ago

Companies benefiting from trillion dollars spent during the dotcom era certainly make more than a billion dollars, for the last 20 years.

Intellectual dishonesty is certainly rampant on HN.

ACCount37•1mo ago

Is the AI progress in 2025 an outstanding breakthrough? Not really. It's impressive but incremental.

Still, the gap between the capabilities of a cutting edge LLM and that of a human is only this wide. There are only this many increments it takes to cross it.

wpietri•1mo ago

This is not a great argument:

> But it is hard to argue against the value of current AI [...] it is getting $1B dollar runway already.

The psychic services industry makes over $2 billion a year in the US [1], with about a quarter of the population being actual believers [2].

[1] https://www.ibisworld.com/united-states/industry/psychic-ser...

[2] https://news.gallup.com/poll/692738/paranormal-phenomena-met...

apexalpha•1mo ago

What if these provide actual value through placebo-effect?

recursive•1mo ago

You talking about psychics or LLMs?

grosswait•1mo ago

Yes

wpietri•1mo ago

I think we have different definitions of "actual value". But even if I pick the flaccid definition, that isn't proof of value of the thing itself, but of any placebo. In which case we can focus on the cheapest/least harmful placebo. Or, better, solving the underlying problem that the placebo "helps".

computably•1mo ago

I'll preface by saying I fully agree that psychics aren't providing any non-placebo value to believers, although I think it's fine to provide entertainment for non-believers.

> Or, better, solving the underlying problem that the placebo "helps".

The underlying problems are often a lack of a decent education and a generally difficult/unsatisfying life. Systemic issues which can't be meaningfully "solved" without massive resources and political will.

jay_kyburz•1mo ago

Actually, I'd go one step further and say they are harmful to everybody else.

It might just be my circles, but I've seen Carl Sagan's quote everywhere in the last couple of months.

"“Science is more than a body of knowledge; it is a way of thinking. I have a foreboding of an America in my children’s or grandchildren’s time—when the United States is a service and information economy; when nearly all the key manufacturing industries have slipped away to other countries; when awesome technological powers are in the hands of a very few, and no one representing the public interest can even grasp the issues; when the people have lost the ability to set their own agendas or knowledgeably question those in authority; when, clutching our crystals and nervously consulting our horoscopes, our critical faculties in decline, unable to distinguish between what feels good and what’s true, we slide, almost without noticing, back into superstition and darkness.”"

wpietri•1mo ago

If we look back over the last century or so, I think we've made excellent progress on that. The main current barrier is that we've lately let people with various pathologies run wild, but historically that creates enough problems that the political will emerges. See, e.g., the American and French revolutions, or India's independence, or the US civil war and Reconstruction.

ctoth•1mo ago

2022/2023: "It hallucinates, it's a toy, it's useless."

2024/2025: "Okay, it works, but it produces security vulnerabilities and makes junior devs lazy."

2026 (Current): "It is literally the same thing as a psychic scam."

Can we at least make predictions for 2027? What shall the cope be then! Lemme go ask my psychic.

bopbopbop7•1mo ago

2022/2023: "Next year software engineering is dead"

2024: "Now this time for real, software engineering is dead in 6 months, AI CEO said so"

2025: "I know a guy who knows a guy who built a startup with an LLM in 3 hours, software engineering is dead next year!"

What will be the cope for you this year?

aspenmartin•1mo ago

The cope + disappointment will be knowing that a large population of HN users will paint a weird alternative reality. There are a multitude of messages about AI that are out there, some are highly detached from reality (on the optimistic and pessimistic side). And then there is the rational middle, professionals who see the obvious value of coding agents in their workflow and use them extensively (or figure out how to best leverage them to get the most mileage). I don't see software engineering being "dead" ever, but the nature of the job _has already changed_ and will continue to change. Look at Sonnet 3.5 -> 3.7 -> 4.5 -> Opus 4.5; that was 17 months of development and the leaps in performance are quite impressive. You then have massive hardware buildouts and improvements to stack + a ton of R&D + competition to squeeze the juice out of the current paradigm (there are 4 orders of magnitude of scaling left before we hit real bottlenecks) and also push towards the next paradigm to solve things like continual learning. Some folks have opted not to use coding agents (and some folks like yourself seem to revel in strawmanning people who point out their demonstrable usefulness). Not using coding agents in Jan 2026 is defensible. It won't be defensible for long.

bopbopbop7•1mo ago

Please do provide some data for this "obvious value of coding agents". Because right now the only thing obvious is the increase in vulnerabilities, people claiming they are 10x more productive but aren't shipping anything, and some AI hype bloggers that fail to provide any quantitative proof.

aspenmartin•1mo ago

Sure: at my MAANG company, where I watch the data closely on adoption of CC and other internal coding agent tools, most (significant) LOC are written by agents, and most employees have adopted coding agents as WAU, and the adoption rate is positively correlated with seniority.

Like a lot of things LLM related (Simon Willison's pelican test, researchers + product leaders implementing AI features) I also heavily "vibe" check the capabilities myself on real work tasks. The fact of the matter is I am able to dramatically speed up my work. It may be actually writing production code + helping me review it, or it may be tasks like: write me a script to diagnose this bug I have, or build me a streamlit dashboard to analyze + visualize this ad hoc data instead of me taking 1 hour to make visualizations + munge data in a notebook.

> people claiming they are 10x more productive but aren't shipping anything, and some AI hype bloggers that fail to provide any quantitative proof.

what would satisfy you here? I feel you are strawmanning a bit by picking the most hyperbolic statements and then blanketing that on everyone else.

My workflow is now:

- Write code exclusively with Claude

- Review the code myself + use Claude as a sort of review assistant to help me understand decisions about parts of the code I'm confused about

- Provide feedback to Claude to change / steer it away or towards approaches

- Give up when Claude is hopelessly lost

It takes a bit to get the hang of the right balance but in my personal experience (which I doubt you will take seriously but nevertheless): it is quite the game changer and that's coming from someone who would have laughed at the idea of a $200 coding agent subscription 1 year ago

bopbopbop7•1mo ago

Anecdotes don’t prove anything, especially ones without any metrics, and especially at MAANG where AI use is strongly incentivized.

Evidence is peer reviewed research, or at least something with metrics. Like the METR study that shows that experienced engineers often got slower on real tasks with AI tools, even though they thought they were faster.

aspenmartin•1mo ago

That's why I gave you data! METR study was 16 people using Sonnet 3.5/3.7. Data I'm talking about is 10s of thousands of people and is much more up to date.

There are some counterexamples to METR in the literature, but I'll just say: "rigor" here is very difficult (including for METR) because outcomes are high dimensional and nuanced, or ecological validity is an issue. It's hard to have any approach that someone wouldn't be able to dismiss due to some issue they have with the methodology. The sources below also have methodological problems, just like METR.

https://arxiv.org/pdf/2302.06590 -- 55% faster implementing HTTP server in javascript with copilot (in 2023!) but this is a single task and not really representative.

https://demirermert.github.io/Papers/Demirer_AI_productivity... -- "Though each experiment is noisy, when data is combined across three experiments and 4,867 developers, our analysis reveals a 26.08% increase (SE: 10.3%) in completed tasks among developers using the AI tool. Notably, less experienced developers had higher adoption rates and greater productivity gains." (but e.g. "completed tasks" as the outcome measure is of course problematic)
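For a sense of how noisy that estimate is, here is a quick back-of-the-envelope interval. It assumes the quoted SE is in percentage points and that a normal approximation is reasonable; neither is stated in the excerpt, and `normal_ci` is just an illustrative helper.

```python
def normal_ci(estimate, std_err, z=1.96):
    """Return (low, high) for a normal-approximation confidence interval."""
    margin = z * std_err
    return estimate - margin, estimate + margin


# Figures quoted above: 26.08% increase, SE 10.3 (assumed percentage points).
low, high = normal_ci(26.08, 10.3)
print(f"95% CI: {low:.2f}% to {high:.2f}%")
```

Under those assumptions the interval runs from roughly 6% to 46%, so the data is consistent with anything from a small gain to a very large one.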

To me, internal company measures for large tech companies will be most reliable -- they are easiest to track and measure, the scale is large enough, and the talent + task pool is diverse (junior -> senior, different product areas, different types of tasks). But then outcome measures are always a problem...commits per developer per month? LOC? task completion time? All of them are highly problematic, especially because it's reasonable to expect AI tools would change the bias and variance of the proxy, so it's never clear if you're measuring the change in "style" or the change in the underlying latent measure of productivity you care about.

bopbopbop7•1mo ago

To be fair, I’ll take a non-biased 16 person study over “internal measures” from a MAANG company that burned 100s of billions on AI with no ROI that is now forcing its employees to use AI.

aspenmartin•1mo ago

I could have guessed you would say that :) but METR is not an unbiased study either. Maybe you mean that METR is less likely to intentionally inflate their numbers?

If you insist or believe in a conspiracy I don’t think there’s really anything I or others will be able to say or show you that would assuage you, all I can say is I’ve seen the raw data. It’s a mess and again we’re stuck with proxies (which are bad since you start conflating the change in the proxy-latent relationship with the treatment effect). And it’s also hard and arguably irresponsible to run RCTs.

All I will say is: there are flaws everywhere. METR results are far from conclusive. Totally understandable if there is a mismatch between perception and performance. But also consider: even if a task takes the same or even slightly more time, one big advantage for me is that it substantially reduces cognitive load so I can work in parallel sessions on two completely different issues.

bopbopbop7•1mo ago

I bet it does reduce your cognitive load, considering you, in your own words "Give up when Claude is hopelessly lost". No better way to reduce cognitive load.

aspenmartin•1mo ago

I give up using Claude when it gets hopelessly lost, and then my cognitive load increases.

anorwell•1mo ago

What do you think about the METR 50% task length results? About benchmark progress generally?

ben_w•1mo ago

I don't speak for bopbopbop7, but I will say this: my experience of using Claude Code has been that it can do much longer tasks than the METR benchmark implies are possible.

The converse of this is that if those tasks are representative of software engineering as a whole, I would expect a lot of other tasks where it absolutely sucks.

This expectation is further supported by the number of times people pop up in conversations like this to say for any given LLM that it falls flat on its face even for something the poster thinks is simple, that it cost more time than it saved.

As with supposedly "full" self driving on Teslas, the anecdotes about the failure modes are much more interesting than the success: one person whose commute/coding problem happens to be easy, may mistake their own circumstances for normal. Until it does work everywhere, it doesn't work everywhere.

When I experiment with vibe coding (as in, properly unsupervised), it can break down large tasks into small ones and churn through each sub-task well enough, such that it can do a task I'd expect to take most of a sprint by itself. Now, that said, I will also say it seems to do these things at a level of "that'll do" not "amazing!", but it does do them.

But I am very much aware this is like all the people posting "well my Tesla commute doesn't need any interventions!" in response to all the people pointing out how it's been a decade since Musk said "I think that within two years, you'll be able to summon your car from across the country. It will meet you wherever your phone is … and it will just automatically charge itself along the entire journey."

It works on my [use case], but we can't always ship my [use case].

Ianjit•1mo ago

Meta internal study showed a 6-12% productivity uplift.

https://youtu.be/1OzxYK2-qsI?si=8Tew5BPhV2LhtOg0

insin•1mo ago

> - Give up when Claude is hopelessly lost

You love to see "Maybe completely waste my time" as part of the normal flow for a productivity tool

aspenmartin•1mo ago

That negates everything else? If you have a tool that can boost you for 80% of your work and for the other 20% you just have to do what you’re already doing, is that bad?

shimman•1mo ago

There's a reason why sunk cost IS a fallacy and not a sound strategy.

Denzel•1mo ago

We probably work at the same company, given you used MAANG instead of FAANG.

As one of the WAU (really DAU) you’re talking about, I want to call out a couple things: 1) the LOC metrics are flawed, and anyone using the agents knows this - eg, ask CC to rewrite the 1 commit you wrote into 5 different commits, now you have 5 100% AI-written commits; 2) total speed up across the entire dev lifecycle is far below 10x, most likely below 2x, but I don’t see any evidence of anyone measuring the counterfactuals to prove speed up anyways, so there’s no clear data; 3) look at token spend for power users, you might be surprised by how many SWE-years they’re spending.

Overall it’s unclear whether LLM-assisted coding is ROI-positive.

ben_w•1mo ago

To add to your point:

If the M stands for Meta, I would also like to note that as a user, I have been seeing increasingly poor UI, of the sort I'd expect from people committing code that wasn't properly checked before going live, as I would expect from vibe coding in the original sense of "blindly accept without review". Like, some posts have two copies of the sender's name in the same location on screen with slightly different fonts going out of sync with each other.

I can easily believe the metrics that all [MF]AANG bonuses are denominated in are going up, our profession has had jokes about engineers gaming those metrics even back when our comics were still printed in books: https://imgur.com/bug-free-programs-dilbert-classic-tyXXh1d

aspenmartin•1mo ago

Oh yes all of this I agree with. I had tried to clarify this above but your examples are clearer: my point is: all measures and studies I have personally seen of AI impact on productivity have been deeply flawed for one reason or another.

Total speed up is WAY less than 10x by any measure. 2x seems too high too.

By data alone the impact is a bit unclear, I agree. But I will say there seems to be a clear picture that to me, starting from a prior formed from personal experience, indicates some real productivity impact today, with a trajectory that suggests these claims of a lot of SWE work being offloaded to agents over the next few years are not that far-fetched.

- adoption and retention numbers internally and externally. You can argue this is driven by perverse incentives and/or the perception performance mismatch, but I’m highly skeptical of this even though the effects of both are probably real. It would be truly extraordinary to me if there weren’t at least a ~10-20% bump in productivity today, and a lot of headroom to go as integration gets better, user skill gets better, and model capabilities grow

- benchmark performance, again benchmarks are really problematic but there are a lot of them and all of them together paint a pretty clear picture of capabilities truly growing and growing quickly

- there are clearly biases we can think of that would cause us to overestimate AI impact, but there are also biases that may cause us to underestimate impact: e.g. I’m now able to do work that I would have never attempted before. Multitasking is easier. Experiments are quicker and easier. That may not be captured well by e.g. task completion time or other metrics.

I even agree: quality of agentic code can be a real risk, but:

- I think this ignores the fact that humans have also always written shitty code and always will; there is lots of garbage in production believe me, and that predates agentic code

- as models improve, they can correct earlier mistakes

- it’s also a muscle to grow: how to review and use humans in the loop to improve quality and set a high bar

Denzel•4w ago

Great response, we’re like 98% aligned at a high-level. :) These next few years will be interesting.

Ianjit•1mo ago

The productivity uplift is massive, Meta got a 6-12% productivity uplift from AI coding!

https://youtu.be/1OzxYK2-qsI?si=8Tew5BPhV2LhtOg0

nsxwolf•1mo ago

The nature of my job has always been fighting red tape, process, and stakeholders to deploy very small units of code to production. AI really did not help with much of that for me in 2025.

I'd imagine I'm not the only one who has a similar situation. Until all those people and processes can be swept away in favor of letting LLMs YOLO everything into production, I don't see how that changes.

aspenmartin•1mo ago

No I think that's extremely correct. I work at a MAANG where we have the resources to hook up custom internal LLMs and agents to actually deal with that but that is unique to an org of our scale.

ben_w•1mo ago

> You then have massive hardware buildouts and improvements to stack + a ton of R&D + competition to squeeze the juice out of the current paradigm (there are 4 orders of magnitude of scaling left before we hit real bottlenecks)

This is a surprising claim. There's only 3 orders of magnitude between US data centre electricity consumption and worldwide primary energy (as in, not just electricity) production. Worldwide electricity supply is about 3/20ths of world primary energy, so without very rapid increases in electricity supply there's really only a little more than 2 orders of magnitude growth possible in compute.
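
A quick sketch of that arithmetic, using only the round figures above (3 orders of magnitude, and electricity at roughly 3/20 of primary energy). The variable names are illustrative and the inputs are rough estimates, not measured data:

```python
import math

# Round figures taken from the comment above; both are rough estimates.
orders_to_primary_energy = 3.0   # US data centre electricity -> world primary energy
electricity_share = 3 / 20       # world electricity as a fraction of primary energy

# Scaling the ceiling by a fraction f shifts the headroom by log10(f) orders of magnitude.
orders_to_electricity = orders_to_primary_energy + math.log10(electricity_share)

print(f"{orders_to_electricity:.2f} orders of magnitude")  # prints "2.18 orders of magnitude"
```

So "a little more than 2" comes from subtracting about 0.82 orders of magnitude for the electricity share.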

Renewables are growing fast, but "fast" means "will approach 100% of current electricity demand by about 2032". Which trend is faster, growth of renewable electricity or growth of compute? Trick question, compute is always constrained by electricity supply, and renewable electricity is growing faster than anything else can right now.

aspenmartin•1mo ago

This is not my own claim, it’s based on the following analysis from Epoch: https://epoch.ai/blog/can-ai-scaling-continue-through-2030

But I forgot how old that article is: it’s 4 orders of magnitude past GPT-4 in terms of total compute which is I think only 3.5 orders of magnitude from where we are today (based on 4.4x scaling/yr)
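
To put those figures in years, here is a small sketch that assumes the 4.4x/yr growth rate stays constant (a big assumption). The function name is hypothetical:

```python
import math

def years_to_scale(orders_of_magnitude: float, growth_per_year: float) -> float:
    """Years needed to grow by 10**orders_of_magnitude at a constant yearly multiplier."""
    return orders_of_magnitude / math.log10(growth_per_year)

growth = 4.4  # compute growth per year, as quoted above

print(f"3.5 orders of magnitude: {years_to_scale(3.5, growth):.1f} years")
print(f"0.5 orders of magnitude: {years_to_scale(0.5, growth):.1f} years")
```

At that rate, half an order of magnitude is a bit under a year of scaling, and the remaining 3.5 orders would take roughly five and a half years.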

ben_w•1mo ago

I went from using ChatGPT 3.5 for functions and occasional scripts…

… to one of the models in Jan 2024 being able to repeatedly add features to the same single-page web app without corrupting its own work or hallucinating the APIs it had itself previously generated…

… to last month using a gifted free week of Claude Code to finish one project and then also have enough tokens left over to start another fresh project which, on that free left-over credit, reached a state that, while definitely not well engineered, was still better than some of the human-made pre-GenAI nonsense I've had to work with.

Wasn't 3 hours, and I won't be working on that thing more this month either because I am going to be doing intensive German language study with the goal of getting the language certificate I need for dual citizenship, but from the speed of work? 3 weeks to make a startup is already plausible.

I won't say that "software engineering" is dead. In a lot of cases however "writing code" is dead, and the job of the engineer should now be to do code review and to know what refactors to ask for.

bopbopbop7•1mo ago

So you did some basic web development and built a "not well engineered" greenfield app that you didn't ship, and from that your conclusion is that "writing code is dead"?

ben_w•1mo ago

In half a week with left-over credit.

What do you think the first half of the credit was spent on?

In addition to the other projects it finished off for me, the reason I say "coding is dead" is that even this mediocre quality code is already shippable. Customers do not give a toss if it has clean code or nicely refactored python backend, that kind of thing is a pain point purely for developers, and when the LLM is the developer then the LLM is the one who gets to be ordered to pay down the technical debt.

The other project (and a third one I might have done on a previous free trial) are as complete as I care to make them. They're "done" in a way I'm not used to being possible with manual coding, because LLMs can finish features faster than I can think of new useful features to add. The limiting factor is my ability to do code review, or would be if I got the more expensive option, as I was on a free trial I could do code review about twice as fast as I burned through tokens (given what others say about the more expensive option that either means I need to learn to code review faster, or my risk tolerance is lower than theirs).

Now, is my new 3-day web app a viable business idea? It would've been shippable as-is 5-6 years ago, I saw worse live around then. Today? Hard to say, if markets were efficient then everyone would know LLMs can create this kind of thing so easily and nobody could charge for them, but people like yourself who disbelieve are an example of markets not being efficient, people like you can have apps like these sold to them.

That said, I try not to look at where the ball is but where it is going. For business ideas, I have to figure out what *doesn't* scale, and do that. Coding *does* scale now, that's why coding is dead.

I expect to return to this project in a month. Have one of the LLMs expand it and develop it for more than the 3 days spent so far, turn it into something I'd actually be happy to sell. Like I said, it seems like we're at "3 weeks" not "3 hours" for a decent MVP by current standards, but the floor is rising fast.

wpietri•1mo ago

I suppose it's appropriate that you hallucinated an argument I did not make, attacked the straw man, and declared victory.

ben_w•1mo ago

Ironically, the human tendency to read far too much into things for which we have far too little data, does seem to still be one of the ways we (and all biological neural nets) are more sample-efficient than any machine learning.

I have no idea if those two points, ML and brains, are just different points on the same Pareto frontier of some useful metrics, but I am increasingly suspecting they might be.

Atomic_Torrfisk•1mo ago

> People are willing to pay $200 per month

Some people are of course, but how many?

> ... People are willing to pay $200 per month

This is just low-key hype. Careful with your portfolio...

HumblyTossed•1mo ago

It's a great tool, but right now it's only being used to feed the greed.

>> Again, I guess no one knew AI would be as big as it is today, and it is only just started.

People have been saying similar about self driving cars for years now. "AI" is another one of those expensive ideas that we'll get 85% of the way there and then to get the other 15% will be way more expensive than anyone will want to pay for. It's already happening - HW prices and electricity - people are starting to ask, "if I put more $ into this machine, when am I actually going to start getting money out?" The "true believers" are like, soon! But people are right to be hugely skeptical.

jliptzin•1mo ago

There are some things it's really great at. For example, handling a css layout. If we have to spend trillions of dollars and get nothing else out of it other than being able to vertically center a <div> without wrestling with css and wanting to smash the keyboard in the process, it will all have been worth it.

falkensmaize•1mo ago

Not to be cheeky, but isn’t this just

display: flex; align-items:center;

now?

aspenmartin•1mo ago

I agree -- skepticism is totally healthy. And there are so many great ways to poke holes in the true underlying narratives (not the headlines that people seem to pull from). E.g. evaluation science is a wasteland (not for want of very smart people trying very hard to get them right). How do we tackle the power requirements in a way that is sustainable? Etc. etc.

But stuff like this I'm not sure I understand:

> It's a great tool, but right now it's only being used to feed the greed.

if its a great tool, then how is it _only_ being used to "feed the greed" and what do you mean by that?

Also I think folks are quick to make analogies to other points in history: "AI is like the dot com boom we're going to crash and burn" and "AI is like {self driving cars, crypto, etc} and the promises will all be broken, it's all hype" but this removes the nuance: all of these things are extremely different with very specific dynamics that in _some_ ways may be similar but in many crucial and important ways are completely different.

HumblyTossed•1mo ago

>> if its a great tool, then how is it _only_ being used to "feed the greed" and what do you mean by that?

Look around?

aspenmartin•1mo ago

Very confused, I still don’t know what you mean at all

signatoremo•1mo ago

You meant someone using Claude Code is greedy?

belter•1mo ago

>> But it is hard to argue against the value of current AI, which many of the vocal critics on HN seems to have the opinion of.

What is the concrete business case? Can anyone point to a revenue producing company using AI in production, and where AI is a material driver of profits?

Tool vendors don’t count. I’m not interested in how much money is being made selling shovels...show me a miner who actually struck gold please.

tim333•1mo ago

A lot of programmers seem willing to pay for the likes of Claude Code, presumably because it helps them get more done. Programmers cost money so that's a potential cost saving?

layer8•1mo ago

> Every single part of the hardware stack are being fused with money and demand. The last time we have this was Post-PC / Smartphone era which drove the hardware industry forward for 10 - 15 years. The current AI can at least push hardware for another 5 - 6 years while pulling forward tech that was initially 8 - 10 years away.

It’s very unclear how much end-consumer hardware and DIY builders will benefit from that, as opposed to server-grade hardware that only makes sense for the enterprise market. It could have the opposite effect, like hardware manufacturers leaving the consumer market (as in the case of Micron), because there’s just not that much money in it.

mrheosuper•1mo ago

I'm not against AI/LLMs (in fact, I am quite supportive of them). But one of my biggest fears is overusing AI. We may introduce tools that only an AI/LLM can reasonably use (like tools with a weird, convoluted UI/UX or syntax), and no one will object because the AI/LLM can interact with them.

Then there's genAI: it's becoming more and more difficult to tell what is AI and what is not, and AI is everywhere. I don't know what to think about it. "If you can't tell, does it matter?"

netdur•1mo ago

i think the concern about software shifting toward ai design ignores that the web hasn't been human-first for a long time. most traffic is already machine to machine, like crawlers and ci pipelines. we’ve tolerated systems that are barely legible for years. anyone who has grepped through android studio logs knows that human readability is usually a tertiary goal at best. ai interacting with complex systems is just an evolution of the glue code we’ve always written.

as for who made it, utility usually matters more than where it came from. i used an agent for an oss changelog recently and it picked up things i’d forgotten while structuring the narrative better than i could. the intent and code were mine, but the ai acted as a high fidelity compressor. the risk isn't ai being everywhere. it’s the atrophy of judgment where we stop using it to support decisions and start using it to outsource thinking.

jama211•1mo ago

The difference between the performance of models between 2024 and 2025 has been so stark, that graph really shows it. There are still many people on these forums who seem to think AIs produce terrible code unless ultra supervised, and I can’t help but suspect some of them tried it a little while ago and just don’t understand how different it is now compared to even quite recently.

Madmallard•1mo ago

I used Gemini Pro and Claude Pro a couple of dozen times yesterday, and have been using them basically daily.

I have a project to convert my multiplayer XNA game from C# to Javascript and to add networking to the game-play using LLMs.

They are far worse at it now than they were a year ago. They actually implemented the requirements (Though inaccurately) to the best of their ability a year ago. Especially Gemini.

Now they don't even come remotely close to implementing just the basic requirements.

The thing is, I'm giving them the entirety of the C# source code and spelling out what they should do.

simonw•1mo ago

Weird. I would expect Gemini 3 Pro and Claude Opus 4.5 to run rings around Gemini 1.5 Pro and Claude Sonnet 3.5.

How are you running them - regular chat interface or do you have them setup with Claude Code or Gemini CLI?

Madmallard•1mo ago

Using the chat interface primarily with various prompting strategies.

I am considering making a thread where I compel others to attempt to get what I'm trying to get out of it and show me their work.

The game is only around 25000-30000 LOC in C#.

simonw•1mo ago

I'd be happy to join such a thread.

jennyholzer3•1mo ago

"They are far worse at it now than they were a year ago."

This is the part they REALLY don't want you to say.

They can no longer train these models effectively and their performance is slipping. Late 2023 was the golden age.

andai•1mo ago

Re: yolo mode

I looked into docker and then realized the problem I'm actually trying to solve was solved in like 1970 with users and permissions.

I just made an agent user limited to its own home folder, and added my user to its group. Then I run Claude Code etc as the agent user.

So it can only read write /home/agent, and it cannot read or write my files.

I add myself to agent group so I can read/write the agent files.

I run into permission issues sometimes but, it's pretty smooth for the most part.

Oh also I gave it root to a $3 VPS. It's so nice having a sysadmin! :) That part definitely feels a bit deviant though!

jillesvangurp•1mo ago

I use a qemu vm for running codex cli in yolo mode and use simple ssh based git operations for getting code in and out of there. Works great. And you can also do fun things like let it loose on multiple git projects in one prompt. The vm can run docker as well which helps with containerized tests and other more complicated things. One thing I've started to observe is that you spend more time waiting for tool execution than for model inference. So having a fast local vm is better than a slower remote one.

some_developer•1mo ago

Docker in docker, with opencode.

Opencode plus some scripts on host and in its container works well to run yolo and only see what it needs (via mounting). It has git tools but can't push etc., and it is taught how to run tests with the special container-in-container setup.

Including pre-configured MCPs, skills, etc.

The best part is that it just works for everyone on the team, big plus.

knicholes•1mo ago

cgroups and namespaces

staeff777•1mo ago

I really like this idea and just tried some steps for myself.

Create a user with a home dir:
sudo useradd -m agent

Add myself to the agent group:
sudo usermod -a -G agent $USER

Allow the agent group into the agent home dir:
sudo chmod -R 770 /home/agent

Start a new shell with the group (or log out and back in):
newgrp agent

Now you should be able to change into the agent home.

Allow your user to sudo as agent:
echo "$USER ALL=(agent) NOPASSWD: ALL" | sudo tee -a /etc/sudoers.d/$USER-as-agent

Now you can start your agent using sudo:
sudo -u agent your_agent

works nice.

andai•1mo ago

Re: yolo mode

https://markdownpastebin.com/?id=1ef97add6ba9404b900929ee195...

My notes from back when I set this up! Includes instructions for using a GUI file explorer as the agent user. As well as setting up a systemd service to fix the permissions automatically.

(And a nice trick which shows you which GUI apps are running as which user...)

However, most of these are just workarounds for the permission issue I kept running into, which is that Claude Code would for some reason create files with incorrect permissions so that I couldn't read or write those files from my normal account.

If someone knows how to fix that, or if someone at Anthropic is reading, then most of this Rube Goldberg machine becomes unnecessary :)

yupyupyups•1mo ago

Let's talk about the societal cost these models have had on us including their high energy cost and the proliferation of auto-generated slop media used to milk ad revenue, scam people, SEO farm, do propaganda or automate trolling. What about these big corporations collecting an astronomical amount of debt to hoard DRAM and NAND in a way that has crippled the PC market within weeks? And what are they going to do next, put a few dollars in Trump's pocket so that they can rob/loot the US population through bailouts? Who gets to keep all the hardware I wonder?

Nvidia, Samsung, SK Hynix and some other vultures I forgot to mention are making serious bank right now.

jennyholzer3•1mo ago

> Who gets to keep all the hardware I wonder?

Keep questions like this off of the propaganda thread.

ashishgupta2209•1mo ago

2026: The Year of Robots, note it for next year

fullstackchris•1mo ago

> The reason I think MCP may be a one-year wonder is the stratospheric growth of coding agents. It appears that the best possible tool for any situation is Bash—if your agent can run arbitrary shell commands, it can do anything that can be done by typing commands into a terminal.

I push back strongly against this. In the case of the solo, one-machine coder, this is likely the case - if you're exposing workflows or fixed tools to customers / colleagues / the web at large via API or similar, then MCP is still the best way to expose it IMO.

Think about a GitHub or Jira MCP server - with the command line alone, agents are sure to make mistakes with REST requests, API schemas, etc. With MCP the proper known commands are already baked in. Remember always that LLMs will be better with natural language than code.

simonw•1mo ago

The solution to that is Anthropic's Skills.

Create a folder called skills/how-to-use-jira

Add several Bash scripts with the right curl commands to perform specific actions

Add a SKILL.md file with some instructions on how to use those scripts

You've effectively flattened that MCP server into some Markdown and Bash, only the thing you have now is more flexible (the coding agent can adapt those examples to cover new things you hadn't thought to tell it) and much more context-efficient (it only reads the Markdown the first time you ask it to do something with JIRA).
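
A minimal sketch of that layout, written as a Python script that generates the files. The script name get-issue.sh, the JIRA_* environment variables, the endpoint path, and the wording of SKILL.md are all illustrative assumptions rather than anything taken from Anthropic's or Atlassian's documentation, so check them against your own setup:

```python
import tempfile
from pathlib import Path

# Hypothetical helper script: the endpoint path and the JIRA_* environment
# variables are assumptions; verify them against your own Jira instance.
GET_ISSUE_SH = """#!/usr/bin/env bash
# Usage: get-issue.sh ISSUE-KEY
set -euo pipefail
curl -sS -u "$JIRA_EMAIL:$JIRA_API_TOKEN" \\
  "$JIRA_BASE_URL/rest/api/3/issue/$1"
"""

# SKILL.md: a short front matter block, then plain-language instructions.
SKILL_MD = """---
name: how-to-use-jira
description: Scripts and notes for reading Jira issues from the shell.
---

Run scripts/get-issue.sh ISSUE-KEY to fetch one issue as JSON.
Adapt the curl command in that script for other endpoints.
"""

def write_skill(root: Path) -> Path:
    """Create skills/how-to-use-jira under root and return its path."""
    skill_dir = root / "skills" / "how-to-use-jira"
    scripts_dir = skill_dir / "scripts"
    scripts_dir.mkdir(parents=True, exist_ok=True)
    script = scripts_dir / "get-issue.sh"
    script.write_text(GET_ISSUE_SH)
    script.chmod(0o755)  # make the script executable
    (skill_dir / "SKILL.md").write_text(SKILL_MD)
    return skill_dir

if __name__ == "__main__":
    # Write into a throwaway directory and list what was created.
    with tempfile.TemporaryDirectory() as tmp:
        skill = write_skill(Path(tmp))
        for path in sorted(skill.rglob("*")):
            print(path.relative_to(tmp))
```

The agent only needs to read SKILL.md when a Jira task comes up, and it can copy and adapt the curl line for endpoints the scripts don't cover.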

aflukasz•1mo ago

But that moves the burden of maintenance from the provider of the service to its users (and/or partially to intermediary in form of "skills registry" of sorts, which apparently is a thing now).

So maybe a hybrid approach would make more sense? Something like /.well-known/skills/README.md exposed and owned by the providers?

That is assuming that the whole idea of "skills" makes sense in practice.

simonw•1mo ago

Yeah that's true, skill distribution isn't a solved problem yet - MCPs have a URL, which is a great way of making them available for people to start using without extra steps.

rr808•1mo ago

What happened to Devin? In 2024 it was a leading contender; now it isn't even included in the big list of coding agents.

fullstackchris•1mo ago

Wasn't it basically revealed as a scam? I remember some article about their fancy demo video being sped up / unfairly cut and sliced etc.

monkeydust•1mo ago

https://cognition.ai/blog/devin-annual-performance-review-20...

ColinEberhardt•1mo ago

It’s still around, and tends to be adopted by big enterprises. It’s generally a decent product, but is facing a lot of equally powerful competition and is very expensive.

simonw•1mo ago

To be honest that's more because I've never tried it myself, so it isn't really on my radar.

I don't hear much buzz about it from the people I pay attention to. I should still give it a go though.

Gud•1mo ago

What about self hosting?

simonw•1mo ago

I talked about that in this section https://simonwillison.net/2025/Dec/31/the-year-in-llms/#the-... - and touched on it a bit in the section about Chinese AI labs: https://simonwillison.net/2025/Dec/31/the-year-in-llms/#the-...

politelemon•1mo ago

> The problem is that the big cloud models got better too—including those open weight models that, while freely available, were far too large (100B+) to run on my laptop.

The actual, notable progress will be models that can run reasonably well on commodity, everyday hardware that the average user has. From more accessibility will come greater usefulness. Right now the way I see it, having to upgrade specs on a machine to run local models keeps it in a niche hobbyist bubble.

timonoko•1mo ago

OpenSCAD-coding has improved significantly on all models. Now syntax is always right and they understand the concept of negative space.

Only problem is that they don't see the connection between form and function. They may make a teapot perfectly but don't understand that this form is supposed to contain liquid.

mark_l_watson•1mo ago

Thanks Simon, great writeup.

It has been an amazing year, especially around tooling (search, code analysis, etc.) and surprisingly capable smaller models.

huqedato•1mo ago

I completely disagree with the idea that 2025 was "The (only?) year of MCP." In fact, I believe every year in the foreseeable future will belong to MCP. It is here to stay. MCP was the best (rational, scalable, predictable) thing since LLM madness broke loose.

mmcnl•1mo ago

Let's hope 2026 will also have interesting innovations not related to AI or LLMs.

spicyusername•1mo ago

2025 had plenty of those, they just didn't get as many news headlines.

One of the difficult things of modernity is that it's easy to confuse what you hear about a lot with what is real.

One of the great things about modernity is that progress continues, whether we know about it or not.

_pdp_•1mo ago

With everything that we have done so far (our company) I believe by end of 2026 our software will be self improving all the time.

And no it is not AI slop and we don't vibe code. There are a lot of practical aspects of running software and maintaining / improving code that can be done well with AI if you have the right setup. It is hard to formulate what "right" looks like at this stage as we are still iterating on this as well.

However, in our own experiments we can clearly see dramatic increases in automation. I mean we have agents working overnight as we sleep and this is not even pushing the limits. We are now wrapping major changes that will allow us to run AI agents all the time as long as we can afford them.

I can even see most of these materialising in Q1 2026.

Fun times.

papacj657•1mo ago

What exactly are your agents doing overnight? I often hear folks talk about their agents running for long periods of time but rarely talk about the outcomes they're driving from those agents.

_pdp_•1mo ago

We have a lot of grunt work scheduled overnight like finding bugs, creating tests where we don’t have good coverage or where we can improve, integrations, documentation work, etc.

Not everything gets accepted. There is a lot of work that is discarded and much more pending verification and acceptance.

Frankly, and I hope I don’t come across as alarmist (judge for yourself from my previous comments on HN and Reddit), we cannot keep up with the output! And a lot of it is actually good and we should incorporate it even partially.

At the moment we are figuring out how to make things more autonomous while we have the safety and guardrails in place.

The biggest issue I see at this stage is how to make sense of it all as I do not believe we have the understanding of what is happening - just the general notion of it.

I truly believe that we will reach the point where ideas matter more than execution, which is what I would expect to be the case with more advanced and better applied AI.

asgR1t•1mo ago

Most LLMs got worse in 2025. Only addicts and the type of computer gamer that feels drawn to complex setups, gamification and does not care about the end result will feel positive about the grift.

2025: The Year in Open Source? Nothing, all resources were tied up to debunk a couple of Python web developers who pose as the ultimate experts in LLMs.

simonw•1mo ago

In what way did they get worse?

I made you a dashboard of my 2025 writing about open-source that didn't include AI: https://simonwillison.net/dashboard/posts-with-tags-in-a-yea...

icapybara•1mo ago

It was the year of Claude Code

nativeit•1mo ago

Between the people with invested and/or conflicting interests, and the hordes of dogmatic zealots, I find discussions about AI to be the least productive or reliably informed on HN.

simonw•1mo ago

Honestly this thread was pretty disappointing. Many of the comments here could have been attached to any post about LLMs in the past year or so.

ck2•1mo ago

as I was clicking "gee I hope there's the year of pelicans riding bicycles"

left satisfied, lol

losvedir•1mo ago

I predict 2026 will be the year of the first AI Agent "worm" (or virus?). Kind of like the Morris worm running amok as an experiment gone wrong, I think we will sometime soon have someone set up an AI agent whose core loop is to try to propagate itself, either as an experiment or just for the lulz.

The actual Agent payload would be very small, likely just a few hundred line harness plus system prompt. It's just a question of whether the agent will be skilled enough to find vulnerabilities to propagate. The interesting thing about an AI worm is that it can use different tricks on different hosts as it explores its own environment.

If a pure agent worm isn't capable enough, I could see someone embedding it on top of a more traditional virus. The normal virus would propagate as usual, but it would also run an agent to explore the system for things to extract or attack, and to find easy additional targets on the same internal network.

A main difference here is that the agents have to call out to a big SotA model somewhere. I imagine the first worm will simply use Opus or ChatGPT with an acquired key, and part of it will be trying to identify (or generate) new keys as it spreads.

Ultimately, I think this worm will be shut down by the model vendor, but it will have to have made a big enough splash beforehand to catch their attention and create a team to identify and block keys making certain kinds of requests.

I'd hope OpenAI, Anthropic, etc. have a team and process in place already to identify suspicious keys, e.g., those used from a huge variety of IPs, but I wouldn't be surprised if this were low on their list of priorities (until something like this hits).

zerocool86•1mo ago

The "local models got good, but cloud models got even better" section nails the current paradox. Simon's observation that coding agents need reliable tool calling that local models can't yet deliver is accurate - but it frames the problem purely as a capability gap.

There's a philosophical angle being missed: do we actually want our coding agents making hundreds of tool calls through someone else's infrastructure? The more capable these systems become, the more intimate access they have to our codebases, credentials, and workflows. Every token of context we send to a frontier model is data we've permanently given up control of.

I've been working on something addressing this directly - LocalGhost.ai (https://www.localghost.ai/manifesto) - hardware designed around the premise that "sovereign AI" isn't just about capability parity but about the principle that your AI should be yours. The manifesto articulates why I think this matters beyond the technical arguments.

Simon mentions his next laptop will have 128GB RAM hoping 2026 models close the gap. I'm betting we'll need purpose-built local inference hardware that treats privacy as a first-class constraint, not an afterthought. The YOLO mode section and "normalization of deviance" concerns only strengthen this case - running agents in insecure ways becomes less terrifying when "insecure" means "my local machine" rather than "the cloud plus whoever's listening."

The capability gap will close. The trust gap won't unless we build for it.

tantony•1mo ago

Claude Opus 4.5 has been a big step up for me personally, and I used to think Sonnet 3.5 was good. It is an amazing deal at $20.

Just yesterday, it helped me parse out and understand a research paper, complete with step-by-step examples (this one: https://research.nvidia.com/sites/default/files/pubs/2016-03...). I will now go ahead and implement it myself, possibly relegating some of the more grunt-work type tasks to Claude Code.

Without it, I would have been struggling through the paper for days, wading through WGSL shader code, and there is a good chance I would have just given up, since this is for a side project and not my $job.

It has been a major force multiplier just for learning things. I have had the $20 subscription for about a year now. I bump it up to the $100 plan if I happen to be working on some project that eats through the $20 allocation. This happens to be one such month. I will probably go back to the $20 plan after this month. I continue to get a lot of value out of it.

rldjbpin•1mo ago

while most of the discourse is around text and (multimodal) LLMs, the past year has been quite interesting in other media as well. i suppose the "slop" section did hint at it briefly.

while LLM-generated text has already been around for a couple of years, this year images and videos had their "AI or not" moment. that appears to have had a bigger impact, reaching well beyond our myopic world of software. another trend towards the end of the year was around "vibe training" of new (albeit much smaller) AI models.

personally, getting up and running with a project has been easier than ever, but unlike OP, i don't share the same excitement to make anymore. perhaps vibe coding with a phone will get more streamlined with a killer app in 2026.

