The names Haiku, Sonnet, and Opus have not been chosen randomly.
I think it would easily have taken me 4+ hours to do that. It ran in 15 minutes while I played Kirby Air Riders and worked on the first try.
Afterward, I sort of had to reflect on the fact that I learned essentially nothing about building vector search. I wanted the feature more than I wanted to know how to build the feature. It kept me learning the thing I cared about rather than doing a side quest.
At best it plods along as you keep badgering Claude to fix it, until inevitably Claude reaches a point where it can't help. At which point you'll be forced to spend at least the 4 hours you would have originally spent trying to understand it so you can fix it yourself.
At worst the thing will actively break other things you do understand in ways you don't understand, and you'll have to spend at least 4 hours cleaning up the mess.
Either way it's not clear you've saved any time at all.
Perhaps not. If LLMs keep getting better, more competent models can help him stay on top of it lol.
My proto+sqlite+mesh project recently hit the point where it's too big for Claude to maintain a consistent "mental model" of how e.g. search and the db schemas are supposed to be structured. It kept taking hacky workarounds, like going directly to the db at the storage layer instead of through the API layer, etc., so I hit an insane amount of churn trying to get it to implement some of the features needed to get it production-ready.
Here's the whack-a-mole insanity documented in the git commit history: https://github.com/accretional/collector/compare/main...feat...
But now I know some new tricks and have better intuition for avoiding this situation going forward. Because I do understand the mental model behind what this is supposed to look like at its core, and I need to maintain some kind of human-friendly guard rails, I'm adding integration tests in a different repo and a README/project "constitution" that Claude can't change but is accountable for maintaining, and configuring it to keep them in context while working on my project.
Kind of a microcosm of startups' reluctance to institute employee handbooks/KPIs/PRDs, followed by resignation that they might truly be useful coordination tools.
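For the "constitution Claude can't change" part, the exact mechanics depend on your tooling. As a hypothetical sketch (the file name is a placeholder, and I'm assuming Claude Code's permission deny rules accept per-file Edit/Write patterns like its documented examples), something like this in .claude/settings.json would make the agent unable to rewrite its own guard rails:

    {
      "permissions": {
        "deny": [
          "Edit(CONSTITUTION.md)",
          "Write(CONSTITUTION.md)"
        ]
      }
    }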
Great for Opus because you’re now a captive customer.
I think that could work, but it can work in the same way that plenty of big companies have codebases that are a giant ball of mud and yet they somehow manage to stay in business and occasionally ship a new feature.
Meanwhile their rivals with well constructed codebases who can promptly ship features that work are able to run rings around them.
I expect that we'll learn over time that LLM-managed big ball of mud codebases are less valuable than LLM-managed high quality well architected long-term maintained codebases.
Honestly, I'm making stuff up, as I don't think it's feasible right now because of the context sizes. But given how fast things develop, maybe in a couple of years things might change.
AI is still incredibly useful used in tandem, but having it implement a full feature from one sentence usually leads to doom.
Opus/Anthropic is hands down the best in my experience. But using it feels like intellectual fast food (they all do). I hate the fact that I can build something like a neatly presentable one-off SPA tool (ty Simon) when I'm barely paying attention. It feels unsatisfying to use.
EDIT: because I'm rambling, I like "AI" as much as the next guy, probably more because I was there before it turned into LLMs"R"US, but I also like(d) the practice of sitting around listening to music solving problems with Scala. I don't know why we've decided to make work less fun...
There are just too many parts involved to do anything. For example, today I built a simple data collection app to use on my phone that involves inventories with photos, for a tedious workflow I have to do. I knew what I wanted but didn't know how to even choose which tools to bother learning. And just being able to try things to see whether an approach works, without spending hours learning one thing or another or wading through the hell of web search, is really great.
Things I learned today that I figure everyone else must know: if you want to take a photo from a webapp, I guess you need https. So I decided to try mTLS (knew it existed but never had the time) and asked Claude to write me a short tutorial about setting it up, creating keys, and importing them (including a cool single-line trick of spinning up a Python server and downloading the keys on my phone rather than finding a USB stick or whatever). And then helping me figure out a path out of the suffering of Chrome and Firefox hating a self-signed CA. At least I figured out how to make Firefox happy, but it would insist on prompting me for the certificate for every htmx request.

Chatting with Claude, I learn Caddy is pretty cool; it's Go. Claude suggests an auth boxcar (wtf is a boxcar? Claude clues me in). We build one in Go because I don't want to mix auth into my app. (Incidentally I can't write Go, but I can read it, and Go seems safer than a pile of Python for this simple thing.) The boxcar was fine, but Claude was struggling with getting headers to work. So while Claude is working on that, I do a quick Google about whether Caddy can have extensions. Interrupt Claude and suggest an extension instead of a boxcar. Claude's on board, so we ditch the boxcar. Have Claude and Codex evaluate the extension for security. They find important issues about things a jerk might do; fix them. So successful mTLS connections transition to session cookies, and my dumb CRUD tool doesn't have to worry about auth. Which it didn't have to do anyway, except browsers say so, etc., because my phone is literally only able to access the server via VPN anyway.
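To make that last design concrete: here's a minimal, hypothetical sketch of the "verified client cert becomes a session cookie" idea in plain Go net/http terms. It is not the actual Caddy extension (in the setup above Caddy terminates TLS, so the check lives there), and a production version would track issued tokens server-side:

    package main

    import (
        "crypto/rand"
        "encoding/hex"
        "fmt"
        "net/http"
    )

    // withSession sketches the "mTLS -> session cookie" idea: if the TLS
    // layer already verified a client certificate, mint a random session
    // cookie so the app behind it never has to think about auth itself.
    func withSession(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            if _, err := r.Cookie("session"); err != nil {
                // r.TLS is non-nil only on TLS connections; PeerCertificates
                // is populated once the client cert has been verified.
                if r.TLS == nil || len(r.TLS.PeerCertificates) == 0 {
                    http.Error(w, "client certificate required", http.StatusUnauthorized)
                    return
                }
                token := make([]byte, 32)
                rand.Read(token) // crypto/rand
                // A real version would remember this token server-side and
                // check it on later requests, not trust any cookie it sees.
                http.SetCookie(w, &http.Cookie{
                    Name:     "session",
                    Value:    hex.EncodeToString(token),
                    Path:     "/",
                    Secure:   true,
                    HttpOnly: true,
                    SameSite: http.SameSiteStrictMode,
                })
            }
            next.ServeHTTP(w, r)
        })
    }

    func main() {
        app := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            fmt.Fprintln(w, "hello from the CRUD tool") // app stays auth-free
        })
        // TLS server config (client CA, RequireAndVerifyClientCert) omitted;
        // in the setup described above, Caddy enforces mTLS instead.
        http.ListenAndServe(":8080", withSession(app))
    }

The nice property is exactly the one described: the CRUD app behind it stays completely auth-free.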
Other things I learned today that only wasted 5 minutes of Claude's time rather than hours of mine: Firefox camera access can't control flash, focus, or zoom. So call out to the native app instead.
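(For anyone else who hits this: the usual way to "call out to the native app" from a web page is the file input's capture attribute, which hands off to the phone's camera app and returns the photo as an ordinary file upload.)

    <!-- opens the native camera app on most mobile browsers -->
    <input type="file" accept="image/*" capture="environment">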
This is all quite fun and the tool I'm building is going to really make my own life better.
I mean will you (we) retain all that it did after a few months go by? You may say we don't need to, but that sounds a little shallow given we're both on HN. Do you remember Gatsby's criticism of "Summer People"?
You could spend 4 hours (that you don't have) building that feature. Or... you could have the coding agent build it in the background for you in 15 minutes, then spend 30 minutes reading through what it did, tweaking it yourself and peppering it with questions about how it all works.
My hunch is that the 30 minutes of focused learning spent with a custom-built version that solves your exact problem is as effective as (or even more effective than) four hours spent mostly struggling to get something up and running and going down various rabbit holes of unrelated problem-solving.
Especially if realistically you were never going to carve out those four hours anyway.
Of course, this kind of interactive deep engagement with a topic is fast becoming obsolete. But the essence of "knowing", to me, is about doing and experiencing things, updating my Bayesian priors dialectically (to put it fancily).
I don't think that's incompatible with getting help from LLMs. I find that LLMs let me try so much more stuff, and at such a faster rate, that my learning pace has accelerated in a material way.
Something I'm really interested in right now is the balance in terms of the struggle required to learn something.
I firmly believe that there are things where the only way to learn how to do them is to go through the struggle. Writing essays, for example - I don't think you can shortcut learning to write well by having an LLM do that for you, even though actually learning to write is a painful and tiresome process.
But programming... I've seen so many people who quit learning to program because the struggle was too much. Those first six months of struggling with missing semicolons are absolutely miserable!
I've spoken to a ton of people over the past year who always wanted to learn to program but never managed to carve out that miserable six months... and now they're building software, because LLMs have shaved down that learning curve.
LLMs can hurt less experienced engineers by keeping them from building an intuition for why things work a certain way, or why an alternative won't work (or conversely, why an unconventional approach might not only be possible, but very useful and valuable!).
I think problem solving is optimization in the face of constraints. Generally using LLMs IME, the more you're able to articulate and understand your constraints, and prescriptively guide the LLM towards something it's capable of doing, the more effective they are and the more maintainable their output is for you. So it really helps to know when to break the rules or to create/do something unconventional.
Another way to put it is that LLMs have commodified conventional software, so learning when to break or challenge convention is going to be where most of the valuable work is going forward. And I think it's hard to actually do that unless you get into the weeds and battle/try things because you don't understand why they won't work. Sometimes they do.
What I don't believe is that it HAS to be like this. Maybe it's my natural optimism showing through here, but I'm confident it's possible to accelerate rather than slow down your learning progress with LLMs, if you're thoughtful about how you apply them.
An open question for me is how feasible it is to teach people how to teach themselves effectively using this new technology.
I have a core belief that everything is learnable, if people are motivated to learn. I have no idea how to help instill that motivation in people who don't yet have it though!
It's pretty amazing to watch tools-in-a-loop crunch away for >4 hours to solve a generally difficult problem through sheer brute-force.
It's interesting because, like the article notes, AI is really smashing benchmarks, but actual usefulness in automation of thought work is proving much more elusive. I think that collective experience of AI just not being that useful, or as useful as the benchmarks suggest it should be, is captured in this metric.
This is of course quite highly correlated with an AI system being able to churn through a task for a long time. But it's not necessarily the same thing.
Of course the big questions are going to arise if/when we start passing milestones like 8 hours (a whole work day) or 40 hours (a whole work week).
"Human hours equivalent" itself is an interesting metric, because: which human? Or rather, I'm sure they had a coherent definition in mind--probably a human reasonably competent at whatever the specific task is. But hours the abstract human standard would spend on a given task is very different from the hours any particular person like you or I would spend.
Some of the appeal (and risk!!) of these things is specifically that you can ask for help with things that would be quick work for someone (who knows jq, or a certain corner of the PyPI library ecosystem, or modern CSS, or TypeScript annotations, or something else) but not for you.
I cannot imagine GPT-5.2 working on a task for more than 2 minutes, let alone 4 hours. I’m curious if you’ve run into this and figured out a way around it?
And rarely is software one-and-done; after a few rounds like this, the architecture becomes schizophrenic. Combating this tendency usually requires a lot of the work from these "long tasks" to be thrown away, and more closely limiting what the AI is trying to do as it happens. The success of one "long task" is not necessarily a good thing!
If true, how much of this is a result of:
1. Genuine technical advancement
or:
2. Shoveling trillions of dollars into compute resources in order to service incoming LLM requests in a way that is completely unrealistic over the long term?
In other words… are we talking about genuine, sustainable innovation that we get to take with us moving forward and benefit from? Or are we talking about an “improvement” that is more akin to a mirage that will eventually disappear when the Ponzi scheme eventually collapses?
It matches my personal feeling when using progressively better models over time.