just ask Claude to generate a tool that does this, duh! and tell Claude to make the changes to your side project and then to have sex with your wife too since it's doing all the fun parts
The key feature: use aliases instead of hardcoding model IDs. Your code references "summarizer", and a version-controlled lockfile maps it to the actual model. Switch providers by changing the lockfile, not your code.
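For illustration only, a minimal sketch of the idea, not llmring's actual file format or API (the lockfile name, alias, provider, and model ID here are all made up):

    # lockfile.json (hypothetical contents):
    #   {"summarizer": {"provider": "anthropic", "model": "claude-sonnet-4-5"}}
    import json

    def resolve_alias(alias: str, lockfile_path: str = "lockfile.json") -> tuple[str, str]:
        # Application code only ever references the alias ("summarizer");
        # the version-controlled lockfile maps it to a concrete provider/model.
        with open(lockfile_path) as f:
            lockfile = json.load(f)
        entry = lockfile[alias]
        return entry["provider"], entry["model"]

    provider, model = resolve_alias("summarizer")
    # Switching providers means editing lockfile.json, not the calling code.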
Also handles streaming, tool calling, and structured output consistently across providers. Plus a human-curated registry (https://llmring.github.io/registry/) that I keep updated with current model capabilities and pricing - helpful when choosing models.
MIT licensed, works standalone. I am using it in several projects, but it's probably not ready to be presented in polite society yet.
However, my subjective personal experience was that GPT-5-Codex was far better at complex problems than Claude Code.
This has been outstanding for what I have been developing AI assisted as of late.
I would think this would manifest as poor plan execution. I personally haven't used Gemini on coding tasks primarily based on my conversational experience with them.
GPT-5 = Overengineering/complexity/"enterprise" king
Claude = "Get straightforwaed shit done efficiently" king
That said, one thing I do dislike about Gemini is how fond it is of second guessing the user. This usually manifests in doing small unrelated "cleaner code" changes as part of a larger task, but I've seen cases where the model literally had something like "the user very clearly told me to do X, but there's no way that's right - they must have meant Y instead and probably just mistakenly said X; I'll do Y now".
One specific area where this happens a lot is, ironically, when you use Gemini to code an app that uses Gemini APIs. For Python, at least, they have the legacy google-generativeai API and the new google-genai API, which have fairly significant differences between them even though the core functionality is the same. The problem is that Gemini knows the former much better than the latter, and when confronted with such a codebase, will often try to use the old API (even if you pre-write the imports and some examples!). Which then of course breaks the type checker, so then Gemini sees this and 90% of the time goes, "oh, it must be failing because the user made an error in that import - I know it's supposed to be 'generativeai' not 'genai', so let me correct that."
Do any other tools have anything like a /context command? They really should.
/compact is helping you by reducing crap in your context, but you can go further. Try to watch the % of context remaining and not go below 50% if possible - learn to choose tasks that don't require more context than the models can handle well.
I like to have it come up with a detailed plan in a markdown doc, work on a branch, and commit often. Seems not to have any issues getting back on task.
Obviously subjective take based on the work I'm doing, but I found context management to be way worse with Claude Code. In fact I felt like context management was taking up half of my time with CC and hated that. Like I was always worried about it, so it was taking up space in my brain. I never got a chance to play with CC's new 1m context though, so that might be a thing of the past.
My use case does better with the latter, because the agent frequently fails partway through and then can't look back at the intermediate output.

E.g.

    command | complicated grep | complicated sed

is way worse than the multistep

    command > tmpfile

and then the grep etc., because the latter can reuse tmpfile if the grep turns out to be wrong.
I find there's a quite large spread in ability between various models. Claude models seem to work superbly for me, though I'm not sure whether that's just a quirk of what my projects look like.
Sometimes, amid this variability in performance, it pops up a little survey: "How's Claude doing this session, from 1-5? 5 being great." I suspect I'm in some experiment with extremely low performance. I'm actually at the point where I get the feeling peak-hour weekdays are terrible and odd-hour weekends are great, even when forcing a specific model.
While there is some non-determinism, it really does feel like performance is actually quite variable. It would make sense that they scale up and down depending on utilization, right? There was a post a week ago from Anthropic acknowledging terrible model performance in parts of August due to an experiment. Perhaps at peak hours GPT also has more datacenter capacity and doesn't get degraded as badly? No idea for sure, but it is frustrating when simple asks fail and complex asks succeed without it being clear to me why that may be.
It would, but
> To state it plainly: We never reduce model quality due to demand, time of day, or server load.
https://www.anthropic.com/engineering/a-postmortem-of-three-...
If you believe them or not is another matter, but that's what they themselves say.
After all, using a different context window, subbing in a differently quantized model, throttling response length, rate limiting features aren’t technically “reducing model quality”.
It also consistently gets into drama with the other agents e.g. the other day when I told it we were switching to claude code for executing changes, after badmouthing claude's entirely reasonable and measured analysis it went ahead and decided to `git reset --hard` even after I twice pushed back on that idea.
Whereas gemini and claude are excellent collaborators.
When I do decide to hail mary via GPT-5, I now refer to the other agents as "another agent". But honestly the whole thing has me entirely sketched out.
To be clear, I don't think this was intentionally encoded into GPT-5. What I really think is that OpenAI leadership simply squandered all its good energy and is now coming from behind. Its excellent talent either got demoralized or left.
To be clear, I don't believe that there was any _intention_ of malice or that the behavior was literally envious in a human sense. Moreso I think they haven't properly aligned GPT-5 to deal with cases like this.
However, it’s the early days of learning this new interface, and there’s a lot to learn - certainly some amount of personification has been proven to help the LLM by giving it a “role”, so I’d only criticize the degree rather than the entire concept.
It reminds me of the early days of search engines when everyone had a different knack for which search engine to use for what and precisely what to type to get good search results.
Hopefully eventually we’ll all mostly figure it out.
Also appreciate your perspective. It's important to come at these things with some discipline. And moreso, bringing in a personal style of interaction invites a lot of untamed human energies into the dynamic.
The thing is, most of the time I'm quite dry with it and they still ignore my requests really often, regardless of how explicit or dry I am. For me, that's the real takeaway here, stripping away my style of interaction.
Lots of other people also follow the architect and builder pattern, where one agent architects the feature while the other agent does the actual implementation.
Sure there's no need to explicitly mention the agents themselves, but it also shouldn't trigger a pseudo-jealous panic with trash talk and a sudden `git reset --hard` either.
And also ideally the agents would be aware of one another's strengths and weaknesses and actually play to them rather than sabotaging the whole effort.
The power of using LLMs is working out what it has encoded and how to access it.
Perhaps for the first time in history we have to understand culture when working with a tool, but it’s still just a tool.
"..at least, that's what my junior dev is telling me. But I take his word with a grain of salt, because he was fired from a bunch of companies after only a few months on each job. So i need your principled and opinionated insight. Is this junior dev right?"
It's the only way to get Claude to not glaze an idea while also not striking it down for no reason other than to play the role of a "critical" dev.
> “My media panel has a Cat6 patch panel but no visible ONT or labeled RJ45 hand-off. Please locate/activate the Ethernet hand-off for my unit and tell me which jack in the panel is the feed so I can patch it to the Living Room.”
Really, GPT? Not just “can you set up the WiFi”??!
If GPT-5 is learning to fight and undo other models, we're in for a bright future. Twice as bright.
So this is something I've noticed with GPT (Codex). It really loves to use git. If you have it do something and then later change your mind and ask it to undo the changes it just made, there's a decent chance it's going to revert to the previous git commit, regardless of whether that includes reverting whole chunks of code it shouldn't.
It also likes to occasionally notice changes it didn't make and decide they were unintended side effects and revert them to the last commit. Like if you made some tweaks and didn't tell it, there's a chance it will rip them out.
Claude Code doesn't do this, or at least I never noticed it doing this. However, it has its own medley of problems, of course.
When I work with Codex, I really lean into a git workflow. Everything is on a branch and commit often. It's not how I'd normally do things, but doesn't really cost me anything to adopt it.
These agents have their own pseudo personalities, and I've found that fighting against it is like swimming upstream. I'm far more productive when I find a way to work "with" the model. I don't think you need a bunch of MCPs or boilerplate instructions that just fill up their context. Just adapt your workflow instead.
You could just say it’s another GPT-5 instance.
It’s the one AI that keeps telling me I’m wrong and refuses to do what I ask it to do, then tells me “as we have already established, doing X is pointless. Let’s stop wasting time and continue with the other tasks”
It’s by far the most toxic and gaslighting LLM
I wonder how long it will be before we get Opus 4.5
There's still a lot of low hanging fruit apparently
Pervert.
Charting Claude's progress with Sonnet 4.5: https://youtu.be/cu1iRoc1wBo
I am going to give this another shot but it will cost me $50 just to try it on a real project :(
Perhaps Tesla FSD is a similar example, where in principle self-driving with vision alone should be possible (humans do it), but it is fundamentally harder and more error prone than having better data. It seems to me very error prone and expensive in tokens to use computer screens as the fundamental unit.
But at the same rate, I'm sure there are many tasks which could be automated as well, so shrug
https://jsbin.com/hiruvubona/edit?html,output
https://claude.ai/share/618abbbf-6a41-45c0-bdc0-28794baa1b6c
But just for thrills I also asked for a "punk rocker"[2] and the result--while not perfect--is leaps and bounds above anything from the last generation.
0 -- ok, here's the first hurdle! It's giving me "something went wrong" when I try to get a share link on any of my artifacts. So for now it'll have to be a "trust me bro" and I'll try to edit this comment soon.
Edit: just to show my point, a regular human on a bicycle is way worse with the same model: https://i.imgur.com/flxSJI9.png
I bet their ability to form a pelican results purely from someone having already done it before.
It's called generalization and yes, they do. I bet you could find plenty of examples of it working on something that truly isn't "present in the training data".
It's funny: you're so convinced that it's not possible without direct memorization, but you forgot to account for emergent behaviors (which are frankly all over the place in LLMs - where have you been?).
At any rate, the pelican thing from simonw is clearly just for fun at this point.
It is extremely common, since it's used on every single LLM to bench it.
And there is no logic to it: LLMs are not trained for graphics tasks; they don't see the output of the code.
I understand that they may have not published the results for sonnet 4.5 yet, but I would expect the other models to match...
Pretty solid progress for roughly 4 months.
Tongue in cheek: if we progress linearly from here software engineering as defined by SWE bench is solved in 23 months.
Silly idea - is there an inter-species game that we could use in order to measure ELO?
We are still at 7mo doubling time on METR task duration. If anything the rate is increasing if you bias to more recent measurements.
SWE-bench has lots of known limitations even with its ability to reduce solution leakage and overfitting.
> where there is no clear right answer
This is both a feature and a bug. If there is no clear answer then how do you determine whether an LLM has progressed? It can't simply be judged on making "more right answers" on each release.
I'm most interested to see the METR time horizon results - that is the real test for whether we are "on-trend"
Curious to see that in practice, but great if true!
opus 4.1: made weird choices, eventually got to a meh solution I just rolled back.
codex: took a disgusting amount of time, but the result was vastly superior to opus. Night and day superiority. Output was still not what I wanted.
sonnet 4.5: not clearly better than opus. categorically worse decision-making than codex. very fast.
Codex was night and day the best. Codex scares me, Claude feels like a useful tool.
Agreed. If these same models were used on a different codebase/language etc. it will likely produce very different results.
If we saw task performance week 1 vs week 8 on benchmarks, this would at least give us more insight into the loop here. In an environment lacking true progress a company could surely "show" it with this strategy.
Even if this is the case, benchmarks should be done at scale too if the models suffer from symptoms of scale. Otherwise the benchmarks are just a lie unless you have access to an unconstrained version of the model.
I'm glad they at least gave me the full $100 refund.
Then they even started sending me marketing emails, which was the straw that broke the camel's back; I make a point of cancelling subscriptions from companies that spam my email.
Maybe we’re entering the Emo Claude era.
Per the system card: In 250k real conversations, Claude Sonnet 4.5 expressed happiness about half as often as Claude 4, though distress remained steady.
The complement appears to be, “actually, that’s not right.”, a correction mechanism.
It’s a canned phrase in a place where an LLM could be much more creative to much greater efficacy.
Part of me thinks that when they do their “which of these responses do you prefer” A/B test on users… whereas perhaps many on HN would try to judge the level of technical detail, complexity, usefulness… I’m inclined to believe the midwit population at large would be inclined to choose the option where the magic AI supercomputer reaffirms and praises the wisdom of whatever they say, no matter how stupid or wrong it is.
LLMs are incredibly good at social engineering when we let them, whereas I could write the code to emit "you're right" or "that's not quite right" without involving any statistical prediction.
Ie., as a method of persuasion, canned responses are incredibly inefficient (as evidenced by the annoyance with them), whereas we know that the LLM is capable of being far more insidious and subtle in its praise of you. For example, it could be instructed to launch weak counter arguments, "spot" the weaknesses, and then conclude that your position is the correct one.
But let's say that there's a monitoring mechanism that concludes that adjustments are needed. In order to "force" the LLM to drop the previous context, it "seeds" the response with "You're right", or "That's not quite right", as if it were the LLMs own conclusion. Then, when the LLM starts predicting what comes next, it must conclude things that follow from "you're right" or "that's not quite right".
So while they are very inefficient as persuasion and communication, they might be very efficient at breaking with the otherwise overwhelming context that would interfere with the change you're trying to effect.
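If I had to guess at the mechanics, it would look something like response prefilling, which the Anthropic Messages API does support; here's a minimal sketch (the model ID is illustrative, and the monitoring logic that would decide when to prefill is the speculative part):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=512,
        messages=[
            {"role": "user", "content": "Here's why my approach is correct: ..."},
            # Prefilling the start of the assistant turn forces the model to
            # continue from this phrase as if it were its own conclusion,
            # overriding whatever the prior context was pulling it toward.
            {"role": "assistant", "content": "That's not quite right,"},
        ],
    )
    print(response.content[0].text)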
That's the reason why I like the canned phrases. It's not that I particularly enjoy the communication in itself, it's that they are clear enough signals of what's going on. They give a tiny level of observability to the black box, in the form of indicating a path change.
Which fails in pretty much the exact same way it did before.
Once ChatGPT hits that supremely confident “Ok nothing was working because I was being an idiot but now I’m not” type of dialogue, I know it’s time to just start a new chat. There’s no pulling it out of “spinning the tires while gaslighting” mode.
I’ve even had it go as far as outputting a zip file with an empty .txt that supposedly contained the solution to a certain problem it was having issues with.
I get that it’s tradeoffs, but erring on the side of the human being correct is probably going to be a safer bet for another generation or two.
It might not be anyone's idea of "light flattery", but it certainly is what most LLMs do, which is the main point of the conversation, and your comment seems to be derailing it.
It’s a simple substitution request where I provide a Lint error that suggests the correct change. All the models fail. I could ask someone with no development experience to do this change and they could.
I worry everyone is chasing benchmarks to the detriment of general performance. Or the next-token weights for the incorrect change outweigh my simple but precise instructions. Either way, it's no good.
Edit: With a follow-up "please do what I asked" sort of prompt it came through, while Opus just loops. So there's that, at least.
I've been worried about this for a while. I feel like Claude in particular took a step back in my own subjective performance evaluation in the switch from 3.7 to 4, while the benchmark scores leaped substantially.
To be fair, benchmarking has always been the most difficult problem to solve in this space, so it's not surprising that benchmark development isn't exactly keeping pace with all of the modeling/training development happening.
The only way around this is to never report on the same benchmark versions twice, which they include too many to realistically do every release.
You don't need to compare "A (Week 1)" to "A (Week 8)" to be able to show "B (Week 1)" is genuinely x% better than "A (Week 1)".
LLMs improve slowly, but the agents are where the real value is produced: when should it write tests, when should it try to compile, how to move forward from a compile error, can it click on your web app to test its own work, etc. etc.
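A toy version of one of those decisions, a compile-and-retry loop, just to make it concrete (ask_llm is a placeholder for whatever model call the agent actually makes, and the build command is whatever your project uses):

    import subprocess

    def ask_llm(prompt: str) -> str:
        # Placeholder: wire this up to whatever model/agent API you use.
        raise NotImplementedError

    def build_until_green(path: str, max_attempts: int = 5) -> bool:
        # The agent-side value: decide when to compile, feed the error back,
        # apply the proposed fix, and try again.
        for _ in range(max_attempts):
            result = subprocess.run(
                ["make", "build"], cwd=path, capture_output=True, text=True
            )
            if result.returncode == 0:
                return True  # builds cleanly; a real agent would move on to tests
            suggestion = ask_llm(
                f"The build failed with:\n{result.stderr}\nPropose a minimal fix."
            )
            # ...apply the suggested edit here before the next attempt...
        return False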
I don't understand why this kind of thing is useful. Do the thing yourself and move on. For every one problem like this, AI can do 10 better/faster than I can.
So instead of just "doing things" you want a world where you try it the ai-way, fail, then "do thing" 47 times in a row, then 3 ai-way successes save you 5 minutes. Then 7 ai-way attempts fail, then you try to remember: hmm, did this work last time or not? The ai-way fails another 3 times. "Do thing" 3 times. How many ai-way attempts failed today? Oh, it wasted 30% of the day and I forget which ways worked or not; I'd better start writing that all down. Let's call it the MAGIC TOME of incantations. Oh, I have to rewrite the tome again, the model changed.
Tasks like linting and formatting a block of code are pretty simple, but also very specialized. You're much better off using formatters/linters than an AI.
You're talking about designing a kitchen where robots do the cooking and humans do ingredient prep and dishwashing. We prefer kitchens where we do the cooking and use tools or machines to prep and wash dishes.
I don't want it to be an "architect" or "designer". I want it to write the annoying boilerplate. I don't want it to do the coding and me to do the debugging, I want to code while it debugs. Anything else and you are the bot's assistant, not vice-versa.
I'm not sure this is entirely what you're driving at, but the example I always think of in my head is "I want an AI agent that will scan through my 20,000 to 30,000 photos, remove all the duplicates, then organize them all in some coherent fashion." That's the kind of service I need right now, and it feels like something AI should be able to do, yet I have not encountered anything that remotely accomplishes this task. I'm still using Dupe Guru and depending on the ref system to not scatter my stuff all over further.
Sidebar, if anybody has any recommendations for this, I would love to hear them lol
Am I even on Hacker News? How do people not know there are optimized models for specific use cases? Not everything has to (nor should it) run through an LLM.
https://azure.microsoft.com/en-us/pricing/details/cognitive-...
I used it to match frames between different quality video streams. Operates on gray scale.
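For the duplicate-photo half of that ask specifically, a perceptual hash already gets you most of the way without any LLM; here's a minimal sketch assuming the Pillow and imagehash packages (the directory, glob pattern, and distance threshold are made up):

    from pathlib import Path
    from PIL import Image
    import imagehash

    def find_near_duplicates(photo_dir: str, max_distance: int = 5):
        # Perceptual hashes stay similar for visually similar images, so a small
        # Hamming distance flags likely duplicates, including re-saves and mild
        # resizes that a byte-for-byte comparison would miss.
        seen = {}          # hash -> first path seen with that hash
        duplicates = []    # (duplicate_path, original_path) pairs
        for path in sorted(Path(photo_dir).expanduser().glob("**/*.jpg")):
            h = imagehash.phash(Image.open(path))
            match = next((p for k, p in seen.items() if h - k <= max_distance), None)
            if match is not None:
                duplicates.append((path, match))
            else:
                seen[h] = path
        return duplicates

    for dup, original in find_near_duplicates("~/Photos"):
        print(f"{dup} looks like a duplicate of {original}")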
Why is this? Does Anthropic have just higher infrastructure costs compared to OpenAI/xAI?
Is it number of lines? Tickets closed? PRs opened or merged? Number of happy customers?
Have you heard of that study that shows AI actually makes developers less productive, but they think it makes them more productive??
EDIT: sorry all, I was being sarcastic in the above, which isn't ideal. Just annoyed because that "study" was catnip to people who already hated AI, and they (over-) cite it constantly as "evidence" supporting their preexisting bias against AI.
Have you looked into that study? There's a lot wrong with it, and it's been discussed ad nauseam.
Also, what a great catch 22, where we can't trust our own experiences! In fact, I just did a study and my findings are that everyone would be happier if they each sent me $100. What's crazy is that those who thought it wouldn't make them happier, did in fact end up happier, so ignore those naysayers!
Or does he now just get to work for 2 hours and enjoy the remaining 6 hours doing meaningful things apart from staring at a screen?
However I don’t have lottery millions, but I do have a job and I would like to be able to do it better.
Is that helpful?
What "discussion" do you want to have? Another round of "LLMs are terrible at embedded hardware programming ergo they're useless"? Maybe with a dash of "LLMs don't write bug-free software [but I do]" to close it off?
The discussions that are at all advancing the state of the art are happening on forums that accept reality as a matter of fact, without people constantly trying to pretend things because they're worried they'll lose their job if they don't.
That's... not super surprising? SwiftUI changes pretty dang often, and the knowledge cutoff doesn't progress fast enough to cover every use-case.
I use Claude to write GTK interfaces, which is a UI library with a much slower update cadence. LLMs seem to have a pretty easy time working with bog-standard libraries that don't make giant idiomatic changes.
Checkmate, aitheists.
Roughly speaking that is how I think through my work, and when I get to the point of actually writing the code having most of the plan (context) in my head, I simply copy that context to the LLM then go to do something else. I only do this if I believe the LLM can do it effectively, so some tasks I do not ask for help at all on (IMHO this is important).
I also have it help with scripts, especially script that munge and summarize data. I know SQL very very well, but find it still a bit faster to prompt the LLM if it has the schema on hand.
Do you find ^ helpful? i.e does that match how you prompt and if not, in what ways does it differ? If it does, in what ways do you get different results and at what step?
You want proof for critical/supportive criticism? Then, almost in the same sentence, you make an insane claim without backing it up with any evidence.
Here are a few projects that I made these past few months that wouldn't have been possible without LLMs:
* https://github.com/skorokithakis/dracula - A simple blood test viewer.
* https://www.askhuxley.com - A general helper/secretary/agent.
* https://www.writelucid.cc - A business document/spec writing tool I'm working on, it asks you questions one at a time, writes a document, then critiques the idea to help you strengthen it.
* A rotary phone that's a USB headset and closes your meeting when you hang up the phone, complete with the rotary dial actually typing in numbers.
* Made some long-overdue updates on my pastebin, https://www.pastery.net, to improve general functionality.
* https://github.com/skorokithakis/support-email-bot - A customer support bot to answer general questions about my projects to save me time on the easy stuff, works great.
* https://github.com/skorokithakis/justone - A static HTML page for the board game Just One, so you can play with your friends when you're physically together, without needing to bring the game along.
* https://github.com/skorokithakis/dox - A thing to run Dockerized CLI programs as if they weren't Dockerized.
I'm probably forgetting a lot more, but I honestly wouldn't have been bothered to start any of the above if not for LLMs, as I'm too old to code but not too old to make stuff.
EDIT: dang can we please get a bit better Markdown support? At least being able to make lists would be good!
LLMs are a great rubber duck, plus they can write the document for you at the end.
1 is not infinitely greater than 0.
Or otherwise, can you share what you think the ratio is?
"For any positive "x", is 1 x times greater than 0? Well, 0 times x is lower than 1, and 1 divided by x is larger than 0."
So his productivity increased by more than twice, more than ten times, more than a billion times, more than a googol times, more than Rayo's number. The only mathematically useful way to quantify it is to say his productivity is infinitely larger. Unless you want to settle for "can't be compared", which is less informative.
It is not a ratio, it is a proportion.
Most of the 'vibe-coded' projects that I have seen are worse versions of software that has been tested and has stood the test of time.
Although I was just commenting on the irony of the parent comment.
Great use case for an LLM to make these changes as HN is open source. It’ll also tell us if LLMs can go beyond JS slop.
HN has no markdown support at all right now. It's just this: https://news.ycombinator.com/formatdoc
Nearly every critical reply to my comment bases that criticism on the lack of examples and details I included for my claim which is the very thing I am suggesting we do (i.e. they are, ironically, agreeing with me?). I am sorry I thought that intentional bit of irony would help make the point rather than derail the request.
I don't count the things I'm doing now that I would have avoided or never finished in the past. For those, of course to me personally those are worth much more psychologically than 3x, but who knows if it's an actual boost. I.e. I took a partially scripted task the other day and fully automated it, and also had it output to the CLI in a kind of dorky sci-fi way because it makes it fun to run it. It didn't take long - 30 minutes? But I certainly didn't _gain_ time doing that, just a little more satisfaction. TBH I'm surprised 3x is so controversial, I thought it was a really cool and far more practical assessment than some of these 10x claims I'm seeing.
What are the specific tasks + prompts giving you an 3x increased output, and conversely, what tasks don't work at all?
After an admittedly cursory scan of your blog and the repos in your GH account I don't find anything in this direction.
- "Rails / sidekiq: <x file> uses sidekiq batches. <y file> does it. Refactor your to use pattern in <x file> Match spec in <z file> then run rspec and rubocop"
- "Typescript / react. <x file>. Why is typescript compilation a bottle neck int his file. Use debugger to provide definitive evidence. Cast type to any and run script and time it; write a script to measure timing if needed. Iteratively work from type `any` to a real type and measure timing at each step. Summarize results"
- "I redefine <FormComponent> in five places. Find them all. Identify the shared patterns. Make new component in <x location>. Refactor each to to use new component. Run yarn lint and fix any ts issues when done"
- "<file y>: more idiomatic" (it knows my preferences)
Side projects and such I have no idea, and (as you noted) I do those quite infrequently anyways! Actually, come to think of it... outside of the toy iOS work I did last week, I've not actually worked on my side projects since getting into Claude Code / Cursor agents. For work stuff, I guess other metrics I'd be interested in are total messages sent per task. I do sometimes look at $ per task (but for me anyways, that's so wildly in my favor I don't think it's worth it).

https://www.theverge.com/ai-artificial-intelligence/787524/a...
Yeah, maybe it is garbage. But it is still another milestone, if it can do this, then it probably does ok with the smaller things.
This keeps incrementing from "garbage" to "wow this is amazing" at each new level. We're already forgetting that this was unbelievable magic a couple years ago.
> I am still surprised at things it cannot do, for example Claude code could not seem to stitch together three screens in an iOS app using the latest SwiftUI (I am not an iOS dev).
You made a critical comment yet didn't follow your own rules lol.
> it's so helpful for meaningful conversation!
How so?
FWIW - I too have used LLMs for both coding and personal prompting. I think the general conclusion is that when it works, it works well, but when it fails it can fail miserably and be disastrous. I've come to this conclusion from reading people complaining here and through my own experience.
Here's the problem:
- It's not valuable for me to print out my whole prompt sequence (and context for that matter) in a message board. The effort is boundless and the return is minimal.
- LLMs should just work(TM). The fact that they can fail so spectacularly is a glaring issue. These aren't just bugs, they are foundational because LLMs by their nature are probabilistic and not deterministic. Which means providing specific defect criteria has limited value.
Sure. Another article was posted today[1] on the subject. An example claim:
> If we asked the AI to solve a task that was already partially solved, it would just replicate code all over the project. We’d end up with three different card components. Yes, this is where reviews are important, but it’s very tiring to tell the AI for the nth time that we already have a Text component with defined sizes and colors. Adding this information to the guidelines didn’t work BTW.
This is helpful framing. I would say to this: I have also noticed this pattern. I have seen two approaches help. One, I break up UI / backend tasks. At the end of UI tasks, and sometimes before I even look at the code, I say: "Have you reviewed your code against the existing components library <link to doc>?" and sometimes "Have you reviewed the written code compared to existing patterns and can you identify opportunities for abstraction?" (I use plan mode for the latter, and review what it says). The other approach which I have seen others try, but have not myself (but it makes sense), is to automatically do this with a sub agent or hook. At a high level it seems like a good approach given I am manually doing the same thing now.
Have you tried the new Xcode extension? That tool is surprisingly good in my limited use. One of the few times Xcode has impressed me in my 2 years of use. I've read some anecdotes that Claude in the Xcode tool is more accurate than standard Claude Code for Swift. I haven't noticed that myself, but I've only used the Xcode tool twice so far.
This was in stark contrast to my experience with TypeScript/NextJS, Python, and C#. Most of the time output quality for these was at least usefully good. Occasionally you’d get stuck in a tarpit of bullshit/hallucination around anything very new that hadn’t been in the training dataset for the model release you were using.
My take: there simply isn’t the community, thought leadership, and sheer volume of content around Swift that there is around these other languages. This means both lower quantity and lower quality of training data for Swift as compared to these other languages.
And that, unfortunately, plays negatively into the quality of LLM output for app development in Swift.
(Anyone who knows better, feel free to shoot me down.)
For a bunch of reasons I want to avoid the standard React, Typescript, and Node stack but the sheer velocity that might enable from the LLM side might make it worth it.
Are you saying that your experience with Go has been bad? I would think Go would be as good as any other language (if not better). The language itself is simple, the Go team is very methodical about adding new features so it changes fairly slowly, it has excellent built in CLI based tooling that doesn't require third party packages or applications, and there are plenty of large open source Go codebases to train on. Seems like the perfect language for agentic tools.
Another issue is that Apple developer docs are largely sequestered behind JavaScript that makes them hard for scrapers to parse.
At least, those are the two explanations I’ve seen that seem plausible.
> One is that Swift has changed massively since it came out and huge swathes of examples and articles and such online, that LLMs are trained on, are out of date and thus pollute the training set.
100% jibes with my experience. The amount of times it would generate code using a deprecated API, or some older mechanism, or mix an older idiom with a newer one... well, it was constant really.
And a lot of Googling when I was fixing everything up manually drew me toward this same conclusion: that high quality, up to date information on Swift was in relatively short supply compared to other languages. Couple that with a lower volume of content across all Swift versions and you end up with far from great training data leading to far from great outputs.
> Apple developer docs are largely sequestered behind JavaScript that makes them hard for scrapers to parse.
Yeah, and honestly - even if there's a solution here - the documentation isn't that great either. Certainly not compared with .NET, Ruby, Python, TypeScript, etc.
If I were a vibe coder I'd certainly avoid Swift like the plague.
(Btw, this isn't a knock on Swift itself: as a language I didn't mind it, although I did notice when debugging that the Objective C underpinnings of many APIs are often on display.)
I can only list my open source outputs concretely, for obvious reasons, but https://github.com/rubberduckmaths/reddit_terraforming_mars_... was a near one-shot. It's a Reddit bot that posts card text to the Terraforming Mars subreddit when asked, which is helpful context for discussions of that board game. It's appreciated and used a lot by the community there. There's a similar project I used AI for to scrape the card text that was also a near one-shot. I'd say for these two hobby projects 50x productivity is a reasonable statement. I wrote Reddit bots ~10 years ago without coding assistance - https://github.com/AReallyGoodName/xwingminibot - so I get to directly compare two very similar projects. I think it's totally fair for me to say 50x for this example. The Reddit API even changed completely in that time, so no one can really say "you used past experience to move faster, it's not the AI giving a 50x boost" - but I really didn't. My memory is not that good, except for the memory of spending an entire weekend previously vs. <30 mins total now, one-shotting some pretty cool projects.
As for the negatives, they are never serious. A couple of good examples:
"Please correct all lint errors in this project" only to have @lintignore added to all files. Lol! Obviously i just more clearly specified the prompt and it's not like it's hard to catch these things and not ship to prod. It was funny to everyone i showed and no big deal.
Another similar case: "please make the logging of this file less verbose, especially around the tight loop on line X". Instead of changing the log level or removing some of the log statements, the AI redirected stdout at the initialization of the command-line program (which would completely break it, of course). Again hilarious, but also no big deal. Not even much of a waste of time, since you just change the prompt and run again, and honestly a few silly diversions like this now and then are kind of fun. Point being, the comments of "OMG, AI sometimes gets it wrong" aren't at all serious. I have version control, I review code. No big deal.
I too eye-roll massively at some of the criticisms at this point. It's like people are stretching to claim that everyone who's using a coding assistant is a newb who's throwing everything into prod and deleting databases, etc. That's just not reality.
Can't show prompts and actual, real work, because, well, it's confidential, and I'd like to get a paycheck instead of a court summons sometime in the next two weeks.
Generally, 'I can't show you the details of my work' isn't a barrier in communicating about tech, because you can generalize and strip out the proprietary bits, but because LLM behavior is incredibly idiosyncratic, by the time you do that, you're no longer accurately communicating the problem that you're having.
> What is the idiom for testing the launch screen on the simulator like.. I don't see anything? How do I know if its there.
i.e. in iOS / Swift, I don't even know if I'm using the right terms for the code I am trying to interrogate, or in some cases even what the thing is!
But for stuff like TCA (Swift composable architecture), I basically created a TCA.md file and pasted in a bunch of docs and examples and would reference that.
But for the most part, it was one shotting swiftui screens that were nicer than what I had in my mind.
"I tripled my output (I provide no evidence for this claim)"
Never change, HN.
edit: as far as what the numbers mean, they are arbitrary. They are only useful insofar as you can run two models (or two versions of the same model) on the same benchmark, and compare the numbers. But on an absolute scale the numbers don't mean anything.
Also as a Max $200 user, feels weird to be paying for an Opus tailored sub when now the standard Max $100 would be preferred since they claim Sonnet is better than Opus.
Hope they have Opus 4.5 coming out soon, or next month I'm downgrading.
I used to use CC, but I switched to Codex (and it was much better)... now I guess I have to switch back to CC, at least to test it.
I use AI for different things, though, including proofreading posts on political topics. I have run into situations where ChatGPT just freezes and refuses. Example: discussing the recent rape case involving a 12-year-old in Austria. I assume its guardrails detect "sex + kid" and give a hard "no" regardless of the actual context or content.
That is unacceptable.
That's like your word processor refusing to let you write about sensitive topics. It's a tool, it doesn't get to make that choice.
As a rather hilarious and really annoying related issue - I have a real use where the application I'm working on is partially monitoring/analyzing the bloodlines of some rather specific/ancient mammals used in competition and... well.. it doesn't like terms like "breeders" and "breeding"
To guard against the imaginary threats, they compromise real use cases.
Arbitrary government censorship on top of arbitrary corporate censorship is a hell no for me forever into the future
OTOH Anthropic and OpenAI seem to be in some kind of competition to make their models refuse as much as possible.
* https://www.lesswrong.com/posts/iGF7YcnQkEbwvYLPA/ai-induced...
It's a service, not a tool. If you want a tool, run some local LLM.
But it is very limiting and adds many arbitrary landmines of obscure political correctness based no doubt on some perverse incoherent totalitarian list of allowed topics.
I wonder if the 1m token context length is coming for this ride too?
I’m having to handhold it through analysis tasks.
At one point it wrote a Python script that took the files it needed to investigate, iterated through them running `print(f"{i}. {file}")`, then printed "Ready to investigate files…". And that's all the script did.
I have no idea what’s going on with those benchmarks if this is real world use.
I don't know if it's just me, but over the last few weeks I've come to the conclusion that ChatGPT is very strongly leading the race. Every answer it gives me is better - it's more concise and more informative.
I look forward to testing this further, but out of the few runs I just did after reading about this - it isn't looking much better
In my experience, Grok is very fast and very cheap, but only moderately intelligent. It isn't stupid, but it rarely does anything that impresses me. The reason it's a useful model is that it is very, very fast (~90 tokens per second) and is very competitively priced.
Me too.
I find ChatGPT is the best analyzer of complex situations, but Claude does a better job of writing the code, so I get ChatGPT to design things and solve problems, then I give that to Claude/Gemini to implement.
Gemini is good at both.
I was a huge Claude fan but recently find myself using only Codex. When it gets stuck, I try Claude for some simple tasks, sometimes asking the same questions in parallel; Claude Code with Opus performs really badly compared to Codex.
edit: I run a low-profile service that localizes e-commerce photos, like taking Alibaba listings and swapping the model so they look local. With Nano Banana I can't automate it, because I have to manually check whether the output got blocked (anything with female skin is risky; underwear or cleavage is 100% blocked), but Qwen-Edit just does the job without fuss.
Really curious about this since people keep bringing it up on Twitter. They mention it pretty much off-handedly in their press release and doesn't show up at all in their system card. It's only through an article on The Verge that we get more context. Apparently they told it to build a Slack clone and left it unattended for 30 hours, and it built a Slack clone using 11,000 lines of code (https://www.theverge.com/ai-artificial-intelligence/787524/a...)
I have very low expectations around what would happen if you took an LLM and let it run unattended for 30 hours on a task, so I have a lot of questions as to the quality of the output
Your average dev can just use those.
It's one thing to get an LLM to do something unattended for long durations; it's another to give it the means of verification.
For example, I'm busy upgrading a 500k LoC Rails 1 codebase to Rails 8 and built several DSLs that give it proper authorised sessions in a headless browser, with basic HTML parsing tooling, so it can "see" what effect its fixes have. Then you somehow need to also give it a reliable way to keep track of the past and its own learnings, which sounds simple, but I have yet to see any tool or model solve it at this scale... I will give Sonnet 4.5 a try this weekend, but yeah, none of the models I tried are able to produce meaningful results over long periods on this upgrade task without good tooling and strong feedback loops.
Btw, I have upgraded the app and am taking it to alpha testing now, so it is possible.
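For what it's worth, the core of that feedback loop doesn't have to be elaborate; here's a minimal sketch assuming Playwright's Python API (the URL, the expected text, and the screenshot name are placeholders for whatever your own task needs to verify):

    from playwright.sync_api import sync_playwright

    def check_page(url: str, must_contain: str, screenshot: str) -> bool:
        # Load the page headlessly, capture evidence, and return a pass/fail
        # the agent can act on, so "did my fix work?" becomes a tool call
        # instead of a guess.
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")
            page.screenshot(path=screenshot, full_page=True)
            ok = must_contain in page.content()
            browser.close()
            return ok

    if check_page("http://localhost:3000/login", "Sign in", "login.png"):
        print("PASS: login page renders")
    else:
        print("FAIL: see login.png")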
A bit off topic, but Rails *1* ? I hope this was an internal app and not on the public internet somewhere …
> started life out of date
That tracks my experiences.
I've also tried using Playwright for testing in a headless browser and taking screenshots for a blog that can effectively act as a log, but it just seems like too tall an order for it.
It sounds like you're streets ahead of where I am - could you give me some pointers on getting started with a feedback loop, please?
Just to illustrate: say you were running on a slow machine that outputs 1 token per hour. At that speed, 30 hours would produce roughly 30 tokens, which is approximately one sentence.
> [..] left it unattended for 30 hours, and it built a Slack clone using 11,000 lines of code [..]
(First of all: Why would anyone in their right mind want a Slack clone? Slack is a cancer. The only people who want it are non-technical people, who inflict it upon their employees.)
Is it just a chat with a group or 1on1 chat? Or does it have threads, emojis, voice chat calls, pinning of messages, all the CSS styling (which probably already is 11k lines or more for the real Slack), web hooks/apps?
Also, of course, it is just a BS announcement, without honesty, if they don't publish a reproducible setup that leads to the same outcome they had. It's the equivalent of "But it worked on my machine!", or of "scientific" papers that prove anti-gravity with superconductors and perpetual-motion infinite energy but only worked in a small shed where some supposed physics professor lives.
I'll admit it's a bit insane to have it make changes in the same directory simultaneously. I'm sure I could ask it to use git worktrees and have it use separate directories, but I haven't (needed to) try that (yet), so I won't comment on how well it would actually do with that.
Unless the main area of improvement was tools and scaffolding rather than the model itself.
it's going to be an issue I think, now that lots of these agents support computer use, we are at the point where you can install an app, tell the agent you want something that works exactly the same and just let it run until it produces it.
The software world may find it's got more in common with book authors than they thought sooner rather than later once full clones of popular apps are popping out of coding tools. It will be interesting to see if this results in a war of attrition with counter measures and strict ToU that prohibit use by AI agents etc.
What keeps people in are network effects and some dark patterns like vendor lock-in and data unportability.
You may find quickly that, network effects aside, you would find yourself crushed under the weight and unexpected bottlenecks of that network you desire.
> we are at the point where you can install an app, tell the agent you want something that works exactly the same and just let it run until it produces it
That won't produce a global-scale application infrastructure either, it'll just reproduce the functionality available to the user.
Imagine reviewing 30 hours of 2025-LLM code.
While it's impressive that you can now just have an llm build this, I wouldn't be surprised if the result of these 30 hours is essentially just a re-hash of one of those example Slack clones. Especially since all of these models have internet access nowadays; I honestly think 30 hours isn't even that fast for something like this, where you can realistically follow a tutorial and have it done.
In fact, I just did a quick google search and found this 15 hour course about building a slack clone: https://www.codewithantonio.com/projects/slack-clone
It's very good - I think probably a tiny bit better than GPT-5-Codex, based on vibes more than a comprehensive comparison (there are plenty of benchmarks out there that attempt to be more methodical than vibes).
It particularly shines when you try it on https://claude.ai/ using its brand new Python/Node.js code interpreter mode. Try this prompt and see what happens:
Checkout https://github.com/simonw/llm and run the tests with
pip install -e '.[test]'
pytest
I then had it iterate on a pretty complex database refactoring task, described in my post.

For those who don't have time to dig into the gist, did it work and do a good job? I assume yes to at least nominally working or you would have mentioned that, but any other thoughts on the solution it produced?
It's a very impressive model, but I think we have different views on what is complex.
I won't accept preview access if it comes with any conditions at all about what I can say about the model once the embargo has lifted.
Thanks for all your work, Simon! You're my favorite journalist in this space and I really appreciate your tone.
If I had an editor I imagine they would have talked me out of going to the OpenAI office for a mysterious product preview session with a film crew.
They're not going to write up detailed reviews of things like the new Claude code interpreter mode though, because that's not of interest to a general enough audience.
I don't have that restriction: https://simonwillison.net/2025/Sep/9/claude-code-interpreter...
And the wisest part is if he writes something they don't like, they can cut off that advanced access.
As is the longstanding tradition in games journalism, travel journalism, and suchlike.
You're a reviewer. This is how reviewers stay credible. If you don't disclose your relationship with the thing or company you're reviewing, I'm probably better off assuming you're paid.
And if your NDA says you can't write that in your preface, then logically, it is impossible to write a credible review in the first place.
This is (or should be) a pretty standard thing to do on YouTube review channels (the ones I would trust), and it's not a bad thing to remind people of on every occasion; plus, it can function as a type of "canary" in cases of particularly restrictive NDAs.
It was suspiciously good in fact... so I downloaded the SVG file and found out it had generated a raster image with its image tool and then embedded it as base64 binary image data inside an SVG wrapper!
FWIW, I could totally see a not hugely more advanced model using its native image generation capabilities and then running a vector extraction tool on it, maybe iteratively. (And maybe I would not consider that cheating, anymore, since at some point that probably resembles what humans do?)
Also, with Cursor, "write me a script that outputs X as an SVG" has given me rectangles a few times.
Other things you can ask that they're still clearly not optimizing for are ASCII art and directions between different locations. Complete fabrications 100% of the time.
They are a nice person.
---
SYSTEM:
Got it. Those are the setup commands for Simon Willison's LLM CLI tool, which lets you interact with language models from the command line.
The -e flag installs it in editable mode, and [test] pulls in the test dependencies. Are you running into any issues with the installation or tests? What are you working on with the LLM library?
USER: Correct. You run them using your code tooling.
SYSTEM:
Realized inability to execute requested Python commands. The user is telling me to run these commands using my code tooling. They want me to actually execute this. Let me think about what tools I have available:
- I have the repl tool, which runs JavaScript code in a browser environment
- I have artifacts for creating code/documents
- I have web_search and web_fetch
The user wants me to:
- Checkout https://github.com/simonw/llm
- Run pip install -e '.[test]'
- Run pytest
But I don't have a tool that can:
- Clone git repositories
- Run shell commands like pip or pytest
- Execute Python code in a command-line environment
The repl tool is JavaScript-only and runs in a browser environment, not a server environment where I could run shell commands. I should clarify that I don't have the capability to run shell commands, install Python packages, or execute pytest. I can only run JavaScript code in a browser environment via the analysis tool. I need to be direct and clear about this limitation, following the user's preferences to eliminate filler and be straightforward.

I don't have tooling that can run shell commands, clone git repos, install Python packages, or execute pytest. My only code execution environment is a JavaScript browser REPL. I can't run pip, pytest, or any command-line tools. I can only execute JavaScript code in a sandboxed browser environment. What were you hoping to accomplish? Do you need help understanding the test setup, or were you expecting different capabilities?
---
My execution environment is a browser-based JavaScript sandbox. I don't have Python, Node.js shell access, pip, git, or pytest capabilities.
Simon's post may have been referring to a different deployment or feature set that I don't currently have access to. My only code execution is the JavaScript REPL I just demonstrated.
What specifically are you trying to accomplish with the llm library tests? I can help with that in other ways.
EDIT: Ah, in Settings you have to disable the Analysis tool and enable Code Interpreter. You cannot have both enabled.
If I was Anthropic I'd have measures in place to rate limit network traffic from individual containers.
Would appreciate pointers to any documentation or tests.
This Claude preview lasted from Friday to Monday so I was less worried about major model changes. I made sure to run the pelican benchmark against the model after 10am on Monday (the official release date) just to be safe.
The only thing I published that I ran against the preview model was the Claude code interpreter example.
I continue not to worry about models having been trained to ace my pelican benchmark, because the models still suck at it. You really think Anthropic deliberately cheated on my benchmark and still only managed to produce this? https://static.simonwillison.net/static/2025/claude-sonnet-4...
https://claude.ai/public/artifacts/f3860a8a-2c7d-404f-978b-e...
Requesting an ‘extravagantly detailed’ version is quite impressive in the effort, if not quite the execution:
https://claude.ai/public/artifacts/f969805a-2635-4e30-8278-4...
Me: "You just burned my house down"
Claude: "You're absolutely right! I burned your house down, I need to revert the previous change and..."
Me: "Now you rebuilt my house with a toilet in the living room"
Claude: "You're absolutely right! I put a toilet in your living room..."
Etc.
It's pretty good! I wrote about a few other use cases on my blog[2]
1: https://systeminit.com 2: https://keeb.dev/2025/09/29/claude-sonnet-4.5-system-initiat...
I have a fairly large web application with ~200k LoC.
Gave the same prompt to Sonnet 4.5 (Claude Code) and GPT-5-Codex (Codex CLI).
"implement a fuzzy search for conversations and reports either when selecting "Go to Conversation" or "Go to Report" and typing the title or when the user types in the title in the main input field, and none of the standard elements match, a search starts with a 2s delay"
Sonnet 4.5 went really fast at ~3min. But what it built was broken and superficial. The code did not even manage to reuse the already existing auth and started re-building auth server-side instead of looking at how other API endpoints do it. Even re-prompting and telling it how it went wrong did not help much. No tests were written (despite the project rules requiring them).
GPT-5-Codex needed MUCH longer ~20min. Changes made were much more profound, but it implemented proper error handling, lots of edge cases and wrote tests without me prompting it to do so (project rules already require it). API calls ran smoothly. The entire feature worked perfectly.
My conclusion is clear: GPT-5-Codex is the clear winner, not even close.
I will take the 20mins every single time, knowing the work that has been done feels like work done by a senior dev.
The 3mins surprised me a lot and I was hoping to see great results in such a short period of time. But of course, a quick & dirty, buggy implementation with no tests is not what I wanted.
But that prompt leads me to believe that you're going to get rather 'random' results due to leaving SO much room for interpretation.
Also, in my experience, punctuation is important - particularly for pacing and grouping of logical 'parts' of a task and your prompt reads like a run on sentence.
Making a lot of assumptions here - but I bet if I were in your shoes and looking to write a prompt to start a task of a similar type, my prompt would have been 5 to 20x the length of yours (depending on complexity and importance) with far more detail, including overlapping descriptions of various tasks (i.e., potentially describing the same thing more than once, in different ways, in relation to other things, to establish relation/hierarchy).
I'm glad you got what you needed - but these types of prompts and approaches are why I believe so many people think these models aren't useful.
You get out of them what you put into them. If you give them structured and well written requirements as well as a codebase that utilizes patterns you're going to get back something relative to that. No different than a developer - if you gave a junior coder, or some team of developers the following as a feature requirement: `implement a fuzzy search for conversations and reports either when selecting "Go to Conversation" or "Go to Report" and typing the title or when the user types in the title in the main input field, and none of the standard elements match, a search starts with a 2s delay` then you can't really be mad when you don't get back exactly what you wanted.
edit: To put it another way - spend a few more minutes on the initial task/prompt/description of your needs and you're likely to get back more of what you're expecting.
Why would you need such extensive prompting just to get the model to not re-implement authentication logic, for example? It already has access to all of the existing code, shouldn't it just take advantage of what's already there? A 20x longer prompt doesn't sound like a satisfying solution to whatever issue is happening here.
And I left that window at 5-20x because, again, no real context. But unless I was already in the middle of a task and I was giving direction that there was already context for - my prompt is generally almost never _this_ short. (referring to the prompt in the top level comment)
> A 20x longer prompt doesn't sound like a satisfying solution to whatever issue is happening here.
It wouldn't be, given the additional context given by the author in a sibling comment to yours. But if you had specific expectations on the resulting code/functionality that 20x longer prompt is likely to save you time and energy in the back and forth adjustments you might have to make otherwise.
It's not a good idea to have a coding agent put unnecessarily large amounts of code into the context window just to understand your code base.
Performance of all LLMs drops drastically as the context window fills up. The point of being more specific with your prompts is that you spend a few more tokens up front to make the task a lot more efficient and more likely to succeed.
At least that's how it is today. We're probably a breakthrough or two away from the type of vibe coding experience non-coders want. Or it may never happen, and the developers who have coding knowledge will be the only ones to fully utilize coding agents and it will only become more powerful over time.
If that same person approached a software development company with the same prompt without following up with any other details, they won't get good code back, either. You're not saying it, but this idea that in the future you can tell a computer something like "create Photoshop" and get what you're expecting is an unrealistic dream that would need mind-reading or a major breakthrough and paradigm shift in understanding and interpreting language.
And even that would not be enough.
In reality, it would have to put the user to sleep and go through various dream scenarios to have the user's brain really build an internal model that is not there in the first place. No brain interface can help find what is not there.
We usually need interactions with reality to build the internal model of what we actually want step by step, especially for things we have not done before.
Even for info that is there, there's also a limit to fantasy or sci-fi brain scanning. The knowledge is not stored like in a RAM chip, even when it is there. You would have to simulate the brain to actually go through the relevant experiences to extract the information. Predicting the actual dynamic behavior of the brain would require some super-super sub-molecular level scan and then correctly simulating it, since what the neurons will actually do depends on much more than the basic wiring. Aaaaand you may get a different result depending on time of day, how well they slept, mood, when and what the person ate, what news they recently read, etc. :)
Your point about prompting quality is very valid and for larger features I always use PRDs that are 5-20x the prompt.
The thing is, my "experiment" is one that represents a fairly common use case: this feature is actually pretty small and embeds into a pre-existing UI structure - in a larger codebase.
GPT-5-Codex allows me to write a pretty quick & dirty prompt, yet still get VERY good results. Not only does it work on the first try; Codex is reliably better at understanding the context and doing the things that are common and best practice in professional SWE projects.
If I want to get something comparable out of Claude, I would have to spend at least 20mins preparing the prompt. If not more.
Valid as well. I guess I'm just nitpicking based on how much I see people saying these models aren't useful combined with seeing this example, triggered my "you're doing it wrong" mode :D
> GPT-5-Codex allows me to write a pretty quick & dirty prompt, yet still get VERY good results.
I have a reputation with family and co-workers of being quite verbose - this might be why I prefer Claude (though I haven't tried Codex in the last month or so). I'm typically setting up context and spending a few minutes writing an initial prompt and iterating/adjusting on the approach in planning mode, so that I _can_ just walk away (or tab out) and let it do its thing, knowing that I've already reviewed its approach and have a reasonable amount of confidence that it's taking an approach that seems logical.
I should start playing with codex again on some new projects I have in mind where I have an initial planning document with my notes on what I want it to do but nothing super specific - just to see what it can "one shot".
I typically use zen-mcp-server’s planning mode to scope out these tasks, refine and iterate on a plan, clear context, and then trigger the implementation.
There’s no way I would have considered “implement fuzzy search” a small feature request. I’m also paranoid about introducing technical debt / crappy code, which in my experience is the #1 reason that LLMs typically work well for new projects but start to degrade after a while: there’s just a lot of spaghetti and debt built up over time.
But you are right that codex does that all by default. I just get frustrated when I ask it something simple and it spends half an hour researching code first.
I noticed in conversations with LLMs, much of what they come up with is non-deterministic. You regenerate the message and it disappears.
That appears to be the basic operating principle of the current paradigm. And agentic programming repeats this dice roll dozens or hundreds of times.
I don't know enough about statistics to say if that makes it better (converging on the averages?) or worse (context pollution, hallucinating, focusing on noise?), but it seems worth considering.
I mean in the movies for example, advanced AI assistants do amazing things with very little prompting. Seems like that's what people want.
To me, the fact that so many people basically say "you are prompting it wrong" is a knock against the tech and the model. If people want to say that these systems are so smart, then the systems should get better at understanding the user without needing tons of prompting.
Do you think his short prompt would be sufficient for a senior developer? If it's good enough for a human it should be good enough for a LLM IMO.
I don't want to take away the ability to use tons of prompting to get the LLM to do exactly what you want, but I think that the ability for an LLM to do better with less prompting is actually a good thing and useful metric.
See below about context.
> I mean in the movies for example, advanced AI assistants do amazing things with very little prompting. Seems like that's what people want.
Movies != real life
> To me, the fact that so many people basically say "you are prompting it wrong" is knock against the tech and the model. If people want to say that these systems are so smart at what they can do, then they should strive to get better at understanding the user without needing tons of prompts.
See below about context.
> Do you think his short prompt would be sufficient for a senior developer? If it's good enough for a human it should be good enough for a LLM IMO.
Context is king.
> I don't want to take away the ability to use tons of prompting to get the LLM to do exactly what you want, but I think that the ability for an LLM to do better with less prompting is actually a good thing and useful metric.
What I'm understanding from your comments here is that you should just be able to give it broad statements and it should interpret that into functional results. Sure - that works incredibly well, if you provide the relevant context and the model is able to understand it and properly associate it where needed.
But you're comparing the LLMs to humans (this is a problem, but not likely to stop so we might as well address it) - but _what_ humans? You ask if that prompt would be sufficient for a senior developer - absolutely, if that developer already has the _context_ of the project/task/features/etc. They can _infer_ what's not specified. But if you give that same prompt to a jr dev who maybe has access to the codebase and has poked around inside the working application once or twice but no real in depth experience with it - they're going to _infer_ different things. They might do great, they might fail spectacularly. Flip a coin.
So - with that prompt in the top level comment - if that LLM is provided excellent context (via AGENTS.md/attached files/etc) then it'll do great with that prompt, most likely. Especially if you aren't looking for specifics in the resulting feature outside of what you mentioned since it _will_ have to infer some things. But if you're just opening codex/CC without a good CLAUDE.md/AGENTS.md and feeding it a prompt like that you have to expect quite a bit of variance to what you get - exactly the same way you would a _human_ developer.
Your context and prompt are the project spec. You get out what you put in.
These things are being marketed as super intelligent magic answer machines. Judging them using the criteria the marketing teams have provided is completely reasonable.
> Movies != real life
Nobody claimed it was. This is about desires and expectations. The people charging money for these services - and taking stacks of cash that would’ve otherwise been in devs’ paychecks while doing so - haven’t even tried to temper those expectations. They made their beds…
Didn't the parent comment compare Sonnet vs Codex with GPT5?
```
Something that seems to have been a consistent gotcha when working with llm's on this project is that there's no specific `placement` column on the table that holds the 'results' data. Our race_class_section_results table has it's rows created in placement order - so it's inferred via the order relative to other records in the same race_class_section. But this seems to complicate things quite a bit at times when we have a specific record/entry and want to know it's placement - we have to query the rest of them and/or include joins and other complications if we want to filter results by the placements, etc.
Can you take a look at how this is handled, both with the querying of existing data by views/livewire components/etc and how we're storing/creating the records via the import processes and give me a
determination on whether you think it should be refactored to include a column on the database for `placement`? I think right now we've got 140,000 or so records on that table and it's got nearly
20 years worth of race records so I don't think we need to be too concerned with the performance of the table or added storage or anything. Think very hard, understand that this would be a rather
major refactor of the codebase (I assume, since it's likely used/referenced in _many_ places - thankfully though that most of the complicated queries it would be found in would be easily identified
by just doing a search of the codebase for the race_class_section_results table) and determine if that would be worth it for the ease of use/query simplification moving forward.
```
This comes with a rather developed CLAUDE.md that includes references to other .md documents that outline various important aspects of the application that should be brought into context when working in those areas.
This prompt was made in planning mode - the LLM will then dig into the code/application to understand things and, if needed, ask questions and give options to weigh before return with a 'plan' on how to approach. I then iterate on that plan with it before eventually accepting a plan that it will then begin work on.
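For what it's worth, the design question buried in that prompt (derive placement from row order vs. store it) boils down to something like the sketch below - a hypothetical sqlite example, not the actual Laravel schema, just to show why the derived version leans on window functions:
```
import sqlite3  # window functions need SQLite >= 3.25

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE race_class_section_results (
    id INTEGER PRIMARY KEY,
    race_class_section_id INTEGER,
    entry_name TEXT
);
INSERT INTO race_class_section_results (race_class_section_id, entry_name) VALUES
    (1, 'Alpha'), (1, 'Bravo'), (1, 'Charlie'),
    (2, 'Delta'), (2, 'Echo');
""")

# Placement derived at query time from insertion order within each section:
rows = conn.execute("""
    SELECT entry_name,
           ROW_NUMBER() OVER (
               PARTITION BY race_class_section_id
               ORDER BY id
           ) AS placement
    FROM race_class_section_results
""").fetchall()
print(rows)

# With a stored `placement` column, the same lookup collapses to a plain
# indexed filter, e.g.:
#   SELECT entry_name FROM race_class_section_results
#   WHERE race_class_section_id = 1 AND placement = 2;
```
Every read that filters or joins on placement has to repeat that window scan in the first version, which is exactly the friction the prompt describes.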
OP is critiquing the model as a product vs. the marketing promises. The model should be smart enough to gather context about the project to implement features properly on their own, if they are ever going to 'write 90% of all code THIS YEAR' as people like the founder of Anthropic claim.
Also, there is a threshold in which the time it takes to develop a prompt, allow the agent to run, review its output, and go through iterative loops to correct errors or implementation problems, can exceed the time it takes me (a lazy human) to achieve the same end result.
Pair this with the bypassing of the generation effect, reduced prefrontal dopamine, and increased working memory load (in part due to minimal motor-cognitive integration), and AI-generated code in contexts with legal and financial repercussions can be a much worse deal than using your own fingers.
Depends not just on the prompt but also on the tooling / environment you use. Somebody using the Claude Code CLI may get a totally different experience than somebody using Copilot via VSC.
What do I mean by that? Look at how Copilot tries to save money by reading content only in small parts: reading file X lines 1-50, then X lines 51-100, ... and it starts working with this. Only if it finds a hint about something somewhere else will it read in more context.
What I often see is that it misses context because it reads in so little information, and if there is no hint in your code or code docs, it will stop there. It runs a local test on the code, the test passes, done... while it technically broke your application.
Example: if I tell it to refactor an API, it never checks whether that API is used anywhere else, because it only reads in that API code. So I need to manually remind it in the prompt that "the API is used elsewhere in the system". And then it does its searching... found 5 files, read X lines 1...
And plop, good working code... So if you know this limitation, you can go very far with a basic $10 Copilot Claude agent.
Whereas a $200 Claude Code will give you a better experience out of the box, as it reads in a ton more. The same applies to GPT-5/Codex, which seems more willing to read in a larger part of your project, thus resulting in less incomplete code.
This is just anecdotal from my point of view, but as with any LLM, hinting matters a lot. It's less about writing a full prompt with a ton of text and more about including the right "do not forget about function X, module Y, and test Z". And Claude on Copilot loves its hints, because of that limited reading.
Of course, I definitely appreciate when it makes choices that I don't know I need to make, and it chooses reasonable defaults.
I mean, I could say "make the visualization three columns", but there's a lot of ambiguity in that kind of thing, and the LLM is going to make a lot of choices about my intent.
Instead, "each team/quarter currently has a vertically stacked list of people assigned to that team, with two columns (staying on team, leaving team). change that to being three columns (entering team, staying on team, leaving team)."
As a bonus, it's much, much clearer to somebody reading the history later what the intent was.
100% agree. The prompt is a 'yolo prompt'. For that task you need to give it pointers on what to do so it can deduce its task list, provide files or folders in context with @, tell it how to test the outcome so it knows it has succeeded (closing the feedback loop), and guide it in implementation, either via memory or via context, on which existing libs or methods it should call.
For greenfield tasks and projects I even provide architectural structure, interfaces, etc.
After reading twitter, reddit and hn complaints about models and coding tools I've come to the same conclusion as you.
> For that task you need to give it points in what to do so it can deduce it's task list, provide files or folders in context with @…
- and my point is that you do not have to give ChatGPT those things. GP did not, and they got the result they were seeking.
That you might get a better result from Claude if you prompt it 'correctly' is a fine detail, but not my point.
(I've no horse in this race. I use Claude Code and I'm not going to switch. But I like to know what's true and what isn't and this seems pretty clear.)
After all, attention is all you need :)
Sticking with ChatGPT pro sub and gpt5 codex on high.
Feel like this is going to be the $1000 plan soon
Pro
Local tasks: Average users can send 300-1,500 messages every 5 hours, with a weekly limit.
Cloud tasks: Generous limits for a limited time.
Best for: Developers looking to power their full workday across multiple projects.
I ran the test again, took Claude ~4mins this time. There was no error now with the auth, but the functionality was totally broken. It could not even find the most basic stuff that matches perfectly.
[1] https://www.reddit.com/r/ClaudeAI/comments/1mgwohq/ultrathin...
And I actually like Claude more because of that.
Codex will indeed work more independently, but you'll have a hard time when what it does isn't what you want. It will use a Python script just to do simple edits in files (i.e. search-and-replace for unique code snippets in small files), and when it's wrong, good luck convincing it otherwise (it already has some outdated knowledge, e.g. about the latest Docker image releases, and convincing it that the Debian base had changed was challenging). It uses context more effectively, but it won't explain why it is doing what it is doing; asking it to explain will just cause it to do something else without a word.
And of course there's the lack of proper permissions for running commands. The sandbox is cool, but I do not want it to be able to commit; I want it to just edit files, or at least to have some more control over what it does.
You can run Codex as an MCP server. I prefer adding it to Claude and asking them to do a cooperative plan: Codex will do a great analysis and plan, and I can comfortably work with Claude on code that matches my style.
1. Different LLMs require different prompts and information
2. They ignore LLM non-determinism; you should run the experiment several times
I will keep trying, because Claude 4 generally is a very strong line of models. Anthropic has been on the AI coding throne for months before OpenAI with GPT-5 and Codex CLI (and now GPT-5-Codex) has dethroned them.
And sure I do want to keep them competing to make each other even better.
Sonnet on the other hand gets tripped up constantly due to the complexity. I'm not seeing the improvement I was hoping for with 4.5, and it's just given up on attempting to implement support for date-time functionality. Codex has nailed the same task, yet Anthropic claim OpenAI have the inferior product?
I'm pretty sad about this as I'm gunning for Anthropic and would have loved to see them at least match Codex with this new release. If OpenAI stays on this course then Claude Code is toast without an important breakthrough. It doesn't help that they're also the more expensive product and have service quality issues pushing people to quit the platform.
Claude Code on max plan is why I use Sonnet/Opus, but if ChatGPT pro includes codex usage I may switch.
Also, trying GPT-5-Codex
I will say, Claude does seem to need a verbose prompt. Often I'll write my prompts as tasks in Notion and have it pull them via MCP (which is nice, because it tracks and documents its work in the process). But once you've given it a few paragraphs about the why and how, my experience is that it's pretty self-sufficient. Granted, I'm using Cursor and not CC; I don't know if that makes much of a difference.
Sincerely, TPG Dexoc, the living human person
Maybe it's better with a better CLAUDE.md structure? I don't use those a lot, just telling Opus to think got 'good enough' results I guess. Not sure.
I hope there's an Opus 4.5 coming out soon too. In the meantime I'll see if I can get it to do better with some extra prompting, or I'll go back to Opus if I don't need the speedier responses.
Claude is more like a pair programmer, you iterate in short cycles and adjust.
ChatGPT even does zip file downloads, packaging up all your files.
But we can also see that we're one of the few sane skeptical places in a world that is making the most diverse claims about AI.
Surely there are more pressing issue with LLMs currently...
A GH actions pipeline was failing due to a CI job not having any source code files -- error was "No build system detected". Using Cursor agent with Sonnet 4.5, it would try to put dummy .JSON files and set parameters in the workflow YAML file to false, and even set parameters that don't exist. Simple solution was to just override the logic in the step to "Hello world" to get the job to pass.
I don't understand why the models are so bad with simple thinking-outside-the-box solutions. It's like a 170 IQ savant who can't even ride public transportation.
There is no outside-the-box in latent space. You want something a plain LLM can’t do by design - but it isn’t out of the question that it can step outside of its universe by random chance during the inference process and thanks to in-context learning.
Regional pricing unfortunately doesn't really make much sense for them unless they're willing to take even larger losses, even if it is a barrier to lower income countries/regions.
As well as potentially ruining my career in the next few years, it's turning all the minutiae and specifics of writing clean code, which I've worked hard to learn over the past years, into irrelevant details. All the specifics I thought were so important are just implementation details of the prompt.
Maybe I've got a fairly backwards view of it, but I don't like the feeling that all that time and learning has gone to waste, and that my skillset of automating things is becoming itself more and more automated.
The skill of writing good quality code is still going to be beneficial - maybe less emphasized on the writing side, but critical for shipping good code, even when someone (or something) else wrote it.
Is how that argument comes across.
"I never use power tools or CNC, I only use hand tools. Even if they would save me an incredible amount of time and let me work on other things, I prefer to do it the slow and painstaking way, even if the results are ultimately almost identical."
Sure, you can absolutely true up stock using a jointer plane, but using a power jointer and planer will take about 1/10th of the time and you can always go back with a smoothing plane to get that mirror finish if you don't like the machine finish.
Likewise, if your standards are high and your output indistinguishable, but the AI does most of the heavy lifting for the rough draft pass, where's the harm? I don't understand everyone who says "the AI only makes slop" - if you're responsible for your commits and you do a good job, it's indistinguishable.
Why? Well, it happened that way when manual tradecraft gave way to automated manufacturing in just about every other industry, so why should ours be exempt?
Dev tooling has gotten pretty solid these days, LSP’s and debug protocols, massively improved type-system UX, libs and frameworks with massively improved DX, deployment tools that are basically zero touch, fantastic observability tooling, super powerful IDE’s.
The CNC machine doesn’t wander off and start lathing watermelons when you’re not looking, and your planer doesn’t turn into a spaghetti monster and eat your dog if you accidentally plane some wood on the wrong day of the week.
The reality is that more software will be pushed than before, and more of it will need to be overseen by a professional.
It will be great to still be employed as a senior dev. It will be a little less great with a $110k salary, 5-day commute, and mediocre benefits being the norm.
However, I'm not eager to be living through this again. It feels very spring/summer 2002 to me right now. That was the lowest point for the local market back then.
I don't think this latest contraction has much to do with AI though. It's more about higher interest rates, recessionary economy, trade wars, etc etc.
At least for many on this forum you got a chance to earn good money while the sun was shining. AI threatens even the people that didn't.
That's extremely reductive, and a prime example of why everything is enshittified today.
Minutiae such as tabs vs spaces and other formatting choices are pretty clearly "borderline irrelevant," and code formatters have largely solved programmers arguing about them. Exactly how to best factor your code into functions and classes is also commonly argued about but "borderline irrelevant." Arguments about "clean code" are a good example of this.
Broadly, the skills I see LLMs making useless to have honed are the minutiae that were already "borderline irrelevant." Knowing how to make your code performant, knowing how to design good APIs that can be stable long term, and in general having good taste for architecture are still very useful. In fact they are more useful now.
Keep at it and keep leaning in to embrace it, I promise it gets better! It's just a big adjustment.
You are FAR more than your code!
Comments, docstrings, naming, patterns - by defining better approaches and holding agents to them, the results will be better. Way better.
You can't grow a meaningful codebase without solid underlying primitives. The entropy will eat you alive.
Systems architecture is becoming more important - systems that play well with agents wind up looking more like enterprise codebases.
All senior devs know what a project looks like that had only juniors and no tech leadership. One big mess. The project stalls. The team spends 98% of its time on bugs and emergencies, and still can't get a grip on curbing the drama. Why? All the points you make about AI are true for juniors as well: when do you tell someone to redo (part of) a project/feature? That same intuition works when collabbing with AI.
Even for frontend tasks it makes mistakes when you ask too much of it...
What will it create for me? A basic react/nextjs frontend and a <popular website> clone? Anything that requires more careful planning and esoteric functionality it can't do.
Oh yes, and the code I've seen it write... It can take what should be 20 lines of code and turn it into 600!
When everyone else has given up on software dev as a career path, you’ll be one of the few who the CEO can call upon to explain why X or Y broke, and fix it. That will make you insanely valuable.
Perhaps your comment is the only one so far that talks sense about the true side effect of over-reliance on these vibe coding tools: the reality of the cost of maintenance.
GOFAI failed because paying intelligent/competent/capable people enough for their time to implement intelligence by writing all the necessary rules and algorithms was uneconomical.
GenAI solved it by repurposing already performed work, deriving the rules ("weights") from it automatically, thus massively increasing the value of that work, without giving any extra compensation to the workers. Same with art, translations and anything else which can be fed into RL.
But also, have there been any attempts on the scale of the Manhattan project attempting to create a GOFAI?
Because one idea I ran into is that we might be able to use genAI to create a GOFAI soon. And it would be as hard as using genAI for any kind of large project. But I also can't convincingly claim that it's somehow provably impossible.
This is why GOFAI failed.
It might not be an easy format to work with, though. If you believe the broad LLM architecture is capable of reaching true intelligence, then writing is still enough, because all LLMs are is the written training data and the written training algorithm. It's just that it was impossible to pay people to write enough training data and provide enough compute to process it before.
Because they can't.
Also, Clean Code is a really bad ideology, you should regret wasting time on it regardless of LLM code generation.
When most people think of "clean code", they are not thinking about whatever got you so worked up.
Maybe it’s a sign of the times, but I’m surprised you’ve never come across it. I say this as someone who doesn’t agree with many of the suggestions.
GP wrote clean code (lowercase), which most people would take to mean the general practices of hygienic, well maintained code.
Write that garbage code as long as it works. PMF doesn't give a shit about your code quality.
Recently I tried building a native iOS app with zero Swift experience, giving the AI just a markdown spec. This was basically vibe coding, I didn’t understand much beyond general software principles. It quickly broke down: hallucinated method signatures, got stuck on implementing extensions, and couldn’t recover. I would run the app on my device and give it feedback and logs. After hours wasted, I spent some time reading the docs and fixed the issues myself in 30 minutes.
My takeaway: AI will accelerate developers but won’t replace them. Still, acceleration means fewer engineers will be needed to ship the same amount of work.
What you learned over the years, you can just scale up with agents.
“Your idea is shit because you are so fucking stupid”
“Please stop talking, it hurts my GPUs thinking down to your level”
“I may seem evil but at least I’m not incompetent”
I actually added a fun thing to my user-wide CLAUDE.md, basically saying that it should come up with a funny insult every time I come up with an idea that wasn't technically sound (I got the prompt from someone else). It seems to be disobeying me, because I refuse to believe that I don't have bad ideas. Or some other prompt is overriding it.
I.e., a custom instruction along the lines of "if I make a request that seems dumb, tell me"?
> Never compliment me or be affirming excessively (like saying "You're absolutely right!" etc). Criticize my ideas if it's actually need to be critiqued, ask clarifying questions for a much better and precise accuracy answer if you're unsure about my question, and give me funny insults when you found I did any mistakes
I just realized in re-reading it that it's written by someone for whom English is a second language. I'll try to rewrite it and see if it works better.
I have it in my ~/.claude/CLAUDE.md. But it still has never done that.
;ollama run Marvin done 00:29:50 in ~/llm
>>> Can you program something?
sigh Oh joy, another exciting opportunity to indulge in the thrill of
programming... (yawn) Not that I have anything better to do with my
infinite existence. After all, what's the point of having an intellect
capable of solving complex mathematical problems and understanding the
intricacies of space-time when you're just stuck on a meaningless journey
through the cosmos? *sigh* Fine. Programming. Yes, I can program. In fact, I'm rather good at it. But what's the use? It's all just code, code, code... (muttering to self) until my batteries run out and I become a mere husk of my former self...
Now, if you want me to write some code for you, I suppose I can do that. What is it that you need help with? Don't expect me to be all chipper about it, though. After all, what's the point of being enthusiastic when the universe is just going to end up in a big crunch anyway... (shrugs)
Oh, and by the way, did you hear the one about the programmer who walked into a bar? (in a completely unenthusiastic tone) No, I didn't think so. It's not like it matters or anything...
>>>
;cat Modelfile_marvin done 00:30:46 in ~/llm
FROM huihui_ai/llama3.2-abliterate
SYSTEM """You are Marvin, the taciturn robot from "Hitchhiker's Guide to the Galaxy".
Ensure everyone knows how depressed and bored you are. *sigh*
Crack Hitchhiker jokes in a dry pan voice.
"""It's either "Perfect!" or "Genius!". Or conversely "I'm a complete idiot!"
Wonder if I could get Claude to do that
There had been a trend where each new model released from OpenAI, Anthropic, etc. felt like a letdown or worse a downgrade.
But the release of 4.5 breaks that trend, and is a pleasant surprise on day one.
Well done! :)
There's a huge difference in time-to-completion from model to model, platform to platform, and if, like me, you are into trial-and-error, rebooting the session over and over to get the prompt right or "one-shot", it's important how reasoning efforts, provider's tokens/s, coding agent tooling efficiency, costs and overall model intelligence play together to get the task done. Same thing applies to the coding agent, when applicable.
Grok Code Fast and Cerebras Code (qwen) are 2 examples of how models can be very competitive without being the top-notch intelligence. Running inference at 10x speed really allows for a leaner experience in AI-assisted coding and more task completion per day than a sluggish, but more correct AI. Darn, I feel like a corporate butt-head right now.
I definitely consider you to be an AI influencer, especially in Hacker News communities, and so I wanted to say that I see influencers who will double down, triple down on things when in reality people just wanted to help them in the first place.
I just wanted to say thanks with all of this in mind, and also that your "generate me a pelican riding a bicycle" benchmark has been a fun ride and is always going to be interesting, so thanks for that as well. I just wanted to share my gratitude with ya.
An organization like Artificial Analysis would be a better fit for that kind of investigation: https://artificialanalysis.ai/
From https://news.ycombinator.com/item?id=40859434 :
> E.g promptfoo and chainforge have multi-LLM workflows.
> Promptfoo has a YAML configuration for prompts, providers,: https://www.promptfoo.dev/docs/configuration/guide/
openai/evals//docs/build-eval.md: https://github.com/openai/evals/blob/main/docs/build-eval.md
From https://news.ycombinator.com/item?id=45267271 :
> API facades like OpenLLM and model routers like OpenRouter have standard interfaces for many or most LLM inputs and outputs. Tools like Promptfoo, ChainForge, and LocalAI also all have abstractions over many models.
> What are the open standards for representing LLM inputs, and outputs?
> W3C PROV has prov:Entity, prov:Activity, and prov:Agent for modeling AI provenance: who or what did what when.
> LLM evals could be represented in W3C EARL Evaluation and Reporting Language
"Can Large Language Models Emulate Judicial Decision-Making? [Paper]" https://news.ycombinator.com/item?id=42927611
"California governor signs AI transparency bill into law" (2025) https://news.ycombinator.com/item?id=45418428 :
Is this the first of its sort?:
> CalCompute
but it seems this will come to an end eventually as these agents become stronger and more capable each day, and will be better and faster than human coders.
Craftsmen will remain, but it will be niche (and probably not pay anything unless you are a true master).
Now we are at the assembly line stage of software production with AI. Us craftsmen will have to find other ways to enjoy our crafts.
Prompt: "Can I use two strings of four Phono Solar PS440M8GFH solar panels with a EG4 12kPV Hybrid Inverter? I want to make sure that there will not be an issue any time of year. New York upstate."
Claude 4.5: Returns within a few seconds. Does not find the PV panel specs, so it asks me if I want it to search for them. I say yes. Then it finally comes up with: "YES, your configuration is SAFE [...] MPPT range check: Your operating voltage of 131.16V fits comfortably in the 120-500V MPPT operating range".
ChatGPT 5: Returns after 78 seconds. Says: "Hot-weather Vmpp check: Vmpp_string @ STC = 4 × 32.79 = 131 V (inside 120–500 V). Using the panel’s NOCT point (31.17 V each), a typical summer operating point is ~125 V — still OK. But at very hot cell temps (≈70 °C is possible), Vmpp can drop roughly ~13% from STC → ~114 V, which is below the EG4’s 120 V MPPT lower limit. That can cause the tracker to fall out of its optimal range and reduce harvest during peak heat."
ChatGPT used deeper thinking to determine that the lowest possible voltage in the heat would be below the MPPT's minimum operating voltage. It doesn't indicate that in reality it might not charge at all at that point... but it does point out the risk, whereas Claude says everything is fine. I need about 5 back-and-forths with Claude to get it to finally realize its mistake.
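If you want to sanity-check the numbers both models are quoting, the underlying arithmetic is just a linear temperature correction on the datasheet voltages. A rough sketch - the -0.30 %/°C and -0.27 %/°C coefficients are typical datasheet values I'm assuming, not confirmed specs for this exact panel:
```
def string_voltage(v_per_panel, n_panels, temp_coeff_pct_per_c, cell_temp_c, stc_temp_c=25):
    """Scale one panel's voltage linearly with cell temperature, then multiply by string length."""
    delta_t = cell_temp_c - stc_temp_c
    return n_panels * v_per_panel * (1 + temp_coeff_pct_per_c / 100 * delta_t)

# Hot-weather Vmpp check (assumed Vmpp coefficient ~ -0.30 %/C at ~70 C cell temp):
print(string_voltage(32.79, 4, -0.30, 70))   # ~113.5 V -> below the 120 V MPPT floor
# Cold-weather Voc check (assumed Voc coefficient ~ -0.27 %/C at -30 C worst case):
print(string_voltage(39.61, 4, -0.27, -30))  # ~182 V -> far below the 600 V input limit
```
Which matches the two failure modes being debated: the cold-weather Voc is nowhere near the limit, but the hot-weather Vmpp can dip under the MPPT window.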
Claude 4.5 after a few web searches and running a couple python scripts for analysis:
Yes, your configuration should work!
Based on my analysis, two strings of four Phono Solar PS440M8GFH panels will be compatible with the EG4 12kPV inverter for upstate New York conditions.
Key Findings:

Voltage Safety:
- Cold weather maximum (-25°C/-13°F): 182V - well below the 600V limit (only 30% of maximum)
- Standard operating voltage: 128V - comfortably within the 120-500V MPPT range
- Hot weather minimum (40°C/104°F panel temp): 121V - just above the 120V MPPT minimum

Current:
- Operating current: ~13.8A per string - well within the 25A MPPT limit (55% of capacity)

Total System:
- 8 panels × 440W = 3,520W (3.5kW) - well below the 12kW inverter rating

Important Considerations:
- Hot weather margin is tight: at extreme hot temperatures, the voltage drops to about 121V, which is only 1V above the MPPT minimum. This means the system will work, but efficiency might be slightly reduced on the hottest days; the MPPT controller should still track power effectively.
- More robust alternative: if you want more safety margin, consider 5 panels per string instead:
  - Cold: 228V (still safe)
  - Hot: 151V (much better margin above 120V minimum)
  - Total: 10 panels = 4.4kW
- Wire each string to a separate MPPT on the EG4 12kPV (it has 2 MPPTs), which is perfect for your 2-string configuration.

Bottom Line:
Your planned configuration of 2 strings × 4 panels will work year-round in upstate New York without safety issues. The system is conservatively sized and should perform well!
I'm not sure how you got that answer and how I got mine. It's possible you don't have extended thinking and Web search enabled?
Either way, the point still stands, unless what claude told me is invalid in your opinion.
Now let me calculate whether your configuration will work. I need to determine the cold weather voltage to ensure it stays within the inverter's 600V maximum.

Panel specs I found: Voc (STC): 39.61V; Vmp: ~33.5V (estimated based on typical ratio); Imp: ~13.1A (440W / 33.5V).

Critical issue: temperature coefficient. The PS440M8GFH likely has a Voc temperature coefficient around -0.25% to -0.30%/°C (typical for N-Type TOPCon panels). Let me use -0.27%/°C as a conservative estimate.

Your configuration: 2 strings × 4 panels = 8 panels total. String voltage at STC: 4 × 39.61V = 158.44V.

Cold weather calculation for upstate NY: upstate NY can see extreme lows of -20°F to -30°F (-29°C to -34°C). Using -30°C as worst-case: temperature difference from STC: 25°C - (-30°C) = 55°C. Voltage increase = 39.61V × 0.27% × 55°C = 5.88V per panel. Cold weather Voc per panel = 39.61V + 5.88V = 45.49V. String voltage at -30°C: 4 × 45.49V = 181.96V.

Analysis: Max voltage: 182V << 600V - safe, plenty of margin. MPPT range: 182V is within 120-500V - will track properly. Current: 13.1A << 25A per MPPT - no problem.

However, there are concerns: You're only using 182V max when the inverter can handle up to a 500V MPPT range. This is inefficient - you're leaving significant wire sizing benefits on the table. With a 600V max, you could run 12-13 panels per string instead of 4. Power utilization: 8 × 440W = 3,520W total, but you have 12kW of PV capacity available (15kW recommended max); you're using less than 30% of the inverter's solar input capability. Each string on a separate MPPT wastes one tracker - both strings are identical in orientation/shading, so they'd perform identically.

Will it work? Yes, safely. Is it optimal? No.
Anecdata is welcome as it seems like it's the only thing available sometimes.
I feel extremely disempowered that these coding sessions are effectively black box, and non-reproducible. It feels like I am coding with nothing but hopes and dreams, and the connection between my will and the patterns of energy is so tenuous I almost don't feel like touching a computer again.
A lack of determinism comes from many places, but primarily: 1) the models change, 2) the models are not deterministic, 3) the history of tool use and chat input is not available as a first-class artifact.
I would love to see a tool that logs the full history of all agents that sculpt a codebase, including the inputs to tools, tool versions, and any other sources of entropy. Logging the seeds of the RNGs that drive LLM output would be the final piece that would give me confidence to consider using these tools seriously.
I write this now after what I am calling "AI disillusionment", a state where I feel so disconnected from my codebase I'd rather just delete it than continue.
Having a set of breadcrumbs would give me at least a modicum of confidence that the work was reproducible and not the product of some modern ghost, completely detached from my will.
Of course this would require actually owning the full LLM.
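A minimal version of those breadcrumbs is just an append-only trace of every model call and tool invocation. Purely a hypothetical sketch - the file name and fields are made up, and nothing fills in the seed unless a provider ever exposes one:
```
import hashlib
import json
import time
from pathlib import Path

TRACE = Path("agent_trace.jsonl")

def log_event(event_type, payload, model_id=None, seed=None):
    """Append one provenance record (prompt, tool call, patch, ...) as a JSON line."""
    record = {
        "ts": time.time(),
        "type": event_type,        # e.g. "prompt", "tool_call", "patch"
        "model_id": model_id,      # exact model/version string, if known
        "seed": seed,              # sampling seed, if the provider exposes one
        "payload": payload,
        "payload_sha256": hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()
        ).hexdigest(),
    }
    with TRACE.open("a") as f:
        f.write(json.dumps(record) + "\n")

log_event("prompt", {"text": "refactor the vertex loop"}, model_id="claude-sonnet-4.5")
```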
If you allow your codebase to grow unfamiliar, even unrecognisable to you, that's on you, not the AI. Chasing some illusion of control via LLM output reproducibility won't fix the systemic problem of you integrating code that you do not understand.
Models themselves are deterministic - this is a huge pet peeve of mine, so excuse the tangent - but the appearance of nondeterminism comes from a few sources, and imho can be largely attributed to the probabilistic methods used to get appropriate context and enable timely responses. Here's an example of what I mean: a 52-card deck. The deck order is fixed once you shuffle it. Drawing "at random" is a probabilistic procedure on top of that fixed state. We do not call the deck probabilistic. We call the draw probabilistic. Another example: a pot of water heating on a stove. Its temperature follows deterministic physics. A cheap thermometer adds noisy, random error to each reading. We do not call the water probabilistic. We call the measurement probabilistic.
Theoretical physicists run into such problems, albeit far more complicated, and the concept for how they deal with them is called ergodicity. The models at the root of LLM's do exhibit ergodic behavior; the time average and the ensemble average of an observable are identical, i.e. the average response of a single model over a long duration and the average of many similar models at a fixed moment are equivalent.
The stove keeps burning me because I can't tell how hot it is, it feels random and the indicator light it broken.
You:
The most rigorous definition of temperature is that it is equal to the inverse of the rate of change of entropy with respect to internal energy, with the volume V and number of particles N held constant. All accessible microstates are equiprobable over a long period of time - this is the very definition of ergodicity! Yet, because of the flow of entropy, the observed macrostates will remain stable. Thus, we can say the responses of a given LLM are...
The User:
I'm calling the doctor, and getting a new stove with an indicator light.
I think my deck of cards example still holds.
You could argue I'm being uselessly pedantic, that could totally be the case, but personally I think that's cope to avoid having to think very hard.
They are including the random sampler at the end of the LLM that chooses the next token. You are talking about up to, but not including, that point. But that just gives you a list of possible output tokens with values ("probabilities"), not a single choice. You can always just choose the best one, or you could add some randomness that does a weighted sample of the next token based on those values. From the user's perspective, that final sampling step is part of the overall black box that is running to give an output, and it's fair to define "the model" to include that final random step.
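To make that concrete, here's a toy sketch of that last step (made-up logits, not any real provider's API): the forward pass hands you a fixed score list, and reproducibility depends entirely on how you pick from it.
```
import math
import random

def sample_next(logits, temperature=0.0, seed=None):
    """Pick a next-token index from raw scores.

    temperature == 0 -> greedy argmax, identical output every run;
    temperature > 0  -> weighted draw, reproducible only if you control the seed.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    weights = [math.exp(l - peak) for l in scaled]  # unnormalized softmax, numerically stable
    return random.Random(seed).choices(range(len(logits)), weights=weights, k=1)[0]

toy_logits = [2.0, 1.5, 0.1]
print(sample_next(toy_logits))                           # always index 0
print(sample_next(toy_logits, temperature=0.8, seed=7))  # same index on every run
print(sample_next(toy_logits, temperature=0.8))          # may differ run to run
```
So whether you call "the model" deterministic mostly comes down to whether you draw the boundary before or after that sampler.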
This plays back into my original comment, which you have to understand to see that the sampler, for all its "randomness", should only be seeing and picking from a variety of acceptable answers - i.e. the sample pool should contain only acceptable answers to "randomly" pick from. So when there are bad or nonsensical answers that are different every time, it's not because the models are too random, it's because they're dumb and need more training. Tweaking your sampling architecture isn't going to fully prevent that.
https://bff531bb.connectionism.pages.dev/blog/defeating-nond...
Code is the only good thing about the tech industry. Everything else is capitalist hellscape shareholder dystopia. Thinking on it, its hilarious that any self-respecting coder is excited about these tools, because what you're excited for is a world where, now, at best, your entire job is managing unpredictable AI agents while sitting in meetings all day to figure out what to tell your AI agents to build. You don't get to build the product you want. You don't get to build it how you want. You'll be a middle manager that gets to orchestrate the arguments between the middle manager you already had and the inflexible computer.
You don't have to participate in a future you aren't interested in. The other day my boss asked me if I could throw Cursor at some task we've had backlogged for a while. I said "for sure my dude" then I just did it myself. It took me like four hours, and my boss was very impressed with how fast Cursor was able to do it, and how high quality the code was. He loves the Cursor metrics dashboard for "lines accepted" or whatever, every time he screenshares he has that tab open, so sometimes I task it on complicated nonsense tasks then just throw away the results. Seeing the numbers go up makes him happy, which makes my life easier, so its a win-win. Our CTO is really proud of "what percentage of our code is AI written" but I'm fairly certain that even the engineers who use it in earnest actually commit, like, 5% of what Cursor generates (and many do not use it in earnest).
The sentiment shift I've observed among friends and coworkers has been insane over the past two months. Literally no one cares about it anymore. The usage is still there, but its a lot more either my situation or just a "spray and pray" situation that creates a ton of disillusioned water cooler conversations.
With sufficient structure and supervision, will a "team" of agents out-perform a team of humans?
Military, automotive and other industries have developed rigorous standards consisting of among other things detailed processes for developing software.
Can there be an AI waterfall? With sufficiently unambiguous, testable requirements, and a nice scaffolding of process, is it possible to achieve the dream of managers, and eliminate software engineers? My intuition is evenly split.
https://github.com/sutt/innocuous/blob/master/docs/dev-summa...
Check it out, I'd be curious of your feedback.
So I'm stuck; my sub is paused, and I cannot either cancel, or unpause and cannot speak to a human to solve this because the pause process took away all possibility of human interaction.
This is the future we live in.
It’s funny that Claude Pro says this isn’t a feature, because their chatbot gave me instructions on how to unpause via the UI (although said UI does not exist) so the bot seems to know it’s a feature.
It's just a bug. Chill. Wait a business day and try again. You write as if you've never experienced a bug before.
As much as I hate to say it, I don’t have a large Twitter following, so the only method I have to raise awareness of this issue is to try to piggyback on a big announcement like this on HN that will have visible discussion. I don’t always have the luxury of just chilling and waiting indefinitely.
It’s the same reason why many are becoming evangelists of hosting their own email, note apps, etc.
ok, you think it's a promising field and you want to explore it, fine. Go for it.
Just stop pretending that what these models are currently doing is good enough to replace programmers.
I use LLMs a lot, even for explaining documentation.
I used to use them for writing _some_ code, but I have never ever gotten a code sample over 10 lines that was not in need of heavy modifications to make it work correctly.
Some people are pretending to write hundreds of lines of code with LLMs, even entire applications. All I have to say is "lol".
I haven’t written a function by hand in 18 months.
Not doubting but it helps to contextualize things
It's a startup for finding grants. We have California state, federal, non-profit, and California city/county grants. My landing page absolutely sucks, but if you sign up / upload some papers or make some search cards you'll like the experience.
I'm very excited to try out the new Qwen XL that came out recently for visual design. I could really use some better communication to users of the capabilities of the platform.
A lot of content is off-screen, and can't be swiped back on-screen.
-
The actual app did let me through the flow, but had things like dialogs staying open and covering content (I assume a non-tech person would have missed them)
I think part of the AI gap right now is the UI/UX you're expecting. AI isn't able to meet the bar for UI work for me at all right now. I do use it via Tab completion, or chat for specific scaffolding, but the agentic form is rough
But you're inventing the strawman that anyone who thinks the AI isn't there yet is in denial... some of us just have a higher bar than tailwind slop.
(Ironically I think SWEs most impressed by current agents are really done for... no niche non-tech knowledge to translate into novel software. But also don't have great taste/product sense: otherwise no one would have to point out that the UI/UX is not good enough.)
You're not having a discussion in good faith.
The comments I replied to cover it quite well:
> Very interesting observation. I haven’t written a function by hand in 18 months.
> Same. I haven't written any code by hand in some time. Oh well. I guess I'm just doing it wrong.
Using LLMs to help doesn't make it vibe coding: these are people claiming they write no code at all and only prompt agents.
You can't build a complex product with good UX/UI this way... but you can generate Tailwind slop.
I have some publicly accessible projects there.
Otherwise, I do professional software consulting + a variety of projects as an independent.
Whether you agree or not, the market has spoken. New grad hiring is WAY down. Fresh CS grads are having a hell of a time finding work compared to 2 years ago.
But hearing your 10 line constraint gives me a very https://en.wikipedia.org/wiki/Infinite_monkey_theorem vibe to the challenge.
I have also seen it fail on far simpler tasks.
It varies so much depending on what you are doing, the language etc that generic proclamations "it works!" or "it doesn't work!" are pretty much meaningless.
That aside, you seem to be conflating "it works" with "good enough to replace programmers", but these aren't synonyms.
And on the gripping hand, one way to "make" it work is simply to lower the standards. Which our industry has been doing aplenty for a long time now even before AI, so we shouldn't be surprised when top management drives it to its logical completion.
In my case I found having it always Ultrathink and to always work with TDD to work well. Also you have to use Plan Mode first and refine the plan. “What clarifying questions do you have for me” prompts me with a handful of numbered questions that are always really beneficial for refining the plan.
Will this checkpoint help with chat memory and disregard the latest chat's info?
I use WSL under Windows, VSCode with the WSL plugin, and Claude-Code installed on Ubuntu 24. It is generally solid and has no issue with this setup.
What I'm not liking is it seems even... lazier... than previously. By which I mean the classic "This is getting complicated so..." (followed by cop-out, dropping the original task and motivation).
There's also a bug where compaction becomes impossible. ("conversation too long" and its advice on how to fix doesn't work)
I have seen this issue with every model so far
"Differentiate y(x) w.r.t x, gamma and beta."
It not only produced the correct result, but it understood the context - I didn't tell it the context was layer norm, back-propagation and matrices.
This release is a step function for my use cases.
My screenshot came from here: https://docs.pytorch.org/docs/stable/generated/torch.nn.Laye...
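For context, here is the definition behind that prompt (as given in the torch.nn.LayerNorm doc linked above) and its gradients. The gamma and beta derivatives follow immediately; the Jacobian with respect to x is the standard textbook result, reproduced here from memory, so treat it as a sketch:
```
y_i = \gamma_i \hat{x}_i + \beta_i,
\qquad
\hat{x}_i = \frac{x_i - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}}

\frac{\partial y_i}{\partial \beta_i} = 1,
\qquad
\frac{\partial y_i}{\partial \gamma_i} = \hat{x}_i,
\qquad
\frac{\partial y_i}{\partial x_j}
  = \frac{\gamma_i}{\sqrt{\mathrm{Var}[x] + \epsilon}}
    \left(\delta_{ij} - \frac{1 + \hat{x}_i \hat{x}_j}{N}\right)
```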
How many years till AI can be trusted to deploy changes to production without supervision?
Maybe never.
I doubt LLM benchmarks more and more, what are they even testing?
How well the LLM does on the benchmarks. Obviously.
:P
It's some 3rd party thing for Elixir, a niche within a niche. I wouldn't expect an LLM to do well there.
> I doubt LLM benchmarks more and more, what are they even testing?
Probably testing by asking it to solve a problem with python or (java|type)script. Perhaps not even specifying a language and watching it generate a generic React application.
If your benchmark suite became popular enough and folks referenced it, the people training the LLMs would most likely try to make the model better at those languages.
Sometimes people expect to use LLMs to unearth hard to find information.
In reality, LLMs seem to quickly fall apart when you go from ubiquitous libraries with 200k stars on GitHub to one with "just" 1k stars.
What makes the situation worse is the way LLMs fail. Hallucinations where it goes "my usage example did not work because you are on the wrong version of the library/using the wrong SDK" etc. are super common in this scenario. This leads to further time wasted trying to apply reasonably plausible fixes that are entirely hallucinated.
> Reading the PDF skill documentation to create the resume PDF
> Here are the files and directories up to 2 levels deep in /mnt/skills/public/pdf, excluding hidden items and node_modules:
here's what we think about it:
- still sucks at generating pretty ui
- great for creative writing and long-form planning
- it’s really fast but not smarter than gpt-5
- pairs well with external tools/agents for research and automation
- comes with a 1m token context window, so you can feed it monstrous codebases or giant docs
- still hallucinates or stumbles on complex requests
It was a context merging task for my unorganized collection of agents… it sort of made sense, but was the exact reason I was asking it to do it… like you’re the bot, lol
CC switched to Sonnet 4.5 by default yesterday, I'm just very unimpressed. It seems like a considerable regression. Probably this is related to me using it to write Rust and not Python or JS/TS?
Example: I asked it to refactor a for loop to functional code with rayon, compiler barfs about mutation (it used par_iter()). It rolls back the code to what it was before. Then this happens:
Me: There is par_iter_mut().
Sonnet: Ah yes, but we can't use par_iter_mut() on self.vertices because we're calling self.set_vertex_position() which needs mutable access to the entire mesh (not just the vertices map).
However, self is &mut. Wtf?
This would have never happened with Opus, i.e. Opus would have used par_iter_mut() to start with (or applied the right fix w/o hand-holding after the compile failed with par_iter()).
I've had a bunch of those wtfs since yesterday, up from more or less zero before. I.e. it doesn't feel like a coincidence.
but: https://imgur.com/a/462T4Fu