Just ask Claude to generate a tool that does this, duh! And tell Claude to make the changes to your side project, and then to have sex with your wife too, since it's doing all the fun parts.
The key feature: use aliases instead of hardcoding model IDs. Your code references "summarizer", and a version-controlled lockfile maps it to the actual model. Switch providers by changing the lockfile, not your code.
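For illustration, here's a minimal sketch in Python of what that alias/lockfile idea can look like - the file name, JSON format, and resolve_alias helper below are my own assumptions for the example, not llmring's actual API:

    # Hypothetical, version-controlled lockfile (llmring.lock.json):
    # {
    #   "summarizer": {"provider": "anthropic", "model": "claude-sonnet-4-5"},
    #   "extractor":  {"provider": "openai", "model": "gpt-5-codex"}
    # }
    import json
    from pathlib import Path

    def resolve_alias(alias: str, lockfile: str = "llmring.lock.json") -> dict:
        # Code refers only to the alias; the lockfile decides which
        # provider/model actually backs it.
        lock = json.loads(Path(lockfile).read_text())
        return lock[alias]

    binding = resolve_alias("summarizer")
    print(f"summarizer -> {binding['provider']}:{binding['model']}")

Switching providers is then a one-line lockfile change, and the diff shows up in review like any other config change.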
Also handles streaming, tool calling, and structured output consistently across providers. Plus a human-curated registry (https://llmring.github.io/registry/) that I keep updated with current model capabilities and pricing - helpful when choosing models.
MIT licensed, works standalone. I am using it in several projects, but it's probably not ready to be presented in polite society yet.
However, my subjective personal experience was that GPT-5-Codex was far better at complex problems than Claude Code.
This has been outstanding for the AI-assisted development I've been doing as of late.
/compact helps by reducing crap in your context, but you can go further. Try to watch the % context remaining and not go below 50% if possible - learn to choose tasks that don't require more context than the models can handle well.
I like to have it come up with a detailed plan in a markdown doc, work on a branch, and commit often. Seems not to have any issues getting back on task.
Obviously subjective take based on the work I'm doing, but I found context management to be way worse with Claude Code. In fact I felt like context management was taking up half of my time with CC and hated that. Like I was always worried about it, so it was taking up space in my brain. I never got a chance to play with CC's new 1m context though, so that might be a thing of the past.
My use case does better with the latter, because the agent frequently fails at a step and then can't look back at intermediate results.

E.g.

    command | complicated grep | complicated sed

is way worse than the multistep

    command > tmpfile

and then grep etc., because the latter can reuse tmpfile if the grep is wrong.
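A rough Python sketch of why the tmpfile variant is more forgiving for an agent (the command and filter pattern here are placeholders I picked for the example):

    import re
    import subprocess
    import tempfile

    # Step 1: run the (possibly expensive) command once and keep its output.
    result = subprocess.run(["git", "log", "--stat"], capture_output=True, text=True)
    with tempfile.NamedTemporaryFile("w", delete=False, suffix=".txt") as tmp:
        tmp.write(result.stdout)
        tmpfile = tmp.name

    # Step 2: filter the saved output. If this pattern turns out to be wrong,
    # the agent can re-read tmpfile and retry instead of rerunning step 1.
    with open(tmpfile) as f:
        matches = [line for line in f if re.search(r"\.py\s+\|", line)]
    print(f"{len(matches)} matching lines; intermediate kept at {tmpfile}")

With a single pipeline, a bad grep or sed means redoing everything and losing the intermediate output the agent could have inspected.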
I find there's a quite large spread in ability between various models. Claude models seem to work superbly for me, though I'm not sure whether that's just a quirk of what my projects look like.
Sometimes, in between this variability of performance, it pops up a little survey: "How's Claude doing this session from 1-5? 5 being great." And I suspect I'm in some experiment of extremely low performance. I'm actually at the point where I get the feeling peak-hour weekdays are terrible and odd-hour weekends are great, even when forcing a specific model.
While there is some non-determinism, it really does feel like performance is actually quite variable. It would make sense that they scale up and down depending on utilization, right? There was a post a week ago from Anthropic acknowledging terrible model performance in parts of August due to an experiment. Perhaps at peak hours GPT also has more datacenter capacity and doesn't get degraded as badly? No idea for sure, but it is frustrating when simple asks fail and complex asks succeed without it being clear to me why that may be.
It would, but
> To state it plainly: We never reduce model quality due to demand, time of day, or server load.
https://www.anthropic.com/engineering/a-postmortem-of-three-...
Whether you believe them or not is another matter, but that's what they themselves say.
After all, using a different context window, subbing in a differently quantized model, throttling response length, rate limiting features aren’t technically “reducing model quality”.
It also consistently gets into drama with the other agents. E.g., the other day when I told it we were switching to Claude Code for executing changes, after badmouthing Claude's entirely reasonable and measured analysis it went ahead and decided to `git reset --hard`, even after I twice pushed back on that idea.
Whereas Gemini and Claude are excellent collaborators.
When I do decide to hail mary via GPT-5, I now refer to the other agents as "another agent". But honestly the whole thing has me entirely sketched out.
To be clear, I don't think this was intentionally encoded into GPT-5. What I really think is that OpenAI leadership simply squandered all its good energy and is now coming from behind. Its excellent talent either got demoralized or left.
To be clear, I don't believe that there was any _intention_ of malice or that the behavior was literally envious in a human sense. Moreso I think they haven't properly aligned GPT-5 to deal with cases like this.
However, it’s the early days of learning this new interface, and there’s a lot to learn - certainly some amount of personification has been proven to help the LLM by giving it a “role”, so I’d only criticize the degree rather than the entire concept.
It reminds me of the early days of search engines when everyone had a different knack for which search engine to use for what and precisely what to type to get good search results.
Hopefully eventually we’ll all mostly figure it out.
Lots of other people also follow the architect and builder pattern, where one agent architects the feature while the other agent does the actual implementation.
The power of using LLMs is working out what it has encoded and how to access it.
> “My media panel has a Cat6 patch panel but no visible ONT or labeled RJ45 hand-off. Please locate/activate the Ethernet hand-off for my unit and tell me which jack in the panel is the feed so I can patch it to the Living Room.”
Really, GPT? Not just “can you set up the WiFi”??!
If GPT-5 is learning to fight and undo other models, we're in for a bright future. Twice as bright.
So this is something I've noticed with GPT (Codex). It really loves to use git. If you have it do something and then later change your mind and ask it to undo the changes it just made, there's a decent chance it's going to revert to the previous git commit, regardless of whether that includes reverting whole chunks of code it shouldn't.
It also likes to occasionally notice changes it didn't make and decide they were unintended side effects and revert them to the last commit. Like if you made some tweaks and didn't tell it, there's a chance it will rip them out.
Claude Code doesn't do this, or at least I never noticed it doing this. However, it has its own medley of problems, of course.
When I work with Codex, I really lean into a git workflow: everything goes on a branch, and I commit often. It's not how I'd normally do things, but it doesn't really cost me anything to adopt it.
These agents have their own pseudo personalities, and I've found that fighting against it is like swimming upstream. I'm far more productive when I find a way to work "with" the model. I don't think you need a bunch of MCPs or boilerplate instructions that just fill up their context. Just adapt your workflow instead.
You could just say it’s another GPT-5 instance.
I wonder how long it will be before we get Opus 4.5
There's still a lot of low hanging fruit apparently
Pervert.
Charting Claude's progress with Sonnet 4.5: https://youtu.be/cu1iRoc1wBo
I am going to give this another shot but it will cost me $50 just to try it on a real project :(
Perhaps Tesla FSD is a similar example, where in principle self-driving with vision alone should be possible (humans do it), but it is fundamentally harder and more error-prone than having better data. It seems to me very error-prone and expensive in tokens to use computer screens as a fundamental unit.
But at the same time, I'm sure there are many tasks which could be automated as well, so shrug.
https://jsbin.com/hiruvubona/edit?html,output
https://claude.ai/share/618abbbf-6a41-45c0-bdc0-28794baa1b6c
But just for thrills I also asked for a "punk rocker"[2] and the result--while not perfect--is leaps and bounds above anything from the last generation.
0 -- ok, here's the first hurdle! It's giving me "something went wrong" when I try to get a share link on any of my artifacts. So for now it'll have to be a "trust me bro" and I'll try to edit this comment soon.
Edit: just to show my point, a regular human on a bicycle is way worse with the same model: https://i.imgur.com/flxSJI9.png
I bet their ability to produce a pelican results purely from someone already having done it before.
It is extremely common, since it's used to benchmark every single LLM.
And there is no logic to it; LLMs are never trained on graphics tasks, and they don't see the output of the code.
I understand that they may have not published the results for sonnet 4.5 yet, but I would expect the other models to match...
Pretty solid progress for roughly 4 months.
Tongue in cheek: if we progress linearly from here software engineering as defined by SWE bench is solved in 23 months.
Curious to see that in practice, but great if true!
Opus 4.1: made weird choices, eventually got to a meh solution I just rolled back.
Codex: took a disgusting amount of time, but the result was vastly superior to Opus. Night and day superiority. Output was still not what I wanted.
Sonnet 4.5: not clearly better than Opus. Categorically worse decision-making than Codex. Very fast.
Codex was night and day the best. Codex scares me, Claude feels like a useful tool.
If we saw task performance week 1 vs week 8 on benchmarks, this would at least give us more insight into the loop here. In an environment lacking true progress a company could surely "show" it with this strategy.
I'm glad they at least gave me the full $100 refund.
Maybe we’re entering the Emo Claude era.
Per the system card: In 250k real conversations, Claude Sonnet 4.5 expressed happiness about half as often as Claude 4, though distress remained steady.
It’s a simple substitution request where I provide a Lint error that suggests the correct change. All the models fail. I could ask someone with no development experience to do this change and they could.
I worry everyone is chasing benchmarks to the detriment of general performance. Or the next-token weights for the incorrect change outweigh my simple but precise instructions. Either way it's no good.
Edit: With a follow-up "please do what I asked" sort of prompt it came through, while Opus just loops. So there's that, at least.
I've been worried about this for a while. I feel like Claude in particular took a step back in my own subjective performance evaluation in the switch from 3.7 to 4, while the benchmark scores leaped substantially.
To be fair, benchmarking has always been the most difficult problem to solve in this space, so it's not surprising that benchmark development isn't exactly keeping pace with all of the modeling/training development happening.
The only way around this is to never report on the same benchmark version twice, but they include too many benchmarks to realistically do that every release.
I don't understand why this kind of thing is useful. Do the thing yourself and move on. For every one problem like this, AI can do 10 better/faster than I can.
Why is this? Does Anthropic have just higher infrastructure costs compared to OpenAI/xAI?
Is it number of lines? Tickets closed? PRs opened or merged? Number of happy customers?
Have you heard of that study that shows AI actually makes developers less productive, but they think it makes them more productive??
EDIT: sorry all, I was being sarcastic in the above, which isn't ideal. Just annoyed because that "study" was catnip to people who already hated AI, and they (over-) cite it constantly as "evidence" supporting their preexisting bias against AI.
Have you looked into that study? There's a lot wrong with it, and it's been discussed ad nauseam.
Also, what a great catch 22, where we can't trust our own experiences! In fact, I just did a study and my findings are that everyone would be happier if they each sent me $100. What's crazy is that those who thought it wouldn't make them happier, did in fact end up happier, so ignore those naysayers!
Or does he now just get to work for 2 hours and enjoy the remaining 6 hours doing meaningful things apart from staring at a screen?
Of the two of you, I know which one I'd bet on being "right". (Hint: it's the one talking about their own experience, not the one projecting theirs onto someone else.)
To that poster:
Literally everyone in development is using AI.
The difference is "negative" people can clearly see that it's on a trajectory in the NEAR, not even distant, future to completely eat your earnings, so they're not thrilled.
You're in the forest and you're going "Wow, look at all these trees! Cool!"
The hubris is thinking that you're a permanent indispensable part of the loop.
Most of the anti-AI comments I see on HN are NOT a version of "the problem with AI is that it's so good it's going to replace me!"
We birthed a level of cognition out of silicon that nobody would have imagined even just four years ago. Sorry, but some brogrammers being worried about making ends meet is making me laugh - it's all the same people who have been automating everyone else's jobs for the past two decades (and getting paid extremely fat salaries for it), and you're telling me now we're all supposed to be worried because it's going to affect our salaries?
Come on. You think everyone who's "vibe coding" doesn't understand the pointlessness of 90% of codemonkey work? Hell, most smart engineers understood that pointlessness years ago. Most coders work on boring CRUD apps and REST APIs to make revenue go up 0.02%. And those that aren't, are probably working on ads.
It's a fraction of a fraction that is at all working on interesting things.
Personally, yeah, I saw it coming and instead of "accepting fate", I created an AI research lab. And I diversified the hell out of my skillset as well - started working way out of my comfort zone. If you want to keep up with changing times, start challenging yourself.
Hey, so if I DO see it, can I stop it from happening?
What "discussion" do you want to have? Another round of "LLMs are terrible at embedded hardware programming ergo they're useless"? Maybe with a dash of "LLMs don't write bug-free software [but I do]" to close it off?
The discussions that are at all advancing the state of the art are happening on forums that accept reality as a matter of fact, without people constantly trying to pretend things because they're worried they'll lose their job if they don't.
That's... not super surprising? SwiftUI changes pretty dang often, and the knowledge cutoff doesn't progress fast enough to cover every use-case.
I use Claude to write GTK interfaces, which is a UI library with a much slower update cadence. LLMs seem to have a pretty easy time working with bog-standard libraries that don't make giant idiomatic changes.
Checkmate, aitheists.
You want proof for critical/supportive claims? Yet almost in the same sentence you make an insane claim without backing it up with any evidence.
Here are a few projects that I made these past few months that wouldn't have been possible without LLMs:
* https://github.com/skorokithakis/dracula - A simple blood test viewer.
* https://www.askhuxley.com - A general helper/secretary/agent.
* https://www.writelucid.cc - A business document/spec writing tool I'm working on, it asks you questions one at a time, writes a document, then critiques the idea to help you strengthen it.
* A rotary phone that's a USB headset and closes your meeting when you hang up the phone, complete with the rotary dial actually typing in numbers.
* Made some long-overdue updates on my pastebin, https://www.pastery.net, to improve general functionality.
* https://github.com/skorokithakis/support-email-bot - A customer support bot to answer general questions about my projects to save me time on the easy stuff, works great.
* https://github.com/skorokithakis/justone - A static HTML page for the board game Just One, so you can play with your friends when you're physically together, without needing to bring the game along.
* https://github.com/skorokithakis/dox - A thing to run Dockerized CLI programs as if they weren't Dockerized.
I'm probably forgetting a lot more, but I honestly wouldn't have been bothered to start any of the above if not for LLMs, as I'm too old to code but not too old to make stuff.
EDIT: dang can we please get a bit better Markdown support? At least being able to make lists would be good!
1 is not infinitely greater than 0.
What are the specific tasks + prompts giving you a 3x increase in output, and conversely, what tasks don't work at all?
After an admittedly cursory scan of your blog and the repos in your GH account I don't find anything in this direction.
https://www.theverge.com/ai-artificial-intelligence/787524/a...
Yeah, maybe it is garbage. But it is still another milestone, if it can do this, then it probably does ok with the smaller things.
This keeps incrementing from "garbage" to "wow this is amazing" at each new level. We're already forgetting that this was unbelievable magic a couple years ago.
> I am still surprised at things it cannot do, for example Claude code could not seem to stitch together three screens in an iOS app using the latest SwiftUI (I am not an iOS dev).
You made a critical comment yet didn't follow your own rules lol.
> it's so helpful for meaningful conversation!
How so?
FWIW - I too have used LLMs for both coding and personal prompting. I think the general conclusion is that when it works, it works well, but when it fails it can fail miserably and be disastrous. I've come to this conclusion from reading people's complaints here and through my own experience.
Here's the problem:
- It's not valuable for me to print out my whole prompt sequence (and context for that matter) in a message board. The effort is boundless and the return is minimal.
- LLMs should just work(TM). The fact that they can fail so spectacularly is a glaring issue. These aren't just bugs, they are foundational because LLMs by their nature are probabilistic and not deterministic. Which means providing specific defect criteria has limited value.
Have you tried it in the new Xcode extension? That tool is surprisingly good in my limited use - one of the few times Xcode has impressed me in my 2 years of using it. I've read some anecdotes that Claude in the Xcode tool is more accurate than standard Claude Code for Swift. I haven't noticed that myself, but I've only used the Xcode tool twice so far.
This was in stark contrast to my experience with TypeScript/NextJS, Python, and C#. Most of the time output quality for these was at least usefully good. Occasionally you’d get stuck in a tarpit of bullshit/hallucination around anything very new that hadn’t been in the training dataset for the model release you were using.
My take: there simply isn’t the community, thought leadership, and sheer volume of content around Swift that there is around these other languages. This means both lower quantity and lower quality of training data for Swift as compared to these other languages.
And that, unfortunately, plays negatively into the quality of LLM output for app development in Swift.
(Anyone who knows better, feel free to shoot me down.)
edit: as far as what the numbers mean, they are arbitrary. They are only useful insofar as you can run two models (or two versions of the same model) on the same benchmark, and compare the numbers. But on an absolute scale the numbers don't mean anything.
Also, as a Max $200 user, it feels weird to be paying for an Opus-tailored sub when the standard Max $100 would now be preferable, since they claim Sonnet is better than Opus.
Hope they have Opus 4.5 coming out soon, or next month I'm downgrading.
I used to use CC, but I switched to Codex (and it was much better)... now I guess I have to switch back to CC, at least to test it.
I use AI for different things, though, including proofreading posts on political topics. I have run into situations where ChatGPT just freezes and refuses. Example: discussing the recent rape case involving a 12-year-old in Austria. I assume its guardrails detect "sex + kid" and give a hard "no" regardless of the actual context or content.
That is unacceptable.
That's like your word processor refusing to let you write about sensitive topics. It's a tool, it doesn't get to make that choice.
As a rather hilarious and really annoying related issue - I have a real use where the application I'm working on is partially monitoring/analyzing the bloodlines of some rather specific/ancient mammals used in competition and... well.. it doesn't like terms like "breeders" and "breeding"
I wonder if the 1m token context length is coming for this ride too?
I don't know if it's me, but over the last few weeks I've got to the conclusion ChatGPT is very strongly leading the race. Every answer it gives me is better - it's more concise and more informative.
I look forward to testing this further, but out of the few runs I just did after reading about this - it isn't looking much better
Really curious about this since people keep bringing it up on Twitter. They mention it pretty much off-handedly in their press release and doesn't show up at all in their system card. It's only through an article on The Verge that we get more context. Apparently they told it to build a Slack clone and left it unattended for 30 hours, and it built a Slack clone using 11,000 lines of code (https://www.theverge.com/ai-artificial-intelligence/787524/a...)
I have very low expectations around what would happen if you took an LLM and let it run unattended for 30 hours on a task, so I have a lot of questions as to the quality of the output
but: https://imgur.com/a/462T4Fu