Codex vs. Claude Code (today)

https://build.ms/2025/12/22/codex-vs-claude-code-today/

104•gmays•1mo ago

Comments

N_Lens•1mo ago

A lot of (carefully hedged) pro Codex posts on HN read suspect to me. I've had mixed results with both CC and Codex and these kinds of glowing reviews have the air of marketing rather than substance.

thedelanyo•1mo ago

Exactly my thoughts. Most of these posts are what I'll say "paid posts".

pitched•1mo ago

The usage limits on Claude have been making it too hard to experiment with. Lately, I get about an hour a day before hitting session/weekly limits. With Codex, the limits are higher than my own usage so I never see them.

Because of that, everyone who is new to this will be focused on Codex and write their glowing reviews of the current state of AI tools in that context.

sebzim4500•1mo ago

For what it's worth I just switched from claude code to codex and have found it to be incredibly impressive.

You can check my history to confirm I criticize sama far too much to be an OpenAI shill.

mold_aid•1mo ago

Yeah. I can excuse bad writing, I can tolerate evangelism. I don't have patience for both.

mergesort•1mo ago

As the author of the post I think it was a nice quick post to share my perspective of a behavior I’ve been seeing across many (but not all) developers recently, but I’m always open to feedback for how to improve my writing!

And as I mentioned here (https://news.ycombinator.com/item?id=46392900) I have no affiliation with any of the organizations, nor care to evangelize any of them. Nobody pays me to write, I’m just a guy on the internet sharing his thoughts, building software, and teaching people how to use AI better with any tool people want to use. :)

baq•1mo ago

I’ve been using frontier Claude and GPT models for a loooong time (all of 2025 ;)) and I can say anecdotally the post is 100% correct. GPT codex given good enough context and harness will just go. Claude is better at interactive develop-test-iterate because it’s much faster to get a useful response, but it isn’t as thorough and/or fills in its context gaps too eagerly, so needs more guidance. Both are great tools and complement each other.

jstummbillig•1mo ago

If only fair comparisons would not be so costly, in both time and money.

For example, I have a ChatGPT and a Gemini subscription, and thus could somewhat quickly check out their products, and I have looked at a lot of the various Google AI dev ventures, but I have not yet found the energy/will to get more into Gemini CLI specifically. Antigravity with Gemini 3 pro did some really wonky stuff when I tried it.

I also have a Windsurf subscription, which allows me to look at any frontier model for coding (well, most of the time, unless there's some sort of company beef going). This I have often used to check out Anthropic models, with much less success than Codex with > GPT-5.1 – but of course, that's without using Clode Caude (which I subscribed to for a month, idk, 6 months ago, and seemed fine back then but not mind blowingly so).

Idk! Codex (mostly using the vscode extension) works really well for me right now, but I would assume this is simply true across the board: Everything has gotten so much better. If I had to put my finger on what feels best about codex right now, specifically: Least amount of oversights and mistakes when working on gnarly backend code, with the amount of steering I am willing to put into it, mostly working off of 3-4 paragraph prompts.

mergesort•1mo ago

Heya, I'm the author! I can promise you that I am 0% affiliated with OpenAI and have no qualms with calling them out for the larger moral, ethical, and societal questions that have emerged with the strategy they've pushed.

I do earnestly believe their models are currently the best to work with as software developers, but as I state in my post I think this is the state of the world today and have no premonition for that being true forever.

Same questions apply to Anthropic, Google, etc, etc — I'm not paid by anyone to say anything.

willaaam•1mo ago

This blog post lacks almost any form of substance.

It could've been shortened to: Codex is more hands off, I personally prefer that over claude's more hands-on approach. Neither are bad. I won't bring you proof or examples, this is just my opinion based on my experience.

deepdarkforest•1mo ago

> Codex is more hands off, I personally prefer that over claude's more hands-on approach

Agree, and it's a nice reflection of the individual companie's goals. OpenAI is about AGI, and they have insane pressure from investors to show that that is still the goal, hence codex when works they could say look it worked for 5 hours! Discarding that 90% of the time it's just pure trash.

While Anthropic/Boris is more about value now, more grounded/realistic, providing more consistent hence trustable/intuitive experience that you can steer. (Even if Dario says the opposite). The ceiling/best case scenario of a claude code session is a bit lower than Codex maybe, but less variance.

dworks•1mo ago

Well, if you had tried using GPT/Codex for development you would know that the output from those 5 hours would not be 90% trash, it would be close to 100% pure magic. I'm not kidding. It's incredible as long as you use a proper analyze-plan-implement-test-document process.

mergesort•1mo ago

Heya, author here! Admittedly this was a quick blog post I fired off, much shorter than my usual writing.

My goal wasn't to create a complete comparison of both tools — but to provide a little theory a behavior I'm seeing. You're (absolutely) right that it's a theory not a study, and I made sure to state that in the post. :)

Mostly though the conclusion describes pretty succinctly why I wrote the post, as a way to get more people to try more of the tools so they can adequately form their own conclusions.

> I think back to coworkers I’ve had over the years, and their varying preferences. Some people couldn’t start coding until they had a checklist of everything they needed to do to solve a problem. Others would dive right in and prototype to learn about the space they would be operating in.

> The tools we use to build are moving fast and hard to keep up with, but we’ve been blessed with a plethora of choices. The good news is that there is no wrong choice when it comes to AI. That’s why I don’t dismiss people who live in Claude Code, even though I personally prefer Codex.

> The tool you choose should match how you work, not the other way around. If you use Claude, I’d suggest trying Codex for a week to see if maybe you’re a Codex person and didn’t know it. And if you use Codex, I’d recommend trying Claude Code for a week to see if maybe you’re more of a Claude person than you thought.

> Maybe you’ll discover your current approach isn’t the best fit for you. Maybe you won’t. But I’m confident you’ll find that every AI tool has its strengths and weaknesses, and the only way to discover what they are is by using them.

willaaam•1mo ago

Hey! Didn't mean my comment negatively towards you in any way, though I now realize it might've come across as such. Blogs with opinions based on experiences alone are absolutely fine, thanks for sharing.

What I did mean is to indicate that your blog felt like a HN comment to me, where I generally expect a HN link to be news or facts that subsequently spark a discussion.

At the end of your post I guess I was hoping or expecting facts or examples, indicating it was engaging enough to read to the end.

Happy holidays!

mergesort•1mo ago

No problem at all! I read it as a bit pithy, but I didn’t think it was particularly mean spirited.

If you check out my writing on build.ms and fabisevi.ch you’ll see that the majority of it is meant to be evergreen observations of a concept or a moment in time. My goal is to make people think and to think about thinking, more than it is to tell people what exactly to think.

If I had to summarize my style in one sentence, it would walking people to and around an idea, and leaving the rest as an exercise to the reader. Naturally, this means I have less control over how people interpret my writing so I do try and cover my bases with fact and experience, but that still means sometimes I won’t deliver a complete picture to everyone.

In that case, sometimes I come to a place like HN or Bluesky or Mastodon where my post is being discussed and try add some perspective and clarity through constructive conversation. :)

If I’m being honest, I think we’re too early in the state of generative AI as a coding tool to draw very strong factual conclusions for many of our experiences using AI to code that will hold up well. I’m not implying it’s all vibes, but I think it would be pretty hard to wrap up my post in a bow the way you’re suggesting. On the other hand I’m always open to well-considered feedback — and would love to know more about your experience if you’re interested in sharing!

That’s a long way of saying happy holidays to you as well!

willaaam•1mo ago

Most of my AI coding experience is through Github Copilot (GHCP), mostly because that is available to me professionally. GHCP has improved greatly over the past half year in my opinion. I do use it a lot, burning up my enterprise allowance almost every month working on complex python codebases.

When it comes to models in GHCP, I vastly prefer Claude over Codex. It's not that Codex is bad, it just feels tonedeaf to me. It writes code in its own preferred style and doesn't adjust to the context of the codebase. Additionally, for me, Sonnet and Opus are much less prone to getting stuck in loops for longer or more complex agentic tasks.

I do like Codex for review tasks. When I'm working on something complex, both planning and implementation, I frequently ask Codex to review Claude's work, and it does a good job at that, frequently catching a mistake or coming up with a different angle.

I've toyed with kilocode, cline and the related forks through Claude Opus 4.5 API, but I'd argue my experience with Claude Sonnet/Opus through Copilot has just been... better. More consistent. Faster.

Sometimes I code with local models, when I'm working on highly confidential projects or data. Prefer GPT-OSS 20b or Qwen3-coder-30b then, but without an agentic harness as prompts get big and slow.

I would find it a nice read to work a case and see two models/harnesses duke it out, see whether it matches your expectations and gut feeling.

judahmeek•1mo ago

ADHD devs should definitely try Conductor or Crystal or some other git worktree manager.

adastra22•1mo ago

It’s funny because my use of Claude Code is the opposite. I use slash commands with instructions to find context, and basically never interact with it while it is doing its thing.

kukkeliskuu•1mo ago

How do you get it to stop to ask you something sometimes when it is doing its thing?

bdangubic•1mo ago

instruct it to stop and ask something sometimes when it is doing its thing. it is one of my core instructions at every level of its memory. if instructed, it will stop when it feels like should stop and in my personal experience it is suprisingly good at stopping. I’ve read here a lot of people having a different experience and opting for smaller tasks instead though…

kukkeliskuu•1mo ago

My question was misleading. For me Claude Code appears sometimes to stop too often at a random point instead to ask instead of keeping going. I guess that is the point of the linked article that Codex works differently in this regard.

bdangubic•1mo ago

how interesting!! I have had totally opposite experience

kukkeliskuu•1mo ago

For example, I just got: "I've identified the core issue - one of the table cells is evaluating to None, causing the "Unknown flowable type" error. This requires further debugging to identify which specific cell is problematic."

adastra22•1mo ago

I don’t. Why would I want that?

CjHuber•1mo ago

I do feel like the Codex CLI is quite a bit behind CC. If I recall correctly it took months for Codex to get the nice ToDo Tool Claude Code uses in memory to structure a task into substeps. Also I‘m missing the ability to have the main agent invoke subagents a lot.

All this of course can be added using MCPs, but it’s still friction. The Claude Code SDK is also way better than OpenAI Agents, it’s almost no comparison.

Also in general when I experienced bugs with Codex I was always almost sure to find an open GitHub issue with people already asking about a fix for months.

Still I like GPT-5.2 very much for coding and general agent tasks, and there is EveryCode which is a nice fork of Codex that mitigates a lot of shortcomings

frwickst•1mo ago

You can use Every Code [1] (a Codex fork) for this, it can invoke agents, but not just codex ones, but claude and gemini as well.

[1] https://github.com/just-every/code

CjHuber•1mo ago

Seems like you wrote at the same time I did my edit, yes Every Code is great however Ctlr+T is important to get terminal rendering otherwise is has performance problems for me

sumedh•1mo ago

> with people already asking about a fix for months.

OpenAI needs to get access to Claude Code to fix them :)

dist-epoch•1mo ago

The general consensus today is that ToDo tool is obsolete and lowers performance for frontier models (Opus 4.5, GPT-5.2)

AbrahamParangi•1mo ago

Respectfully I don’t think the author appreciates that the configurability of Claude Code is its performance advantage. I would much rather just tell it what to do and have it go do it, but I am much more able to do that with a highly configured Claude Code than with Codex which is pretty much just set at the out of the box quality level.

I spend most of my engineering time these days not on writing code or even thinking about my product, but on Claude Code configuration (which is portable so should another solution arise I can move it). Whenever Claude Code doesn’t oneshot something, that is an opportunity for improvement.

monerozcash•1mo ago

Hey, I'm not very familiar with Claude Code. Can you explain what configuration you're referring to?

Is this just things like skills and MCPs, or something else?

CharlesW•1mo ago

Skills, MCPs, /commands, agents, hooks, plugins, etc. I package https://charleswiltgen.github.io/Axiom/ as an easily-installable Claude Code plugin, and AFAICT I'm not able to do that for any other AI coding environment.

monerozcash•1mo ago

You can do basically all that with codex, although claude might have slightly more convenient tooling. The end result will be the same anyway.

CharlesW•1mo ago

That hasn't been my experience, although I'm happy to accept that I'm the problem. Apparently they've released their skills support (?), so I should try again. https://developers.openai.com/codex/skills

dist-epoch•1mo ago

OpenCode, Pi are even more configurable.

AbrahamParangi•1mo ago

I wrote my own agentic coding harness (it's quite easy) but I use claude code because opus's competence with its own tools is very high.

mergesort•1mo ago

Heya, I'm the author of the post and I just wanted to say I do appreciate the configurability! As I mentioned in the post, I have been that kind of developer in the past.

> This is a perfect match for engineers who love configuring their environments. I can’t tell you how many full days of my life I’ve lost trying out new Xcode features or researching VS Code extensions that in practice make me 0.05% more productive.

And I tried to be pretty explicit about the idea that this is a very personal choice.

> Personally — and I do emphasize this is a personal decision — I‘d rather write a well-spec’d plan and go do something else for 15 minutes. Claude’s Plan Mode is exceptional, and that‘s why so many people fall in love with Claude once they try it.2

For every person who feels like me today, there's someone who feels like you out there. And for every person who feels like you, there's someone like me (today) who finds it not as valuable to their workflow. That's the reason my conclusion was all about getting folks to try out both to see what works for them — because people change and it's worth finding out who you really at this moment in time.

Anyhow, I do think that Codex is also very configurable — I was just trying to emphasize that it's really great out the box while Claude Code requires more tuning. But that tuning makes it more personal, which as you mention is a huge plus! As I've touched on in a few posts [^1] [^2] Skills are to me a big deal, because they allow people to achieve high levels of customization without having to be the kind of developer that devotes a lot of time to creating their perfect set up. (Now supported in both Claude Code and Codex.)

I don't want this to turn into a bit of a ramble so I'll just say that I agree with you — but also there's a lot of nuance here because we're all having very personal coding experiences with AI — so it may not entirely sound like I agree with you. :)

Would love to hear more about your specific customizations, to make sure that I'm not missing out on anything valuable. :D

[1]: https://build.ms/2025/10/17/your-first-claude-skill/ [2]: https://build.ms/2025/12/1/scribblenauts-for-software/

AbrahamParangi•1mo ago

To be quite clear, I hate configuring my environment. I hate it. The farther I get from creating things that people can use, the less I like it. I spend most of my time on claude config not because I enjoy the experience per se but because it's SO USEFUL to do so.

mergesort•1mo ago

To be honest that's most of my pitch for Codex in the blog post. Codex works great without any configuration, and amazingly with. If you want to spend less time configuring then maybe Codex is the right agentic system for you.

I don't want to restate my thesis too much — but I really do believe it's worth experimenting with these tools every couple of months to see if the latest updates better match your preferences.

motoboi•1mo ago

It's hard to compare the two tools because they change so much and so fast.

Right now, as an example, claude code with opus 4.5 is a beast, but before that, with sonnet 4.0, codex was much better.

Gemini-cli, on the other hand, with gemini-flash-3.0 (which is strangely good for the "small and fast" model), it's very good (but the cli and the user experience are not on par with codex or claude yet).

So we need to be in constant observations of those tools. Currently (after gemini-flash-3.0 came out), I tend to submit the same task to claude (with opus) and gemini to understand the behaviour. gemini is surprising me.

mergesort•1mo ago

Heya, author here! I completely agree with you — and why the post is titled Codex vs. Claude Code (Today). I also have this very specific disclaimer in the second paragraph to note that this post is a reflection of a moment in time. :D

> Before we continue, I need to make a disclaimer: This post is about the Claude Code and Codex, on December 22, 2025. Everything in AI changes so fast that I have almost no expectations about the validity of these statements in a year, or probably even 3-6 months from now.

That said I do what you do and try different models when I want to see if things have changed. I run my own private little benchmarks with a few complex real world tasks, and I really love seeing how things are progressing — both in terms of quality but also the novel quirks that are introduced, changed, or removed. :)

sixhobbits•1mo ago

This is an interesting opinion but I would like to see some proof or at least more details.

What plans are you using, what did you build, what was the output from both on similar inputs, what's an example of a prompt that took you two hours to write, what was the output, etc?

mergesort•1mo ago

Heya, author here!

I'll try to answer these one by one, but I will just note that a lot of my prompts are domain specific so it's hard to share those.

- I don't use any plans — my writing is the plan. The Plan Mode in Claude Code is excellent, but as I've switched to Codex (which doesn't have one) I will simply write up a nice long prompt and then add "Please ask any clarifying questions you may have, or for any additional details that you need" — and it works great! I may go back and forth for anywhere from 5-30 minutes depending on what else is needed, but that's basically the experience of using Plan Mode in Claude Code too.

- I've built quite a few recent features for my app Plinky [^1]. I've made a few meaningful contributions to my open source project Boutique [^2] (and have been having AI asynchronously sketch out a large new database relationships feature). I built my new blog and my workshops pages [^3] with Codex as well. Truth is I do practically everything in Codex and Claude Code these days, so I'd have more trouble listing what I haven't built lately.

- Plinky's upcoming Reader Mode is a good example of a prompt that took me two hours, but the feature isn't yet in the app so I'd prefer not to share the prompt. But I can share the first draft of the prompt for Boutique's relationships feature sine that's open source. [^4] I've been experimenting with using ChatGPT Pulse to make progress on it every day (simply by asking it to!), and much to my surprise it's been designing a new API day by day in a way that's far from perfect but certaintly has been very interesting.

The honest truth is that this one did not take two hours and I wrote it on the bus so it's probably not perfect, but the descriptive process is effectively the same. For a feature like Reader Mode you would have to capture more details to scale up to the additional complexity of a domain-specific feature with client and server components, a new download queueing pipeline, amongst other abstractions.

Hope that answers your questions!

[^1]: https://plinky.app [^2]: https://github.com/mergesort/Boutique [^3]: https://build.ms [^4]: https://gist.github.com/mergesort/04a77c47ea4cb6433aa9ade4e1...

sixhobbits•1mo ago

Thanks! Would be interesting to compare with a looser approach, I use shorter prompts that take me a few seconds to write and it's also built some impressive stuff for me, I've mainly been using claude code and a bit of Gemini/Amp/Droid so I'll add Codex to the mix this week. So far I don't really feel the difference much between tool or model.

Rperry2174•1mo ago

I've noticed a lot of these posts tend to go codex vs claude, but as author is someone who does AI workshops curious why Cursor is left out of this post (and more generally posts like this).

From my personal experience I find cursor to be much more robust because rather than "either / or" its both and can switch depending on the time or the task or whatever the newest model is.

It feels like the same way people often try to avoid "vendor lock in" in software world that Cursor allows freedom for that, but maybe I'm on my own here as I don't see it naturally come up in posts like these as much.

tin7in•1mo ago

Speaking from personal experience and talking to other users - the agents/harnesses of the vendors are just better and they are customized for their own models.

Rperry2174•1mo ago

what kinds of tasks do you find this to be true for? For a while I was using claude code inside of the cursor terminal, but I found it to be basically the same as just using the same claude model in there.

Presumably the harness cant be doing THAT much differently right? Or rather what tasks are responsibilities of the harness could differentiate one harness from another harness

tin7in•1mo ago

This becomes clearer for me with harder problems or long running tasks and sessions. Especially with larger context.

Examples that come to mind are how the context is filled up and how compaction works. Both Codex and Claude Code ship improvements regarding this specific to their own models and I’m not sure how this is reflected in tools like Cursor.

dist-epoch•1mo ago

Github Copilot also allows you to use both models, codex, claude, and gemini on top.

Cursor has this "tool for kids" vibe, it's also more about the past - "tab, tab, enter" low-level coding versus the future - "implement task 21" high level delegating.

oldandboring•1mo ago

I feel you brother/sister. I actually pay for Claude Code Max and also for the $20/mo Cursor plan. I use Claude Code via the VSCode extension running within the Cursor IDE. 95% of my usage is Claude Code via that extension (or through the CLI in certain situations) but it's great having Cursor as a backup. Sometimes I want to have another model check Claude's work, for example.

mergesort•1mo ago

Heya, author here! That's a great question! I fully understand the vendor lock-in concern, but I'll just quickly note that when it comes to a first workshop I do whatever makes the person most comfortable. I let the attendee choose the tool they want — with a slight nudge towards Codex or Claude Code for reasons I'll mention below. But if they want to do the workshop in Cursor, VS Code, or heck MS Paint — I'll try to find a way to make it work as long as it means they're learning.

I actually started teaching these workshops by using Cursor, but found that it fell short for a few reasons.

Note: The way that my workshops work is that you have three hours to build something real. It may be scoped down like a single feature or a small app or a high quality prototype, but you'll walk away with what you wanted to build. More importantly you'll have learned the fundamentals of working with AI in the process, so you can continue this on your own and see meaningful results. We go through various exercises to really understand good prompting (since everyone thinks they're good but they rarely are), how to build context for models, and explore the landscape of tools that you can use to get better results. A lot of that time is actually spent in a Google Doc that I've prepped with resources — and the work we do there makes the code practically write itself by the time we're done.

Here's a short list of why I don't default to Cursor:

1. As I noted in another comment, the model performance is just so much better [^1] when accessed directly through Codex and Claude Code, which means more promising results more quickly. Previously the workshops were 3-4 hours just to finish, now it's a solid 3 with time to ask questions afterwards. You can't beat this experience, because it gives the student more time to pause and ask questions, seep in what they've done, and not spend time trying to understand the tools just to see results. 1a. The amount of time it took someone to set up Cursor was pretty long. The process for getting a good set up is pretty long — especially for someone non-technical. This may not be as big of a deal for developers using Cursor — but even they don't know a lot of the settings and tweaks to make to get Cursor to be great out the box.

2. The user experience of dropping a prompt into Codex/Claude Code and watch it start solving a problem is pretty amazing. I love GUIs — I spend my days building one [^3], but the TUI melting away everything to just being chat is an advantage when you have no mental model for how this stuff works.

3. As I said in #1, the results are just better. That's really the main reason! I

Not to toot my own horn, but the process works. These are all testimonials in the words of people who have attended a workshop, and I'm very proud of how people not only learn during the workshop but how it sets them off on a good path afterwards. [^2]. I have people messaging me 24 hours later telling me that they built an app their partner has wanted for years, to tell me that they've completed the app we started and it does everything they dreamed of, and hear more process over the weeks and months after because I urge them to keep sending me their AI wins. (It's truly amazing how much they grow, and I now have attendees teaching ME things — the ultimate dream of being a teacher knowing you gave them the nudge they needed.)

Hope that helps and isn't too much of an ad — I really just want to make it clear that I try to do what works best and if the best way to help people learn changes I will gladly change how I work. :)

[^1] https://news.ycombinator.com/item?id=46393001 [^2]: https://build.ms/ai#testimonials [^3]: https://plinky.app

robbiep•1mo ago

I got a student subscription to cursor and after giving it a good 6 hours I’ve abandoned it.

I extremely dislike the way it goes forth and bolts. I don’t trust these tools enough to just point it in the direction and say go, I like to be a human in the loop. Perhaps the use case I was working on then was difficult (quite old react native library upgrade across a medium sized codebase) but I eventually cracked this on Claude; cursor in both entropic and Gemini left me with an absolute mess.

Even repeatedly asking the prompt to keep me in the loop it kept on just running haywire.

lmeyerov•1mo ago

I've been using Claude code most of the year, and codex since soon after it released:

It's important to separate vibes coding from vibes engineering here. For production coding, I create fairly strict plans -- not details, but sequences, step requirements, and documented updating of the plan as it goes. I can run the same plan in both, and it's clear that codex is poor at instruction following because I see it go off plan most of the time. At the same time it can go on its own pretty far in an undirected way.

The result is when I'm doing serious planned work aimed for production PRs, I have to use Claude. When it's experimental and I don't care about quality but speed and distance, such as for prototyping or debugging, codex is great.

Edit: I don't think codex being poor at instruction following is inherent, just where they are today

ChicagoDave•1mo ago

Spec dev can certainly be effective, but having used Claude Code since its release, I’ve found the pattern of continuous refactoring of design and code produces amazing results.

And I’ll never use OpenAI dev tools because the company insists on a complete absence of ethical standards.

songodongo•1mo ago

Anthropic is partnered with Palantir though…

ChicagoDave•1mo ago

Yeah. It’s a thin line, but I still think Anthropic is _trying_ to thread the needle while Altman is trying to get the government to give content theft cover.

cube2222•1mo ago

I’ve checked out codex after the glowing reviews here around September / October and it was, all in all, a letdown (this was writing greenfield modules in a larger existing codebase).

Codex was very context efficient, but also slow (though I used the highest thinking effort), and didn’t adapt do the wider codebase almost at all (even if I pointed it at the files to reference / get inspired by). Lots of defensive programming, hacky implementations, not adapting to the codebase style and patterns.

With Claude Code and starting each conversation by referencing a couple existing files, I am able to get it to write code mostly like I would’ve written it. It adapts to existing patterns, adjusts to the code style, etc. I can steer it very well.

And now with the new cheaper faster Opus it’s also quite an improvement. If you kick off sonnet with a long list of constraints (e.g. 20) it would often ignore many. Opus is much better at “keeping more in mind” while writing the code.

Note: yes, I do also have an agent.md / claude.md. But I also heavily rely on warming the context up with some context dumping at conversation starts.

throwaway12345t•1mo ago

All codex conversations need to be caveat with the model because it varies significantly. Codex requires very little tweaking but you do need to select the highest thinking model if you’re writing code and recommend the highest thinking NON-code model for planning. That’s really it, it takes task time up to 5-20m but it’s usually great.

Then I ask Opus to take a pass and clean up to match codebase specs and it’s usually sufficient. Most of what I do now is detailed briefs for Codex, which is…fine.

IgorPartola•1mo ago

Why non-thinking model? Also 5-20 minutes?! I guess I don’t know what kind of code you are writing but for my web app backends/frontends planning takes like 2-5 minutes tops with Sonnet and I have yet to feel the need to even try Opus.

thejazzman•1mo ago

In my experience sonnet > opus, so it’s not surprise you don’t “need” opus. They charge a premium on sonnet now instead

throwaway12345t•1mo ago

I probably write overly detailed starting prompts but it means I get pretty aligned results. It does take longer but I try to think through the implementation first before the planning starts.

dworks•1mo ago

I will jump between a ChatGPT window and a VSCode window with the Codex plugin. I'll create an initial prompt in ChatGPT, which will ask the coding agent to audit the current implementation, then draft an implementation plan. The plan bounces between Chat and Codex about 5 times, with Chat telling Codex how to improve. Then Codex implements, creates an implementation summary, which I give to Chat. Chat then asks to add a couple of things fixes, then it's done.

pshirshov•1mo ago

On hard projects (really hard, like https://github.com/7mind/jopa), Codex fails spectacularly. The only competition is Claude vs Gemini 3 Pro.

funnyfoobar•1mo ago

The process you have described for Codex is scary to me personally.

it takes only one extra line of code in my world(finance) to have catastrophic consequences.

even though i am using these tools like claude/cursor, i make sure to review every small bit it generated to a level, where i ask it create a plan with steps, and then perform each step, ask me for feedback, only when i give approval/feedback, it either proceeds for the next step or iterate on previous step, and on top of that i manually test everything I send for PR.

because there is no value in just sending a PR vs sending a verified/tested PR

with that said, I am not sure how much of your code is getting checked in without supervision, as it's very difficult for people to review weeks worth of work at a time.

just my 2 cents

mergesort•1mo ago

Heya, I’m the author of the post! To be clear I have AI write probably 95% of my code these days, but I review every line of code that AI writes to make sure it meets my high standards. The same rules I’ve always had still apply — to quote @simonw “your job is to deliver code you have proven to work”.

So while I’m enthusiastic about AI writing my code in the literal sense, it’s still my code to understand and maintain. If I can’t do that then I work with AI to understand what was written — and if I can’t then I’ll often give it another go with another approach altogether so I can generate something I can understand. (Most of the time working together to understand the code works better, because I love to learn and am always open to pushing my boundaries to grow — and this process can tuned well to self-directed learning.)

And to quote a recent audit: “this is probably one of the cleanest codebases I’ve ever audited.” I say that emphasize the fact that I care a lot about the code that goes into my codebase, and I’m not interested in building layers of unchecked AI slop for code that goes into my apps.

funnyfoobar•1mo ago

Thanks for the clarification.

personally it would be too difficult for me to understand large chunks of work, like in your case "a week's worth of code" at a time. just wondering how do you go about it?

second, how do you pass such large PR's to your co-workers?(if you have any)

mergesort•1mo ago

So I will state upfront that my current experience is not the most common team dynamic because I'm an indie developer [^1]. But I've worked at many companies — as small as 2 and as large as Twitter — so I am very familiar with the variety of engineering processes.

I can share how I work with agentic systems, because I (and now others) have found it to be very effective. I still have the engineering-like experience of thinking deeply — I've gotten great results across codebases small and large — and almost everyone who I've run a workshop with has come back to me and said that this was a missing piece for them when they work with agentic systems.

I'm the kind of person I alluded to at the end of my blog post when I wrote "Some people couldn’t start coding until they had a checklist of everything they needed to do to solve a problem.", so this description will be representative of that.

1. I start a document in Craft [^2] whenever I think of a great feature, and keep adding to that doc over the next few months whenever I have a new idea. I try to turn that document into something cohesive — imagine something like a PRD without the formality.

2. Then when it comes time to build the feature, I will just sit and write out a prompt (with lots of pointers to source code and relevant screenshots) that considers everything that needs to be built. I'll write out our goals for the feature, how the client should work, how the server should behave, the expected user experience, and anything else that's relevant. That process is really clarifying because it unearths a whole bunch of meaningful context — and context is exactly what a large language model needs!

3. Last but not least I'll simply add something like "Please ask any clarifying questions you may have, or for any additional details that you may find helpful". That leads to questions which I spend anywhere from another 5 to 30 minutes on, which fills in the gaps that I hadn't even considered to consider. And sure that may take time, but now the model has *so many useful details* that most people never add to their context window.

4. Once you have that, the model can act much more surgically than the experience most people have with agentic systems. Since it's so surgical I can go do something else like work on my newsletter, my AI workshops, or even go for a walk. This is why I much prefer to work this way, as opposed to the hands-on process I described Claude Code users [often] preferring in the blog post. (Which as I mentioned there is perfectly fine, just not my cup of tea anymore.)

---

I'd still like to touch on working with people though. I do quite a bit of open source work and there I still follow what people would consider standard processes and best practices. If I'm doing a week's worth of work I still don't want to dump a whole ton of code in one commit, so I'll break everything down into very atomic commits that spell out exactly what I'm doing. I also write lots of documentation, update references, and add tests like a person should.

But there's also nothing to say you have to generate a week's worth of code in one go. It's important to remember that you're in control of how you work. It may be more fitting to define smaller tasks (which will take less time for each independent step) and work on them serially, which you can then hand off to your coworkers one by one.

Ultimately my message is that people still need to exercise their best judgment and think for themselves. AI doesn't change what we've come to accept as best practices, it automates and accelerates them. In fact, the models keep getting better the more they are trained on our best practices, so my assertion is that success using AI seems to correlate well with autonomy, creativity, and critical thinking skills.

Anyhow, long answer for a short question — but I hope it helps! And if there's anything unclear: please ask any clarifying questions you may have, or for any additional details that you may find helpful.

[^1]: https://plinky.app [^2]: https://craft.do

cherryteastain•1mo ago

I think the author glosses over the real reason why tons of people use Codex over CC: limits. If you want to use CC properly you must use Opus 4.5 which is not even included in the Claude Pro plan. Meanwhile you can use Codex with gpt-5.2-codex on the ChatGPT Plus plan for some seriously long sessions.

Looks like Gemini plans have even more generous limits on the equivalently priced plans (Google AI Pro). I'd be interested in the experiences of people who used Google Antigravity/Gemini CLI/Gemini Code Assist for nontrivial tasks.

Tiberium•1mo ago

A small correction: Opus 4.5 is included in the Pro plan nowadays, but yeah, the usage limits for it on the $20 sub are really, really low.

sourcecodeplz•1mo ago

Opus IS included in Pro plan.

throwawaybla73•1mo ago

Opus 4.5 is included in the Pro plan.

cherryteastain•1mo ago

Thanks for the correction, looks like I misremembered. But limits are low enough with Sonnet that, I imagine you can barely do anything serious with Opus on the Pro plan.

Maxious•1mo ago

Both Claude Pro and Google Antigravity free tier have Opus 4.5

oldandboring•1mo ago

Personally I bit the bullet and went with the Max plan for Claude Code. After tax it costs me ($108) less than I earn from one billable hour. I have been punishing it for the last two months, it defaults to Opus 4.5 and while I occasionally hit my session limit (it resets after an hour or so), I can't even scratch the surface of my monthly usage limit.

victorbjorklund•1mo ago

And it has gotten very bad. I almost never hit the limit on the pro plan before with CC but now it happens very fast.

mergesort•1mo ago

Heya, author here! I do agree with you that this is a big downside, but I don’t know if this is the primary reason.

In my experience teaching people, most people don’t actually know much at the time they make this decision. They’ve heard about Cursor, they’ve heard of Claude Code, and they may have heard about Codex. But what they’ve heard is anecdotes and marketing — they don’t yet have hands-on experience.

They make a big choice and then assume that this is how all AI works, because they don’t have a full breadth of context yet. And that’s to be expected! That’s how most things work.

That is why I teach the workshops I do to make AI accessible, so people can walk through the tradeoffs and make the best educated choices for them.

A couple of comments here have said that the post is subtly pro-Codex, but I tried to make my point very explicit: people should try a lot of things and see what works best for them. But it’s very hard to do that without investing a lot of time because the market is so nascent and moving so fast. This post exists to try and nudge people into exploring more of the tools they haven’t tried yet, so they can make their own informed decisions like you have. :)

All that’s to say, people definitely hit limits with Claude Code (as I have done myself) — especially if they’re hesitant to upgrade to Claude Max because they haven’t gotten enough out of Claude Pro. But I think the real reason people make the choices they do starts earlier in the process, even before they get a lot of hands on experience with Claude Code or Codex.

btbuildem•1mo ago

I don't think the comparison to programming languages holds, maybe very tenuously at best. Coding assistants evolve constantly, you can't even be talking about "Codex" without specifying the time range (ie, Codex 2025-10) because it's different from quarter to quarter. Same with CC.

I believe this is the main source of disagreement / disappointment when people read opinions / reviews, then proceed to have an experience very different from expected.

Ironically, this constant improvement/evolution erodes product loyalty -- personally, I'm a creature of habit and will stay with a tool past its expiry date; with coding assistants / sota llms, I cancel and switch subscriptions all the time.

mergesort•1mo ago

Heya, author of the post here! I think you're right in everything you've said, but I want to note that the programming language comparison was meant to be metaphorical more than literal. Everything is changing so fast (as I mention in the post a few times), but I have seen some (far from all) people get locked into Claude Code or Codex in a way where they won't even consider alternatives the same way people they chose Ruby to start their career and now identify as Ruby developers.

My goal was to open people's minds just a little bit by saying exactly what you're getting at — everything is moving fast and we should be reassessing often. A meaningful difference is that you can start a codebase with Claude Code and then switch to Codex with almost no friction, while you can't just migrate a TypeScript app to Python in 15 minutes.

All that's to say, we agree!

veidr•1mo ago

I tried so hard to make Codex work, after the glowing reviews (not just from Internet randos/potential-shills, though; people I know well, also).

It's objectively worse for me on every possible axis than Claude Code. I even wondered if maybe I was on some kind of shadow-ban nerf-list for making fun of Sam Altman's WWDC outfit in a tweet 20 years ago. (^_^)

I don't love Claude's over-exuberant personality, and prefer Codex's terse (arguably sullen) responses.

But they both fuck up often (as they all do), and unlike Claude Code (Opus, always), Codex has been net-negative for me. I'm not speed-sensitive, I round-robin among a bunch of sessions, so I use the max thinking option at all times, but Codex 5.1 and 5.2 for me are just worse code, and worse than that, worse at code review to the point that it negated whatever gains I had gotten from it.

While all of them miss a ton of stuff (of course), and LLM code review just really isn't good unless the PR is tiny — Claude just misses stuff (fine; expected), while Codex comes up with plausible edge-case database query concurrency bugs that I have to look at, and squint at, and then think hmm fuck and manually google with kagi.com for 30 minutes (LIKE AN ANIMAL) only to conclude yeah, not true, you're hallucinating bud, to which Codex is just like. "Noted; you are correct. If you want, I can add a comment to that effect, to avoid confusion in future."

So for me, head-to-head, Claude murders Codex — and yet I know that isn't true for everybody, so it's weird.

What I do like Codex for is reviewing Claude's work (and of course I have all of them review my own work, why not?). Even there, though, Codex sometimes flags nonexistent bugs in Claude's code — less annoying, though, since I just let them duke it out, writing tests that prove it one way or the other, and don't have to manually get involved.

oldandboring•1mo ago

I must be doing something wrong. When I last tried to use Codex 5.2 (via Cursor), no amount of prompting could get it to stop aggressively asking me for permission to do things. This seems to be the opposite of the article's claim, which is that Codex is better for long-running, hands off tasks.

mergesort•1mo ago

Heya, I'm the author of the post! This was probably unintentional but I think you're making a really valuable observation that will be helpful to others.

The models Cursor provides to use in their product are intermediated versions of models that companies like OpenAI and Anthropic offer. They are technically using Codex, but not in the way that they would be if you were in a tool like Codex (CLI) or Claude Code.

If you ask Cursor to solve a tough problem, Cursor will break down the problem into a different problem before sending that request to OpenAI so they can use Codex. They do this because: 1. To save money. By restructuring the prompt they can use less tokens, saving them money for running Cursor since they are the ones paying for the tokens with your subscription cost. 2. [Based on things the Cursor team has said] They believe they can construct a better intermediate prompt that is more representative of the problem you want to solve.

This extra level of abstraction means that you are not getting the best results when you use a tool like Cursor. OpenAI and Anthropic are running their harnesses Codex CLI and Claude Code at a loss (because VC), but providing better results. This is not the best way to make money, but it's a great way to build mindshare and hopefully get customers for life. (People are fickle and cheap though so I doubt this is a customers for life strategy the way people buy the same brand of deodorant once they start buying Dove.)

Happy to answer any questions you may have, but mostly I would highly suggest trying out Codex CLI and Claude Code to get a better feel for what I'm saying — and to also to get more out of your AI tools. :)

mulmboy•1mo ago

Is it just me or is codex slow?

With claude code I'll ask it to read a couple of files and do x similar to existing thing y. It takes a few moments to read files and then just does it. All done in a minute or so.

I tried something similar with codex and it took 20 minutes reading around bits of file and this and that. I didn't bother letting it finish. Is this normal? Do I have something misconfigured? This was a couple of months ago.

MuffinFlavored•1mo ago

> You’ve got your CLAUDE.md, Skills, Agents, MCP, slash commands, and so much more.

How many people use none of this?

mergesort•1mo ago

Heya, author of the post here. That's a good call out because it's probably a lot!

And now that you mention it, that's also one failure case for why some people look at AI and go "this just isn't very good at coding". I'm not saying it has to be that way nor will it be that way forever, but there are absolutely a lot of people who just download Claude Code or Cursor or Codex and dive right in without any additional set up.

That's partially why I suggest people use Codex for the workshops I offer, because it provides the best results with no set up. All of these tools have a nearly unending amount of progressive disclosure because there's so much invisible configuration and best practices are changing so fast. I'm still trying not to imply that one tool is "better" than another (even if I have my preference), but more so hit on the fact that which AI tools people like is mostly about your preferred set of tradeoffs.

bdangubic•1mo ago

> That's partially why I suggest people use Codex for the workshops I offer, because it provides the best results with no set up.

I would do the exact opposite… If we are pitching “this shit works magically without any setup” people will expect magic and they absolutely will not get magic as there is no magic. I believe, especially if we are educators (you obviously are!!) that it is our responsibility to teach it “right” - my workshop would probably spend at least 75% of the time on the setup

mergesort•1mo ago

So I definitely understand where you're coming from, but let me provide a little bit of context.

The workshops are 3-4 hours and we do spend a lot of time discussing how things work in reality vs. how they work in the context of the workshop. It's worth noting that these workshops span the gamut of non-technical people in sales to seasoned developers, so a lot of people simply won't learn much (or have the excitement to learn on their own) if we spend the first 2-3 hours setting things up.

In my experience the heaviest lift for teaching practically any technical subject is getting someone interested by showing them how to accomplish something they care about, and then leaving them with lots of information and resources so they can continue experimenting and growing even once we're done. The way I do that is to make sure they leave the workshop having built their own idea — without taking shortcuts!

Being able to use Codex to accomplish something because you spent an hour crafting a good prompt isn't cheating, it's learning the skill of becoming a better technical communicator — in a short period of time — and taking advantage of the skill you've just learned. I don't consider that magic, it's actually the core tenant of building with AI, and is very much how I work with AI every day.

I'm late for dinner so I should probably stop here, so I'll leave just one final note. After every workshop I send each student a list of personalized resources that will help them continue on their journey by demystifying things that we may have glossed over or weren't clear in the workshop — so they should be armed with the tools to take their next steps away from any magical thinking.

It's a bit hard to boil down exactly what I do and how I try to design for best hands-on pedagogical practices in an HN post I'm writing on the go — but I am absolutely open to your thoughts! :)

bdangubic•1mo ago

you are doing God’s work! great stuff!

mergesort•1mo ago

Thank you so much! That’s very kind of you to say. :)

MuffinFlavored•1mo ago

What if I said "I don't use any of those and I think AI is very good at coding".

I'd be more interested in an article from you on how to go from "I use Claude Code out of the box" as a baseline and then how each extra layer (CLAUDE.md, skills, agents, MCP, etc.) improve its effectiveness.

mergesort•1mo ago

That does sound like a good article! Sadly I've never written it and probably won't because I think a lot of that stuff doesn't provide as much value as people assume — which is why my personal conclusion in the post is to just lean into Codex.

This isn't a value judgment, it's just a question of where my priorities and tradeoffs lie. That said, I think Skills are the killer feature because they are a very composable tool — which I'll get to in a bit.

- Your CLAUDE.md should be a good high-level description with relevant details that you add to over time. Think of it as describing the lay of the land the way you would to a new coworker, including the little warts they need to know about before they waste hours on some known but confusing behavior.

- MCP has it's purposes, but it's not really a great tool for software development. It's best served for interfacing with a remote service (because it provides a discovery layer to LLMs on top of an API), but if you use them the way developers are told to, you're almost always better off using an equivalent CLI.

- I'll skip over agents, because an agent is basically a skill + a separate context window, and the main selling point is the context window bit. I think over time we'll see a separation of concerns where you can just spawn a skill with a context window and everyone will forget about the idea of agents in your codebase.

So now Skills. I wrote a well-received post [^1] a few months ago about Claude Skills, and why I think they are probably the most important of these tools. A skill is basically a plain-text description of an application.

The app can be something like I describe where Claude Code converts a YouTube video to an mp3 based on natural language, or you can have a code review skill, a linter skill, a security reviewer skill, and so on. This is what I meant when I said skills are composable.

You can imagine a team having lots of skills in their repo. One may guide an agentic system to build iOS projects well (away from an LLM's bad defaults when building with Xcode), skills that are very contextually relevant to the team, or even skills that enforce copy in your app to conform to your marketing team's language.

Skills are just markdown so they're very portable — and now available in Codex and many other places.[^2] (I had been using OpenSkills to great effect since the way Skills work is just through prompts). I now have a bunch of skills that do lots of things, for coding, marketing, data analysis, fact checking, copy-editing, and more. As a nice benefit they run in Claude — not just Claude Code. If you have ideas for processes you need to improve, I would invest my time and energy into building up Skills more than anything else.

[^1]: https://build.ms/2025/10/17/your-first-claude-skill [^2]: https://agentskills.io/specification

MuffinFlavored•1mo ago

What % of effectiveness would you say is gained from these because... I am a pretty regular user of Claude Code in VS Code with no special goodies and I routinely hit "compacting" after 5-6 prompts in a single session. Then I need to validate it didn't slop all over the place. I can't imagine having 3-4 agents in the background (extra things to check) being a net-positive.

Skills and slash commands for sure but... I don't see them as necessary? "Review this code for ____ XYZ" as a skill. To me it's just something you could do as a prompt in your session?

mergesort•1mo ago

You raise a couple of really good questions!

1. I find Claude Code's handling of the context window to be pretty poor, and one of the reasons why I use it for smaller things versus multi-hour coding sessions. I'm not sure what dark magic OpenAI has done to make their context window feel infinite, but Codex has become a better choice for that at the moment.

2. A small note on subagents but Claude Code did this right. Subagents are granted their own context window, so they don't spill over into your context window until they're done doing their own work — and the added context is relatively minimal. I'd love to see OpenAI adopt this pattern as well, especially in combination with something like Skills rather than leaning into MCP.

3. When I suggested adding skills, I mean ones that are far more complicated than your example, and can drive a chunk of work autonomously. The skill I use for writing in-app copy (which I'm bad at because you can see I'm never short for words) is about 100 lines long. It includes my style guide as an accessible resource, and a mostly complete history of my Bluesky posts to help achieve the authentic tone I when discussing Plinky. (I write all of my posts, so this really is my voice.)

These kinds of skills save me a lot of time as an indie developer! As I mentioned I have ones for data insights, fact-checking, and of course for code. My main suggestion would be to think through every step of your work and see if they can be automated, and then turn small pieces of that into skills.

—--

It's hard to assign a specific percentage to how much my effectiveness has improved, but it's a lot. The reason I don't want to put a number on it is that what I've gotten is a far broader set of skills (no pun intended) that allows me to execute in parallel. The metaphor I'd use to describe all this is to say that I'm no longer single-threaded.

I am a big believer that right now models work best for people who are effectively running small businesses — or teams that operate lean. The work of 10 can be done by 4-5 motivated and well-armed people, or an indie like me can do every facet of the work involved and do it well. I sit down and focus on explaining the big picture with great detail, and then set things off so I can do every part of the work involved in a round-robin style.

While an engineering task is going I'm off writing my newsletter with my words but with a skill that does meaningful research for me. While I'm running some research I'm in Figma working on social media assets. While I'm doing code review for my app's code I've got the server side building in the background.

Last week I had Codex finding a domain for me, with specific requirements. (Here's a simplified version of the prompt.)

> I need a domain to represent this concept [+ 200 words], based on the code in this repository. [Code included so Codex really knows what the heck I'm building and talking about.] Don't show me any domains over $50/year at this registrar. Make sure it's a real word with no fun typos like tumblr.com is short for tumbler, and no compound words like "thisisfun.com". You can start with this list of tlds, but if you think there are any other ones that could be a good match then you can make a suggestion.

And after about 10 messages back and forth Codex found something that would have taken me far longer to research on my own — in parallel.

This all means that I'm able to write code, do marketing, design, support (which is always me and not AI), and run my business. If I plan well what I get is an extra set of hands to hand things off to, and most of the time (honestly) it does the work perfectly. But even for the times it doesn't, if it gets me 80-90% of the way there, that's a huge head start over where I would have been previously.

So the reason that I'm hesitant to answer this with a specific percentage is that your experience across organizations will vary. But I've seen in my work (solo engineering work, teaching, and consulting) is that the gains are pretty prosperous. That's true for roles where you're singularly focused on writing code — but the key is to lean into the strengths of this system and be creative about how you use it.

As I said — incapable of keeping my writing short so I hope that helps!

josefrichter•1mo ago

I think a big part is which model seems to work better with your language/stack. My language is Elixir, which is somewhat niche, and only Claude has been able to produce usable Elixir code so far. None of the other things mentioned in the article mattered, because of this. I wonder if others have this experience that some models just struggle with some languages/stacks?

mergesort•1mo ago

Hey Jose, author here! That's a great call out. I write predominantly in Swift and for a long time Claude was the only usable option. But sometime around GPT-5 OpenAI's models got much better at Swift, so the choice started becoming more about aesthetics (as a descriptor of preferences). So you're right — if the model can't write coherent code then it doesn't matter what kind of flow you feel as you're working with the tools — but I do imagine this will continue to improve for all languages including Elixir.

CharlesW•1mo ago

> I write predominantly in Swift…

If it's okay to mention my own project, I'd appreciate it if you could check out https://charleswiltgen.github.io/Axiom/ (open source) and let me know what you think. It's focused on modern Swift, with specialized skills for helping developers get to strict Swift 6 concurrency.

mergesort•1mo ago

I've only skimmed it since I'm between Christmas and a longer vacation that starts in 24 hours, but this actually looks really neat! I'll definitely take a closer look to at these skills in depth — but this is exactly the kind of thing I've been telling people to take the time to invest in for their agentic environments. :)

BugsJustFindMe•1mo ago

I switched from OpenAI models to using Anthropic ones some time ago, and every once in a while I briefly check in again just in case. I continue to be amazed at how infuriating OpenAI's agents are. They do things I never asked for, make decisions I didn't ask them to make, impute assumptions I never suggested, and just generally rapidly piss me off. I find it maddening, and then I immediately switch back to using Claude and it's like an immediate wave of relief washes over me because Claude just always seems to follow along with me and do what I actually want.

onehair•1mo ago

Article started really strong, and I was about to believe. Days and weeks worth of code in a matter of minutes? I'm ready.

Then author describes a website theme toggle feature.

Get out of here!!!!

mergesort•1mo ago

Hey there, post author here. I do actually generate days worth of code in minutes and weeks worth of code in hours (not minutes) — but didn't really cover them because this post was more conceptual than specifically covering a tactic or technique as other posts do.

But in case you're interested a talk that I gave in Spain this year just went live yesterday, and discusses not only real world use cases but also discusses a lot of the fundamentals of how these AI systems work to make that possible.

https://www.youtube.com/playlist?list=PLztE34GS_piKKQ6y1dkku...

mergesort•1mo ago

Hey there, article author here! I spent a lot of time writing up a comment that talks through my process so I'll just lazy link it here if that's ok. [^1]

But the truth is I do build real production features all the time! That just wasn't the focus of this article. :)

[^1]: https://news.ycombinator.com/item?id=46399123

U.S. Jobs Disappear at Fastest January Pace Since Great Recession

Bithumb mistakenly hands out $195M in Bitcoin to users in 'Random Box' giveaway

Beyond Agentic Coding

OpenClaw ClawHub Broken Windows Theory – If basic sorting isn't working what is?

OpenBSD Copyright Policy

OpenClaw Creator: Why 80% of Apps Will Disappear

What Happens When Technical Debt Vanishes?

AI Is Finally Eating Software's Total Market: Here's What's Next

Computer Science from the Bottom Up

Show HN: I built a toy compiler as a young dev

You don't need Mac mini to run OpenClaw

Learning to Reason in 13 Parameters

Convergent Discovery of Critical Phenomena Mathematics Across Disciplines

Ask HN: Will GPU and RAM prices ever go down?

From hunger to luxury: The story behind the most expensive rice (2025)

Substack makes money from hosting Nazi newsletters

A New Crypto Winter Is Here and Even the Biggest Bulls Aren't Certain Why

Moltbook was peak AI theater

Why Claude Cowork is a math problem Indian IT can't solve

Show HN: Built an space travel calculator with vanilla JavaScript v2

Why a 175-Year-Old Glassmaker Is Suddenly an AI Superstar

Micro-Front Ends in 2026: Architecture Win or Enterprise Tax?

These White-Collar Workers Actually Made the Switch to a Trade

The Wonder Drug That's Plaguing Sports

Show HN: Which chef knife steels are good? Data from 540 Reddit tread

Federated Credential Management (FedCM)

Token-to-Credit Conversion: Avoiding Floating-Point Errors in AI Billing Systems

The Story of Heroku (2022)

Obey the Testing Goat

Claude Opus 4.6 extends LLM pareto frontier

U.S. Jobs Disappear at Fastest January Pace Since Great Recession

Bithumb mistakenly hands out $195M in Bitcoin to users in 'Random Box' giveaway

Beyond Agentic Coding

OpenClaw ClawHub Broken Windows Theory – If basic sorting isn't working what is?

OpenBSD Copyright Policy

OpenClaw Creator: Why 80% of Apps Will Disappear

What Happens When Technical Debt Vanishes?

AI Is Finally Eating Software's Total Market: Here's What's Next

Computer Science from the Bottom Up

Show HN: I built a toy compiler as a young dev

You don't need Mac mini to run OpenClaw

Learning to Reason in 13 Parameters

Convergent Discovery of Critical Phenomena Mathematics Across Disciplines

Ask HN: Will GPU and RAM prices ever go down?

From hunger to luxury: The story behind the most expensive rice (2025)

Substack makes money from hosting Nazi newsletters

A New Crypto Winter Is Here and Even the Biggest Bulls Aren't Certain Why

Moltbook was peak AI theater

Why Claude Cowork is a math problem Indian IT can't solve

Show HN: Built an space travel calculator with vanilla JavaScript v2

Why a 175-Year-Old Glassmaker Is Suddenly an AI Superstar

Micro-Front Ends in 2026: Architecture Win or Enterprise Tax?

These White-Collar Workers Actually Made the Switch to a Trade

The Wonder Drug That's Plaguing Sports

Show HN: Which chef knife steels are good? Data from 540 Reddit tread

Federated Credential Management (FedCM)

Token-to-Credit Conversion: Avoiding Floating-Point Errors in AI Billing Systems

The Story of Heroku (2022)

Obey the Testing Goat

Claude Opus 4.6 extends LLM pareto frontier

Codex vs. Claude Code (today)

Comments