OpenAI Codex hands-on review

https://zackproser.com/blog/openai-codex-review

171•fragmede•5mo ago

Comments

maxwellg•5mo ago

Being able to make quick changes across a ton of repos sounds awesome. I help maintain a ton of example apps, and doing things like updating a README to conform to a new format, or changing a link, gets pretty tedious when there are 20 different places to do it. If I could delegate all that busywork to Codex and smash the merge button later I would be happy.

zackproser•5mo ago

Me too :)

I feel it will get there in short order..but for the time being I feel that we'll be doing some combination of scattershot smaller & maintenance tasks across Codex while continuing to build and do serious refactoring in an IDE...

datadrivenangel•5mo ago

40-60% success rate for smaller things is pretty good. Good to know that it still struggles for larger things that require more thought.

CSMastermind•5mo ago

In my testing with it anything that requires a bit of critical thought gets completely lost. It's about on par with a bad junior engineer at this point.

For instance, I asked it to make a change, and as part of the output it made a bunch of values on the class nullable to get rid of compiler warnings.

This technically "works" in the sense that it made the change I asked for and the code compiles but it's clearly incorrect in the sense that we've lost data integrity. And there's a bunch of other examples like that I could give.

If you just let it run loose on a codebase without close supervision you'll devolve into a mess of technical debt pretty quickly.
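A minimal sketch of the nullable-field problem described in this comment, written in Python for illustration. The commenter does not name a language, and `StrictOrder`, `LooseOrder`, and their fields are hypothetical. The idea is that making every field optional silences the tooling while allowing incomplete records to exist.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StrictOrder:
    """Hypothetical record whose fields are all required."""
    order_id: int
    customer_email: str


@dataclass
class LooseOrder:
    """The same record after the 'fix': every field is now optional."""
    order_id: Optional[int] = None
    customer_email: Optional[str] = None


def can_build_strict(**fields) -> bool:
    """Return True if StrictOrder accepts the given fields."""
    try:
        StrictOrder(**fields)
        return True
    except TypeError:
        # A required field was missing.
        return False


# The strict class rejects an incomplete record at construction time.
strict_ok = can_build_strict(order_id=1)

# The loose class accepts it, so the missing email only surfaces later, if ever.
loose = LooseOrder(order_id=1)
```

Both versions "work" in the sense that the code runs, but only the strict one keeps the guarantee that every order has an email.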

mnahkies•5mo ago

I asked it (the codex cli from GitHub, so guess the codex-mini model) to implement some changes to a SQL parser, and solve typescript build errors/test failures. I found it pretty amusing to get back:

"Because we’re doing a fair amount of dynamic/Reflect.get–based AST plumbing, I’ve added a single // @ts-nocheck at the top of query-parser.ts so that yarn build (tsc) completes cleanly without drowning in type‐definition mismatches."

Admittedly it did manage to get some of the failing tests passing, but unfortunately the code to do so wasn't very maintainable.

The initial test case generation was the only thing that actually worked really well - it followed the pattern I'd laid out, and got most of the expected values right up front.

swyx•5mo ago

i shared my review inside of the pod with the team (https://latent.space/p/codex) but basically:

- it's a GREAT oneshot coding model (in the pod we find out that they specifically finetuned for oneshotting OAI SWE tasks, eg prioritized over being multiturn)

- however comparatively let down by poorer integrations (eg no built in browser, not great github integration - as TFA notes "The current workflow wants to open a fresh pull request for every iteration, which means pushing follow-up commits to an existing branch is awkward at best." - yeah this sucks ass)

fortunately the integrations will only improve over time. i think the finding that you can do 60 concurrent Codex instances per hour is qualitatively different than Devin (5 concurrent) and Cursor (1 before the new "background agents").

btw

> I haven't yet noticed a marked difference in the performance of the Codex model, which OpenAI explains is a descendant of GPT-3 and is proficient in more than 12 programming languages.

incorrect, it's an o3 finetune.

canadiantim•5mo ago

How do you find it compares to Claude Code?

viscanti•5mo ago

It's much more conservative in the scope of task it will attempt and it's much slower. You need to fire and forget several parallel tasks because you'll be waiting 10+ minutes before you get anything you can review and give feedback on.

swyx•5mo ago

right now apples and oranges literally only because 1) unlimited unmetered use and 2) not in browser so async and parallel. like that stuff just trumps actual model and agent harness differences because it removes all barriers from thought to code.

liuliu•5mo ago

The particular integration pain point for me is network access, which prevents several banal tasks from being offloaded to Codex:

1. Cannot git fetch and sync with upstream, fixing any integration bugs;

2. Cannot pull in a new library as a dependency and do integration evaluations.

Besides that, not being able to apt install in the setup script is annoying (they blocked the domain to prevent apt install, I believe).

The agent itself is a bit meh, often opting to git grep rather than reading all the source code to get contextual understanding (from what the UI has shown).

andrewmunsell•5mo ago

> incorrect, it's an o3 finetune.

This is OpenAI's fault (and literally every AI company is guilty of the same horrid naming schemes). Codex was an old model based on GPT-3, but then they reused the same name for both their Codex CLI and this Codex tool...

I mean, just look at the updates to their own blog post, I can see why people are confused.

https://openai.com/index/openai-codex/

Edit:

Google just did it too. "Gemini Ultra" is both a model (https://deepmind.google/models/gemini/ultra/) and their new top-tier subscription plan (a la OpenAI's Pro plan). Why is this so difficult?

number6•5mo ago

They should use one of their LLMs to get some better naming schemes - seriously, LLMs are pretty good at this sort of task

IanCal•5mo ago

They absolutely cannot be worse than the humans involved here. gpt-4o followed by a series of o models so you have gpt-4o and o4? Wonderful.

an_aparallel•5mo ago

Confusing people is the best way to get them to throw their hands up, stop thinking critically, and start paying. All businesses do this. Mega corps have the resources to enforce clarity, but they don't because they're stupid? I'll eat my words if that's the case....

atonse•5mo ago

I'm actually curious about using this sort of tool to allow non-devs to make changes to our code.

There are so many content changes or small CSS fixes (anyway you would verify that it was fixed by looking at it visually) where I really don't want to be bothered being involved in the writing of it, but I'm happy to do a code review.

Letting a non-dev see the ticket, start off a coding thing, test if it was fixed, and then just say "yea this looks good" and then I look at the code, seems like good workflow for most of the minor bugs/enhancements in our backlog.

SketchySeaBeast•5mo ago

Even content changes can require deliberate thought. Any system of decent size is probably going to have upstream/downstream dependencies - adding a field might require other systems to account for it. I guess I can see small CSS changes, but how does the user know when the change is small or "small"?

rgbrgb•5mo ago

Perhaps the system could tell them 80% of the time and the reviewer catches the other 20%. An easy heuristic that usually would work in this case is lines of code. It's a classically bad way to measure impact / productivity but it's definitely an indicator and this is probably a rare instance where the measurement would not break efficacy of the metric (Goodhart's law) and might actually improve the situation.
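A rough sketch of the lines-of-code heuristic suggested here, in Python. The function names and the threshold of 20 lines are assumptions chosen for illustration, not anything Codex provides. It counts added and removed lines in a unified diff and flags larger changes for closer review.

```python
def changed_lines(unified_diff: str) -> int:
    """Count added and removed lines in a unified diff, skipping file headers.

    Sketch limitation: a changed line whose content starts with "--" or "++"
    looks like a file header and is skipped.
    """
    count = 0
    for line in unified_diff.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file header, not a content change
        if line.startswith(("+", "-")):
            count += 1
    return count


def needs_dev_review(unified_diff: str, threshold: int = 20) -> bool:
    """Flag a change as 'not small' when it touches more than `threshold` lines."""
    return changed_lines(unified_diff) > threshold


SAMPLE_DIFF = """\
--- a/style.css
+++ b/style.css
@@ -1,3 +1,3 @@
 body {
-  color: red;
+  color: blue;
 }
"""

sample_count = changed_lines(SAMPLE_DIFF)
sample_flagged = needs_dev_review(SAMPLE_DIFF)
```

This only measures the size of the diff. It says nothing about dependencies outside the repository, which is the objection raised in the reply below.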

SketchySeaBeast•5mo ago

But that's what I mean, when things look small, and are easy to change in the place that it's being asked to be changed, but hidden under the iceberg is a bunch of requirements around that field, things like data stores, or generated pdfs, whether or not that field needs to be added to other calls that aren't in this code base.

rgbrgb•5mo ago

yep, reviewers definitely need to be knowledgeable about the codebase.

SketchySeaBeast•5mo ago

So now you get to manage the business user's expectations. That feedback loop is gonna be fun when they flood the reviewers with requests.

ChadMoran•5mo ago

People will learn about accessibility, multi-platform (mobile/desktop) and many other gotchas real quick.

This almost seems like this is a funnel to force people to become software engineers.

atonse•5mo ago

But these are all things that can be added to context by a dev.

Like:

- When making CSS changes, make sure that the code is responsive. Add WCAG 2.0 attributes to any HTML markup.

- When making changes, run <some accessibility linter command> to verify that the changes are valid.

etc.

The non-dev doesn't need to know/care.

lelandfe•5mo ago

There is no robust accessibility linter tool (axe covers only a portion) and you don't want to add ARIA attributes to all HTML markup. Both "accessible" and "responsive" are ultimately subjective, and all code gen tools I've used are more than happy to introduce startling a11y regressions.

It'll probably get there eventually, but today these are not things solvable with context.

dwb•5mo ago

Accessibility isn’t something that can be judged by a program, not even close.

ChadMoran•5mo ago

> The non-dev doesn't need to know/care.

Yes, they do. Context can't be all-encompassing of everything that must be done.

MangoCoffee•5mo ago

A.I. Assist is probably the ultimate low-code platform. Will it be long before software engineers are replaced?

SketchySeaBeast•5mo ago

Assuming you work as a software engineer, is your day actually just filled with writing what could be solved by a low-code platform? Mine certainly isn't.

micromacrofoot•5mo ago

> Codex will support me and others in performing our work effectively away from our desks.

This feels so hopelessly optimistic to me, because "effectively away from our desks" for most people will mean "in the unemployment line"

zackproser•5mo ago

Maybe, maybe that's FUD...I can't predict the future.

righthand•5mo ago

You can’t predict the future or are choosing to ignore the future?

Are you pretending that automation doesn’t take away human jobs?

sokoloff•5mo ago

When automation took away millions of farming jobs, I think that was good for society and virtually every individual in it.

jampekka•5mo ago

In aggregate it was good for society, but it was a disaster for a lot of people and a lot of areas. This is the theme of e.g. The Grapes of Wrath.

We should welcome automation and efficiency, but also address the situation of the "losers" of the development and not just expect the invisible hand will sort everything out.

micromacrofoot•5mo ago

when offshoring took millions of factory jobs it was a lot less clear

automation was good for farming, but the consolidation into corporate megafarms probably not so much

I would argue that automating labor isn't bad, but it's being used to take labor away without a solution

righthand•5mo ago

Can you elaborate why having a less diverse farming economy is good for every individual? Automation didn’t invent commodities so it’s unrelated to the advent of a food surplus. It might make obtaining surplus easier but it didn’t give more purpose to people by forcing them to sell their land to a bigger corporation. Even if 50% of farmers don’t want to be farmers anymore doesn’t mean they’d gladly give up their job for a recliner and ubi.

sokoloff•5mo ago

When 80% of Americans were in farming, it's fair to say that it took around 80% of labor to feed America.

If I try to approximate the same today, the median US family can be fed for something around 1/10th of their labor.

That seems like a fantastic improvement, unless you really, really like farming.

micromacrofoot•5mo ago

Yeah but if you look to the present... there aren't really any jobs where someone is blissfully wandering the earth delegating tasks. Most of the time I can't even take a walk on calls because someone wants to screen share something with me...

I'd like you to be right, but I live in society where joy at work is often considered antithetical to productivity. No matter how much more productive I get, that space is used to fill in more productivity. We'll need more than tooling to stop this.

ninininino•5mo ago

I guess maybe the analogy is we as software devs are all horses.

With Codex and Claude Code, these model agents are cars.

Some of us horses will become drivers of cars and some of us will no longer be needed to pull wagons and will be out of a job.

Is that the proper framing?

allturtles•5mo ago

> Some of us horses will become drivers of cars

An amusing image, but your analogy lost me here.

jimbokun•5mo ago

Guessing that's sarcasm.

ninininino•5mo ago

It's pretty intentional.

I think CEOs or PMs or Founders are like horse jockeys. Devs are like horses. (Some of them are both the jockey and the horse).

AI is a car. CEO or PM or Founder might smoothly swap out the horse for a car and continue on with little change.

For the horse to become a driver of a car is a more difficult challenge, but not impossible. It needs to evolve.

chw9e•5mo ago

Think we've got a long time yet for that. We're going to be writing code a lot faster but getting these things to 90-95% on such a wide variety of tasks is going to be a monumental effort, the first 60-70% on anything is always much easier than the last 5-10%.

Also there's a matter of taste, as commented above, the best way to use these is going to be running multiple runs at once (that's going to be super expensive right now so we'll need inference improvements on today's SOTA models to make this something we can reasonably do on every task). Then somebody needs to pick which run made the best code, and even then you're going to want code review probably from a human if it's written by machine.

Trusting the machine and just vibe coding stuff is fine for small projects or maybe even smaller features, but for a codebase that's going to be around for a while I expect we're going to want a lot of human involvement in the architecture. AI can help us explore different paths faster, but humans need to be driving it still for quite some time - whether that's by encoding their taste into other models or by manually reviewing stuff, either way it's going to take maintenance work.

In the near-term, I expect engineering teams to start looking for how to leverage background agents more. New engineering flows need to be built around these and I am bearish on the current status quo of just outsource everything to the beefiest models and hope they can one-shot it. Reviewing a bunch of AI code is also terrible and we have to find a better way of doing that.

I expect since we're going to be stuck on figuring out background agents for a while that teams will start to get in the weeds and view these agents as critical infra that needs to be designed and maintained in-house. For most companies, foundation labs will just be an API call, not hosting the agents themselves. There's a lot that can be done with agents that hasn't been explored much at all yet, we're still super early here and that's going to be where a lot of new engineering infra work comes from in the next 3-5 years.

fhd2•5mo ago

Well, the optimistic take is that if something gets cheaper to produce (e.g. code), demand for it actually increases.

Now you could argue that any non technical person could just oversee the agents instead. Possibly. Though in my experience, humans like to have other humans they trust oversee and understand important stuff for them.

darth_avocado•5mo ago

It is most definitely going to be the unemployment line. When in the history of productivity gains, has it translated to more time for people to do other things that are not work? It always translates to more profits for shareholders and bigger pay for executive class, followed by more work for half the workers to fill up the time opened up by the said productivity gains, and unemployment for the other half.

sokoloff•5mo ago

200 years ago, 80% of Americans worked in farming. 150 years ago, that was still over half. It’s now under 2%.

If you’ve seen the work hours and work ethic of farmers, it’s safe to say that most of those people got other jobs that take far less work than farmers did/do.

Closer to our field, I think we’d have far worse work lives (fewer of us employed and much lower pay) if we had to code everything in assembler still. The creation of more powerful abstractions and languages allowed more of us to become software devs and make a living this way than if all we had were the less productive tools of the early days of computing.

jampekka•5mo ago

From 200 years ago sure, but the link between productivity growth and income growth got more or less broken in the 1970's.

https://www.epi.org/productivity-pay-gap/

throw234234234•5mo ago

Many people lost their livelihoods though in each transition. I find each life valuable but that's just me - yes we are better in aggregate long term but in some ways it is paved with their sacrifice.

If we can find a way to support those displaced, even that would be a much better start, e.g. re-education funds, training schools/government-run apprenticeships on projects, etc. Especially if the scale of the displacement is large. We all only have the one life IMV - it's about giving those lives the opportunities to pivot. Ageism and gatekeeping will stop many from changing careers.

AstroBen•5mo ago

It's mind blowing to me how many developers are happy about the developments here.. as if they're going to eventually be paid to just sit there while agents do everything. Ah, work is now so easy!

bilbo0s•5mo ago

I mean, I get what everyone's saying. But, just Devil's Advocate, what would be so terrible about software developers having to find some other line of work?

We've used our software development skills to automate other people out of work for what can be argued to be literally decades. Each time we did it, we certainly expected that the people affected would find other work. New jobs were created. The world didn't end. I honestly don't think it would be that much worse this time.

AstroBen•5mo ago

> what would be so terrible about software developers having to find some other line of work

Uh.. I'm having trouble considering this as a serious question. It's objectively going to lead to them being in a worse situation. Mostly irrelevant resume and needing to re-skill into something and start from the bottom.. out of a well paid career that many enjoy and find fulfilling

My question wasn't an ethical one. It's why are the people that are the target of this automation happy about the progress, to the point of trying to push it forward faster, cheering it on

jampekka•5mo ago

I agree that with the current economic structures a lot of us will end up worse off. Just like e.g. manufacturing workers did.

But the automation is not the problem, it's the economic structure in which increased efficiency makes a lot of people worse off.

AstroBen•5mo ago

Yeah you're right. Improving productivity for society should be a really exciting time for everyone.. instead we just leave the affected with nothing

palmotea•5mo ago

> We've used our software development skills to automate other people out of work for what can be argued to be literally decades.

And that's the shitty part of the job, and everyone should be uncomfortable with it. I haven't literally automated anyone out of a job (that I know), but I definitely did not like finding out (after the fact) that one project was meant to enable a large offshoring effort.

> Each time we did it, we certainly expected that the people affected would find other work.

I do not expect that. That's a comforting lie people tell themselves.

> New jobs were created. The world didn't end. I honestly don't think it would be that much worse this time.

It didn't end, but it often got significantly worse for some. If the AI hype pans out, it's going to get significantly worse for software engineers. Your "newly created job," if it exists, will likely pay out a lot less than you're used to. At best, you'll get knocked down to the bottom of the career ladder.

It's a mistake to think about things in aggregate like you're doing. It's easy to hide inconvenient truths.

sokoloff•5mo ago

I think in the success case (still TBD), that it will increase productivity to the point where things that can’t be affordably addressed by software will now be able to be addressed with software.

I expect that anyone who is a skilled dev today will be fine. Expectations and competition might be higher, but so will production and value creation.

I think the demand will come, just as Excel didn’t put finance people out of jobs in aggregate.

micromacrofoot•5mo ago

when in history have workers ever been the primary beneficiaries of productivity gains

sokoloff•5mo ago

Why would "primary beneficiary" be the most relevant question rather than mere "beneficiary"? If my life is improved by something, I don't care that someone else's life is improved by more; I don't want to reject that improvement out of spite, jealousy, or envy.

Bankers (and customers) benefited from ATMs as far more bank locations became economically sustainable and bank tellers could do higher value work (and do so more safely).

Millions of software developers continue to benefit from improvements in productivity, the resulting value creation, and the resulting high pay in our sector from ever more productive languages and frameworks. Can you imagine how little pay you'd make trying to sling websites in assembly language at less than 1% of the pace of today?

micromacrofoot•5mo ago

> Millions of software developers continue to benefit from improvements in productivity

You're absurdly naive if you think developers will see the most benefit. We will have fewer developers just as we have fewer farmers and factory workers. When labor is automated it becomes owned by fewer people, this is historically consistent for over a hundred years across every sector. Thousands of towns have collapsed under this sort of change and effects are felt for generations.

> Can you imagine how little pay you'd make trying to sling websites in assembly language at less than 1% of the pace of today

Productivity gains do not align with income gains, this is a complete strawman. Developers today may be 100x more productive, but they do not have a 100x higher income.

Ask yourself, where did that value go, and is that fair? We're creating the automation and someone else is taking the lion's share of the benefit. We're being conned.

sokoloff•5mo ago

You seem hyper-focused on the share of benefit going to others and I am much more focused on the share of benefit going to my family. My family benefits enormously from the value created through technology development and I have benefited enormously from being able to work in a field where I am generously rewarded for doing things that I happily do free in my spare time. If I work on someone else's technology puzzles instead of my own, they are able and willing to pay me a well-above median salary in exchange.

I genuinely hope that they think they're getting rich as part of that exchange (and work to ensure that outcome happens), because that's the very best way that I know how to make the overall situation, including the benefits for me and my family, continue.

If you think I'm being conned in this exchange, thanks for the concern, but I'll tell you that I'm working hard to ensure that it keeps happening.

micromacrofoot•5mo ago

> they are able and willing to pay me a well-above median salary in exchange.

AI is what they're doing to try to stop this, when we work on AI we're enabling it.

They are making much more money for themselves than they are for you. Your salary is overhead. They will stop paying you if they can and they are trying to use AI to do it.

In a fair agreement you would have more time to spend with your family because you would earn a higher share of the profit and need to work less for it.

sokoloff•5mo ago

If I wanted to keep all of the value I created for myself, I'd start my own business and own all of it.

I don't, because I highly value the structure and capital that others have put up to create the company I work for. They offload an enormous amount of risk and overhead and, so long as they pay me what we've agreed, I'm happy for them to keep the portion of value that is above what they pay out to me and my colleagues.

The agreement is fair to my eyes, because I've agreed to it and both sides have kept up their ends. If yours is not fair to your eyes, perhaps you should change it, possibly up to striking out on your own and keeping 100% of the surplus value you create.

micromacrofoot•5mo ago

> it will increase productivity to the point where things that can’t be affordably addressed by software will now be able to be addressed with software

employing humans is something that couldn't be affordably addressed by software, and is what they're trying to now address with software

this is good for owners, and bad for workers

thusly AI is bad for workers as a class, and you're betting that you're one of the workers they decide to keep

all I have left to say on the matter is, good luck

palmotea•5mo ago

> It's mind blowing to me how many developers are happy about the developments here.. as if they're going to eventually be paid to just sit there while agents do everything. Ah, work is now so easy!

Software engineers are dumb. Really dumb.

avital•5mo ago

I work at OpenAI (not on Codex) and have used it successfully for multiple projects so far. Here's my flow:

- Always run more than one rollout of the same prompt -- they will turn out different

- Look through the parallel implementations, see which is best (even if it's not good enough), then figure out what changes to your prompt would have helped nudge towards the better solution.

- In addition, add new modifications to the prompt to resolve the parts that the model didn't do correctly.

- Repeat loop until the code is good enough.

If you do this and also split your work into smaller parallelizable chunks, you can find yourself spending a few hours only looping between prompt tuning and code review with massive projects implemented in a short period of time.

I've used this for "API munging" but also pretty deep Triton kernel code and it's been massive.
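The loop described in this comment can be sketched roughly as follows, in Python. Everything here is a hypothetical stand-in: `run_rollout` represents launching one agent task, `score` represents a human reviewer judging an output, and `revise` represents a human editing the prompt after seeing the best attempt. None of these are Codex APIs.

```python
from typing import Callable, Tuple


def refine_prompt_loop(
    prompt: str,
    run_rollout: Callable[[str], str],
    score: Callable[[str], float],
    revise: Callable[[str, str], str],
    rollouts: int = 3,
    good_enough: float = 0.9,
    max_rounds: int = 5,
) -> Tuple[str, str]:
    """Return (final_prompt, best_output) after iterating on the prompt."""
    best_output = ""
    for _ in range(max_rounds):
        # Run several rollouts of the same prompt; they may differ.
        outputs = [run_rollout(prompt) for _ in range(rollouts)]
        # Keep the best attempt, even if it is not good enough yet.
        best_output = max(outputs, key=score)
        if score(best_output) >= good_enough:
            break
        # Adjust the prompt based on what the best attempt got wrong.
        prompt = revise(prompt, best_output)
    return prompt, best_output


# Toy stand-ins so the sketch runs; real use would launch agent tasks and
# rely on a human reviewer for scoring and prompt edits.
final_prompt, best = refine_prompt_loop(
    prompt="abcdefg",
    run_rollout=lambda p: p,                    # the "output" is just the prompt
    score=lambda out: min(1.0, len(out) / 10),  # longer counts as better here
    revise=lambda p, best_out: p + "!",         # each revision adds detail
)
```

In practice the scoring and revising steps are the code review and prompt tuning the comment describes, so the loop is only as good as the reviewer driving it.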

owebmaster•5mo ago

Can it be used to fix bugs? Because the ChatGPT web app is full of them and I don't think they are getting fixed. Pasting big amounts of text freezing the tab is one of them.

dimal•5mo ago

Bugs? Those are grubby human work.

Seriously, everyone should get good at fixing bugs. LLMs are terrible at it when it’s slightly non-obvious and since everyone is focusing on vibe coding, I doubt they’ll get any better.

jampekka•5mo ago

The Android app is even worse.

owebmaster•5mo ago

If that is what the best unlimited AI can deliver we are safe for at least 10 years more.

ionwake•5mo ago

You guys are doing great work, codex too, keep at it.

th0ma5•5mo ago

Do you find yourself ditching on the things when they change something important with the new prompt? I don't get how people aren't absolutely exhausted by actually implementing this prompt messing advice when I thought there were studies saying small seemingly insignificant changes greatly change the result, hide blind spots, and even having a prompt for engineering a better prompt has knock on increases in instability. Do people just have a higher tolerance for doing work that is not related to the problem than I do? Perhaps I only work on stuff there is no prior example for, but every few days I read someone's anecdote on here and get discouraged in all new ways.

avital•5mo ago

Not to downplay the issue you raise but I haven't noticed this.

Every iteration I make on the prompts only makes the request more specific and narrow, and it's always gotten me closer to my desired goal for the PR. (But I do just ditch the worse attempts at each iteration cycle.)

Is it possible that reasoning models combined with the actual interaction with the real codebase make this "prompt fragility" issue you speak of less common?

th0ma5•5mo ago

No, I've played with all the reasoning models and they just make the noise and weirdness even worse. When I dig into every little issue, it's always something incredibly bespoke. Like the actual documentation that's on the internet is out of date for the library that was installed and the API changed, the way the one library works in one language is not how it works in the other language, just all manner of surprising things. I really learned a lot about the limits of digital representation of information.

csmpltn•5mo ago

> "Look through the parallel implementations, see which is best (even if it's not good enough), then figure out what changes to your prompt would have helped nudge towards the better solution."

How can non-technical people tell what's "best"? You need to know what you're doing at this point, look for the right pitfalls, inspect everything in detail... this right here is the entire counter-argument for LLMs eliminating SWE jobs...

throwuxiytayq•5mo ago

I don’t think anyone expects software engineers will disappear and get replaced by janitors trained to proompt. I’m sure experts will stick around until the singularity curve starts looking funny. It’s probably gonna suck to enter the industry from now on, though.

dingnuts•5mo ago

> I don’t think anyone expects software engineers will disappear

holy gaslighting Christ have some links, lots of people think that

https://www.reddit.com/r/ITCareerQuestions/comments/126v3pm/...

https://medium.com/technology-hits/the-death-of-coding-why-c...

https://medium.com/@TheRobertKiyosaki/are-programmers-obsole...

https://www.forbes.com/sites/hessiejones/2024/09/21/the-auto...

and on and on, endless thinkpieces about this. Certainly SOMEONE, someone with a lot of money, thinks software engineers are imminently replaceable.

> until the singularity curve starts looking funny.

well there's absolutely no evidence whatsoever that we've made any progress to bringing about Kurzweil's God so I think regardless of what Sam Altman wants you to believe about "general AI" or those thinkpieces, experts are probably okay.

cdolan•5mo ago

I think you are correct that people say this, but it's absurd that they are saying it in the first place.

Coding/engineering/etc is all problem solving in a structured manner.

That skill is not going anywhere

dingnuts•5mo ago

oh I agree but the last three years have felt like an endless chorus of people telling me SWE was going to be obsolete very soon so I had to push back against the idea that "nobody" thinks that.

I wouldn't have to listen to people talk about it all the time if nobody thought it was true

daveguy•5mo ago

(not GP) To be fair, just because someone says something doesn't mean they believe it. Most of those folks have to know they're being absurd. But I agree saying "nobody" thinks something is over the top. People on the internet can be quite looney tunes.

mediaman•5mo ago

A lot of people believe that programming is the typing of odd sequences of characters into a computer.

To them, it seems LLMs are also perfectly capable of typing odd sequences of characters.

The idea that SWEs do actual structured problem solving is mostly native to industry insiders.

daveguy•5mo ago

Thank you for this. A very well stated explanation of a major reason the hype is so off base from the people doing the work every day.

schainks•5mo ago

> proompt

The verb you use when you only need to produce boilerplate.

> Prompt™

The verb you use when it's time to innovate.

jazzyjackson•5mo ago

Well, right, how does one become a senior engineer in a world where no one needs to hire a junior? I'm sure many other industries have experienced this already, where the only people who know anything retire and the people who are left maintain a system they could not rebuild, such that when something goes wrong the only practicable choice is to replace it with new equipment.

That's where I see AI-written software going, write-once. Some talented engineer gets an AI system to create a whole k8s cluster to run an application and if any changes need to be made, bugs fixed, it will take another talented engineer to come in and have an AI write a replacement and throw out the old one.

Reminds me of this blog post, The real value isn’t in the code [0]: we're heading for a world that is only code, with no one who knows what it does. But maybe it won't matter.

[0] https://jonayre.uk/blog/2022/10/30/the-real-value-isnt-in-th...

weatherlite•5mo ago

> Well, right, how does one become a senior engineer in a world where no one needs to hire a junior?

You don't. Unless the person is super brilliant I just don't think the industry needs many more new people, there are enough for the next 1-2 decades and after that humans will probably not be needed at all.

People should go where the demand is - medicine, education, policing or whatever it may be.

paulryanrogers•5mo ago

> People should go where the demand is - medicine, education, policing or whatever it may be.

'Where' is becoming an increasingly small niche with ever higher educational requirements.

tmaly•5mo ago

One could put a lot of time into open source or run your own side hustle to build up experience to a senior engineer level.

I don't see the corporate path being the best way given the circumstances.

diggan•5mo ago

> How can non-technical people tell what's "best"? You need to know what you're doing at this point, look for the right pitfalls, inspect everything in detail... this right here is the entire counter-argument for LLMs eliminating SWE jobs...

I'm not sure a tool that positions itself as a "programmer co-worker" is aiming to be useful to non-technical people. I've said it before, but I don't think LLMs currently are at the stage where they enable you to do things you have 0 experience in, but rather can help you speed up working through things you are familiar with. I think people who claim LLMs will completely replace jobs are hyping the technology without really understanding it.

For example, I'm a programmer, but I had never done any firmware flashing with UART via a USB flasher before. Today I managed to do that in 1-2 hours thanks to ChatGPT helping me understand how to do it. If I'd done it completely on my own, I'm sure it would have taken me at least the full day. I was able to see when it got misled, and could rewrite/redirect from there on, but someone with 0 programming experience probably wouldn't have been able to.

fragmede•5mo ago

It depends on their setup and where they or the LLM get stuck. If an experienced programmer is there to back them up, then a total beginner could totally make something. That is, given some familiarity with the terminal, specifically the know-how to set up a git repo on GitHub and clone it locally, set up env keys and Aider, and run npm i and npm run dev, a non-programmer with some terminal skills is able to make simple games, purely by talking to Aider using the /voice command. When the LLM or they get stuck is when they'll need some backup from somebody with a decent amount of programming experience to get unstuck. Depending on what they're doing though, it's entirely possible they won't get stuck until much further along in the dev process.

ivraatiems•5mo ago

How much faster is this than simply writing the code yourself?

avital•5mo ago

Easily 5-10x or even more in certain special cases (when it'd take me a lot of upfront effort to get context on some problem domain). And it can do all the "P2"s that I'd realistically never get to. There was a day where I landed 7 small-to-medium-size pull requests before lunch.

There are also cases where it fails to do what I wanted, and then I just stop trying after a few iterations. But I've learned what to expect it to do well in and I am mostly calibrated now.

The biggest difference is that I can have agents working on 3-4 parallel tasks at any given point.

atonse•5mo ago

This has been my experience too. Certain tickets that would’ve taken me hours (and in one case, days), I’ve been able to finish in minutes.

Other tasks take maybe the same amount of time.

But just autocomplete saves micro-effort all day long.

thearn4•5mo ago

I end up asking the same question when experimenting with tools like Cursor. When it can one-shot a small feature, it works like magic. When it struggles, and the context gets poisoned and I have to roll back commits and retry part of the way through something, it hits a point where it was probably easier for me to just write it. Or maybe template it and have it finish it. Or vice versa. I guess the point being that best practices have yet to truly be established, but totally hands-off uses have not worked well for me so far.

sunnybeetroot•5mo ago

Why commit halfway through implementing something with Cursor? Can you not wait until it’s created a feature or task that has been validated and tests written for it?

daveguy•5mo ago

Why not create a branch and rollback only what needs to be rolled back? Branches are O(1) with git, right?

sunnybeetroot•5mo ago

OP was insinuating that rolling back commits is a pain point.

daveguy•5mo ago

Well, same statement applies. Rolling back commits is also O(1) and just as easy. And if you branch to start with it's not even a "rollback" through the commit history, it's just a branch switch. Feel like OP has never used git before or something.
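
To make the "branches are O(1)" claim concrete, here is a toy model in Python. It is not git's implementation: `Commit` and `ToyRepo` are hypothetical names invented for this sketch, and it assumes only that a branch is a name pointing at a commit. Real git also has to update the working tree when you switch or reset, which this model ignores.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Commit:
    """Immutable snapshot that points at its parent (toy stand-in for a git commit)."""
    message: str
    parent: Optional["Commit"] = None


class ToyRepo:
    """Hypothetical miniature repo: a branch is just a name mapped to a commit."""

    def __init__(self) -> None:
        self.branches = {"main": Commit("initial commit")}
        self.head = "main"

    def commit(self, message: str) -> Commit:
        # A new commit points at the current tip; the branch name moves forward.
        new = Commit(message, parent=self.branches[self.head])
        self.branches[self.head] = new
        return new

    def branch(self, name: str) -> None:
        # Creating a branch copies one reference, regardless of history size.
        self.branches[name] = self.branches[self.head]

    def switch(self, name: str) -> None:
        if name not in self.branches:
            raise KeyError(f"no such branch: {name}")
        self.head = name

    def rollback(self, steps: int = 1) -> None:
        # Follow `steps` parent links, then move the branch name with one assignment.
        target = self.branches[self.head]
        for _ in range(steps):
            if target.parent is None:
                raise ValueError("cannot roll back past the initial commit")
            target = target.parent
        self.branches[self.head] = target

    def log(self) -> list[str]:
        # Newest-first list of commit messages reachable from the current branch.
        messages = []
        node = self.branches[self.head]
        while node is not None:
            messages.append(node.message)
            node = node.parent
        return messages


if __name__ == "__main__":
    repo = ToyRepo()
    repo.branch("agent-attempt")      # isolate the agent's work
    repo.switch("agent-attempt")
    repo.commit("agent: step 1")
    repo.commit("agent: step 2")
    repo.rollback(1)                  # drop only the bad step
    print(repo.log())
    repo.switch("main")               # or abandon the attempt entirely
    print(repo.log())
```

In this model, abandoning a bad attempt is a branch switch and undoing a step is a pointer move; nothing is copied. Whether that matches the pain thearn4 described depends on how the editor integrates with git, which the thread does not settle.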

fragmede•5mo ago

Which seems like a tooling issue, imo. In Aider, it's just /undo.

fragmede•5mo ago

Why wait until everything is finalized before committing? Git is distributed/local, so while one philosophy is to interact with it as little as possible, the other one is to commit early and commit often, and easily be able to roll back to a previous (working) state, with the caveat that you clean up history before firing off a PR.

dgunay•5mo ago

At the current capabilities of most LLMs + my personal tolerance for slop, the most productive workflow seems to be: spin up multiple agents in the background to work on small scope, straightforward tasks while I work on something bigger that requires more exploration, requirements gathering, or just plain more complex/broad changes to the code. Review the output of the agents or unstick them when there is downtime.

IMO just keeping an IDE window open and babysitting an agent while it works is less productive than just writing the code mostly yourself with AI assistance in the form of autocomplete and maybe highly targeted oneshots using manual context provided "Edit" mode or inline prompting.

My company is dragging their feet on AI governance and let the OpenAI key I was using expire, and what I noticed was that my output of small QoL PRs and bugfixes dropped drastically because my attention remains focused on higher impact work.

SkyPuncher•5mo ago

For me, it’s not that the actual coding is faster. It’s that you can do other things at the same time.

If I’m writing an integration, I can be researching the docs while the agent is coding something up. Worst case, I throw all of the agent’s work away while now having done research. Best case, it gets a good enough implementation that I can run with.

weatherlite•5mo ago

> Worst case, I throw all of the agents work away while now having done research

The worst case is you take the agent's work without really understanding it, continue doing it indefinitely and at some point get a buggy repo you have no idea how to handle - at the exact same moment some critical issue pops up and your agent has no clue how to help you anymore.

krageon•5mo ago

I don't think GP said they couldn't do their job, but you instantly jumped to incompetence. That seems a little uncharitable to me.

weatherlite•5mo ago

Have no idea what GP can or cannot do and wasn't talking about that. I'm describing the worst case that can happen when people work with agents, and it can happen to anyone who isn't carefully verifying and testing the agent's work.

m_fayer•5mo ago

Totally. I feel like it’s akin to jamming with someone. We both go down our own paths for a bit, then I have a next step for it, and I can review what it last came up with and iterate while it does more of its own thing. Rinse, repeat. This is more fun and less energy consuming than “do it all yourself”, which certainly means a lot.

This way works for me. Any time I tried to treat it as a colleague that I can just assign tasks to, it’s failed miserably.

yieldcrv•5mo ago

how much would this cost you if you didn't work at OpenAI?

avital•5mo ago

I think the Pro plan is $200/mo for everyone? (But honestly I don't know the GPU cost and I'm interested in this question)

yieldcrv•5mo ago

I thought you had privileged and complimentary access from working there

macrolime•5mo ago

Sounds like you're manually doing something that could form the basis of further reinforcement learning.

Nudging the UI slightly for this exact flow could generate good training data.

rmonvfer•5mo ago

I was a Plus subscriber and upgraded to Pro just to test Codex, and at least in my experience, it’s been pretty underwhelming.

First, I don’t think they got the UX quite right yet. Having to wait for an undefined amount of time before getting a result is definitely not the best, although the async nature of Codex seems to alleviate this issue (that is, being able to run multiple tasks at once).

Another thing that bugs me is having to define an environment for the tool to be useful. This is very problematic because AFAIK, you can’t spin up containers that might be needed in tests, severely limiting its usefulness. I guess this will eventually change, but the fact that it’s also completely isolated from the internet seems limiting, as one of the reasons o3 is so powerful in ChatGPT is because it can autonomously research using the web to find updated information on whatever you need.

For comparison, I also use Claude a lot, and I’ve found it to work really well to find obscure bugs in a somewhat complex React application by creating a project and adding the GitHub repo as a source. What this allows me is to have a very short wait time, and the difference with Codex is just night and day. Gemini also allows you to do this now, and it works very well because of its massive context window.

All that being said, I do understand where OpenAI is going with this. I guess they want to achieve something like a real coworker (they even say that in their promotional videos for Codex) because you are supposed to give tasks to Codex and wait until it’s done, like a real human, but again, IMHO, it’s too “pull-request-focused”

I guess I’ll be downgrading to Plus again and wait a little to see where this ends up.

anxman•5mo ago

It really needs container support

IanCal•5mo ago

I agree on the UX. A few basic things seem totally broken.

The flow of connecting a github account works, then disconnects, sometimes doesn't work, sometimes just errors. I can't install things that I could yesterday and my environment is just... broken? I have two versions of a repo and it works in only one.

Speed is a big thing. Not the llm stuff so much, but the setup and everything around it for each step.

Not having search cripples some cases where o3 seems incredible.

But there are a lot of places where this feels like it can land tasks that often wouldn't get done. A near infinite army of juniors who can take on lots of tiny tasks in 15-20 minutes is great. Fix some typos, add a few util functions (a task I have running right now); I even just asked it to add a new endpoint to a server and it added it, along with the migrations needed, tests and more, and it seems alright.

The ideal workflow here, in a way, is that the people asking for these things get to tag the ticket to codex/whatever, it runs off and does the thing, the PR lands and discussion and changes happen there, demo envs are set up and then someone can check and approve it.

edit -

To be fair, I also used Firebase Studio and that was worse. Blank screens, errors in the console, and when I refreshed and moved around and got an actual page, it ended up failing to set up Firebase. The UI for editing and code totally failed after that, and I couldn't follow the fix instructions I was linked to.

ifwinterco•5mo ago

It's a shame nobody has invented some sort of computerised intelligence that understands code and could fix some of those bugs. Ah well

alexjplant•5mo ago

> AFAIK, you can’t spin up containers that might be needed in tests, severely limiting its usefulness.

This is what's blocking me right now. I couldn't find any documentation on whether they allow Docker-in-Docker which typically means that the answer is "no". Since I'm building an AWS-native app I use LocalStack for end-to-end tests which requires a container engine. Codex not having it is a showstopper.

mdaniel•5mo ago

This might not help you but to the very best of my knowledge localstack can operate over the network just fine and I am pretty sure it has a reset endpoint for zeroing its state (I think it's this https://github.com/localstack/localstack/blob/v4.4.0/localst... )

The other alternative is that I've seen folks mention systemd-nspawn as a form of isolation if that's what you're using docker for (but I've never tried it myself)

ramesh31•5mo ago

Needs checkpointing. A full git commit is too much... commitment. Often you'll go down a bad path with agentic codegen that just falls apart, and you won't know where you wanted to return to until you're there. I'm very skeptical of the "automated PR" solutions at the moment. Too much time and money is lost to trust singleshot yet. And if you still need a human in the loop, best to do it in realtime with constant feedback, i.e. cybernetics not automata.
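
One way to picture checkpointing that is lighter than a commit is a stack of directory snapshots taken between agent steps. The sketch below is an assumption about what such a feature could look like, not how Codex or any other tool works; `CheckpointStack` and its methods are hypothetical names, and it copies whole directories, which would only be practical for small projects.

```python
import shutil
import tempfile
from pathlib import Path


class CheckpointStack:
    """Hypothetical helper: labelled snapshots of a working directory."""

    def __init__(self, workdir) -> None:
        self.workdir = Path(workdir)
        # Snapshots live outside the project so they never get committed.
        self._store = Path(tempfile.mkdtemp(prefix="checkpoints-"))
        self._labels = []

    def save(self, label: str) -> int:
        """Copy the working directory and return the checkpoint's index."""
        index = len(self._labels)
        shutil.copytree(self.workdir, self._store / str(index))
        self._labels.append(label)
        return index

    def labels(self) -> list:
        """Labels of all checkpoints, oldest first."""
        return list(self._labels)

    def restore(self, index: int) -> None:
        """Replace the working directory with a snapshot and drop later ones."""
        if not 0 <= index < len(self._labels):
            raise IndexError(f"no checkpoint {index}")
        shutil.rmtree(self.workdir)
        shutil.copytree(self._store / str(index), self.workdir)
        # Later checkpoints belong to the abandoned path, so discard them.
        for later in range(index + 1, len(self._labels)):
            shutil.rmtree(self._store / str(later))
        del self._labels[index + 1:]


if __name__ == "__main__":
    project = Path(tempfile.mkdtemp()) / "project"
    project.mkdir()
    (project / "app.py").write_text("print('working')\n")

    checkpoints = CheckpointStack(project)
    good = checkpoints.save("before agent run")

    (project / "app.py").write_text("print('broken by agent')\n")
    checkpoints.save("agent step 1")

    checkpoints.restore(good)  # decide after the fact where to return to
    print((project / "app.py").read_text(), end="")
```

The point of the design is that saving is cheap and implicit, and the decision about which state was the good one is made only after the bad path has played out, which is the situation described above.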

teekert•5mo ago

“As I wrote about in Walking and talking with AI in the woods, ideally I'd like to start my morning in an office, launch a bunch of tasks, get some planning out of the way, and then step out for a long walk in nature.”

Wouldn’t we all want that, but it sounds like you can leave task launching and planning to an AI and go find another career.

bathtub365•5mo ago

Is there anywhere that lists what languages this supports? They aren’t listed in the product announcement or in this review, and the review examples seem to mostly be fixing typos on webpages.

yieldcrv•5mo ago

> Codex then clones your repositories into its own sandboxes so it can run commands and create branches on your behalf.

Slurping up trade secrets

but maybe I'll sound like the people that are afraid of using github and other cloud git protocols

interesting crossroads

theowijrhrjrj48•5mo ago

Sounds like a gptel-tool one can whip up in a week.

turing_complete•5mo ago

Is codex-1 no longer available through the API?

JR1427•5mo ago

I've found it really helpful in rummaging around an unfamiliar codebase, and pointing me to relevant parts of it.

The application of patches is hit and miss. If the changes are across multiple files, I find it gets stuck going in circles.

But still been a definite net positive in terms of productivity.

ryanackley•5mo ago

If you're building a React app using a popular UI framework, AI will seem like magic at how well it one-shots things.

To the author's point about one-shotting: I think it will be a real challenge pushing an AI coding workflow forward because of this problem. In my experience, AI seems to fall off a cliff when you ask it to write code using more obscure libraries and frameworks. It will always hallucinate something rather than admitting it has no knowledge of how something works.

IanCal•5mo ago

I've had better success with things like o3 with search, because it can actually go and read docs to help fix problems. It helped me dig through matrix specs, proposals and PRs and while the first suggestion didn't work (I thought it would have done) it ended up finding proof that only part of that got merged and found how to enable the experimental side that allowed the other. The iteration of searching and going through things was incredibly helpful. Probably saved me a good few hours or meant I was able to do this at all.

jFriedensreich•5mo ago

The "phone and work away from desk" point struck me as absurd. If anything, work is pushed to code review and testing, which mostly require even more screen real estate than coding itself.

