For instance, I ask it to make a change and as part of the output it makes a bunch of values on the class nullable to get rid of compiler warnings.
This technically "works" in the sense that it made the change I asked for and the code compiles but it's clearly incorrect in the sense that we've lost data integrity. And there's a bunch of other examples like that I could give.
If you just let it run loose on a codebase without close supervision you'll devolve into a mess of technical debt pretty quickly.
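To make the nullable point concrete, here is a minimal sketch in Python. The original complaint was presumably about a statically typed language, and the class and field names here (`OrderLoose`, `OrderStrict`, `customer_id`, `total_cents`) are made up for illustration. The loose version keeps the type checker quiet but lets an empty record through; the strict version refuses it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OrderLoose:
    # The "make it nullable" fix: every field is optional with a None default,
    # so nothing complains, and nothing guarantees the data is actually there.
    customer_id: Optional[str] = None
    total_cents: Optional[int] = None


@dataclass
class OrderStrict:
    # Fields are required, so a half-built order cannot be constructed.
    customer_id: str
    total_cents: int

    def __post_init__(self) -> None:
        # Runtime checks for the invariants the types alone cannot express.
        if not self.customer_id:
            raise ValueError("customer_id must be non-empty")
        if self.total_cents < 0:
            raise ValueError("total_cents must be non-negative")
```

`OrderLoose()` constructs fine with both fields set to `None`, which is exactly the lost data integrity; `OrderStrict()` raises a `TypeError` instead.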
"Because we’re doing a fair amount of dynamic/Reflect.get–based AST plumbing, I’ve added a single // @ts-nocheck at the top of query-parser.ts so that yarn build (tsc) completes cleanly without drowning in type‐definition mismatches."
Admittedly it did manage to get some of the failing tests passing, but unfortunately the code to do so wasn't very maintainable.
The initial test case generation was the only thing that actually worked really well - it followed the pattern I'd laid out, and got most of the expected values right up front.
- it's a GREAT oneshot coding model (in the pod we find out that they specifically finetuned for oneshotting OAI SWE tasks, eg prioritized over being multiturn)
- however comparatively let down by poorer integrations (eg no built in browser, not great github integration - as TFA notes "The current workflow wants to open a fresh pull request for every iteration, which means pushing follow-up commits to an existing branch is awkward at best." - yeah this sucks ass)
fortunately the integrations will only improve over time. i think the finding that you can do 60 concurrent Codex instances per hour is qualitatively different than Devin (5 concurrent) and Cursor (1 before the new "background agents").
btw
> I haven't yet noticed a marked difference in the performance of the Codex model, which OpenAI explains is a descendant of GPT-3 and is proficient in more than 12 programming languages.
incorrect, it's an o3 finetune.
1. Cannot git fetch and sync with upstream, fixing any integration bugs; 2. Cannot pull in a new library as a dependency and do integration evaluations.
Besides that, not being able to apt install in the setup script is annoying (they blocked the domain to prevent apt install, I believe).
The agent itself is a bit meh, often opting to git grep rather than reading all the source code to get contextual understanding (from what the UI has shown).
This is OpenAI's fault (and literally every AI company is guilty of the same horrid naming schemes). Codex was an old model based on GPT-3, but then they reused the same name for both their Codex CLI and this Codex tool...
I mean, just look at the updates to their own blog post, I can see why people are confused.
https://openai.com/index/openai-codex/
Edit:
Google just did it too. "Gemini Ultra" is both a model (https://deepmind.google/models/gemini/ultra/) and their new top-tier subscription plan (a la OpenAI's Pro plan). Why is this so difficult?
There are so many content changes or small CSS fixes (anyway you would verify that it was fixed by looking at it visually) where I really don't want to be bothered being involved in the writing of it, but I'm happy to do a code review.
Letting a non-dev see the ticket, start off a coding thing, test if it was fixed, and then just say "yea this looks good" before I look at the code seems like a good workflow for most of the minor bugs/enhancements in our backlog.
This almost seems like this is a funnel to force people to become software engineers.
Like:
- When making CSS changes, make sure that the code is responsive. Add WCAG 2.0 attributes to any HTML markup.
- When making changes, run <some accessibility linter command> to verify that the changes are valid.
etc.
The non-dev doesn't need to know/care.
It'll probably get there eventually, but today these are not things solvable with context.
This feels so hopelessly optimistic to me, because "effectively away from our desks" for most people will mean "in the unemployment line"
Are you pretending that automation doesn’t take away human jobs?
We should welcome automation and efficiency, but also address the situation of the "losers" of the development and not just expect the invisible hand will sort everything out.
I'd like you to be right, but I live in society where joy at work is often considered antithetical to productivity. No matter how much more productive I get, that space is used to fill in more productivity. We'll need more than tooling to stop this.
With Codex and Claude Code, these model agents are cars.
Some of us horses will become drivers of cars and some of us will no longer be needed to pull wagons and will be out of a job.
Is that the proper framing?
An amusing image, but your analogy lost me here.
I think CEOs or PMs or Founders are like horse jockeys. Devs are like horses. (Some of them are both the jockey and the horse).
AI is a car. CEO or PM or Founder might smoothly swap out the horse for a car and continue on with little change.
For the horse to become a driver of a car is a more difficult challenge, but not impossible. It needs to evolve.
Also there's a matter of taste, as commented above, the best way to use these is going to be running multiple runs at once (that's going to be super expensive right now so we'll need inference improvements on today's SOTA models to make this something we can reasonably do on every task). Then somebody needs to pick which run made the best code, and even then you're going to want code review probably from a human if it's written by machine.
Trusting the machine and just vibe coding stuff is fine for small projects or maybe even smaller features, but for a codebase that's going to be around for a while I expect we're going to want a lot of human involvement in the architecture. AI can help us explore different paths faster, but humans need to be driving it still for quite some time - whether that's by encoding their taste into other models or by manually reviewing stuff, either way it's going to take maintenance work.
In the near-term, I expect engineering teams to start looking for how to leverage background agents more. New engineering flows need to be built around these and I am bearish on the current status quo of just outsource everything to the beefiest models and hope they can one-shot it. Reviewing a bunch of AI code is also terrible and we have to find a better way of doing that.
I expect since we're going to be stuck on figuring out background agents for a while that teams will start to get in the weeds and view these agents as critical infra that needs to be designed and maintained in-house. For most companies, foundation labs will just be an API call, not hosting the agents themselves. There's a lot that can be done with agents that hasn't been explored much at all yet, we're still super early here and that's going to be where a lot of new engineering infra work comes from in the next 3-5 years.
Now you could argue that any non technical person could just oversee the agents instead. Possibly. Though in my experience, humans like to have other humans they trust oversee and understand important stuff for them.
If you’ve seen the work hours and work ethic of farmers, it’s safe to say that most of those people got other jobs that take far less work than farmers did/do.
Closer to our field, I think we’d have far worse work lives (fewer of us employed and much lower pay) if we had to code everything in assembler still. The creation of more powerful abstractions and languages allowed more of us to become software devs and make a living this way than if all we had were the less productive tools of the early days of computing.
We've used our software development skills to automate other people out of work for what can be argued to be literally decades. Each time we did it, we certainly expected that the people affected would find other work. New jobs were created. The world didn't end. I honestly don't think it would be that much worse this time.
Uh.. I'm having trouble considering this as a serious question. It's objectively going to lead to them being in a worse situation. Mostly irrelevant resume and needing to re-skill into something and start from the bottom.. out of a well paid career that many enjoy and find fulfilling
My question wasn't an ethical one. It's why are the people that are the target of this automation happy about the progress, to the point of trying to push it forward faster, cheering it on
But the automation is not the problem, it's the economic structure in which increased efficiency makes a lot of people worse off.
And that's the shitty part of the job, and everyone should be uncomfortable with it. I haven't literally automated anyone out of a job (that I know), but I definitely did not like finding out (after the fact) that one project was meant to enable a large offshoring effort.
> Each time we did it, we certainly expected that the people affected would find other work.
I do not expect that. That's a comforting lie people tell themselves.
> New jobs were created. The world didn't end. I honestly don't think it would be that much worse this time.
It didn't end, but it often got significantly worse for some. If the AI hype pans out, it's going to get significantly worse for software engineers. Your "newly created job," if it exists, will likely pay a lot less than you're used to. At best, you'll get knocked down to the bottom of the career ladder.
It's a mistake to think about things in aggregate like you're doing. It's easy to hide inconvenient truths.
I expect that anyone who is a skilled dev today will be fine. Expectations and competition might be higher, but so will production and value creation.
I think the demand will come, just as Excel didn’t put finance people out of jobs in aggregate.
Software engineers are dumb. Really dumb.
- Always run more than one rollout of the same prompt -- they will turn out different
- Look through the parallel implementations, see which is best (even if it's not good enough), then figure out what changes to your prompt would have helped nudge towards the better solution.
- In addition, add new modifications to the prompt to resolve the parts that the model didn't do correctly.
- Repeat loop until the code is good enough.
If you do this and also split your work into smaller parallelizable chunks, you can find yourself spending a few hours only looping between prompt tuning and code review with massive projects implemented in a short period of time.
I've used this for "API munging" but also pretty deep Triton kernel code and it's been massive.
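The loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `generate`, `review`, and `revise` are hypothetical callables standing in for "launch a rollout", "a human (or script) scores the result", and "edit the prompt based on the best attempt". None of them is a real Codex API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rollout:
    prompt: str
    code: str
    score: float  # the reviewer's judgment; higher is better


def refine(
    prompt: str,
    generate: Callable[[str, int], str],    # hypothetical: (prompt, rollout index) -> code
    review: Callable[[str], float],         # hypothetical: code -> quality score
    revise: Callable[[str, Rollout], str],  # hypothetical: (prompt, best rollout) -> new prompt
    n: int = 4,
    good_enough: float = 1.0,
    max_rounds: int = 5,
) -> Rollout:
    """Run n rollouts per round, keep the best, tune the prompt, repeat."""
    if n < 1 or max_rounds < 1:
        raise ValueError("n and max_rounds must be at least 1")

    best: Optional[Rollout] = None
    for _ in range(max_rounds):
        # Always run more than one rollout of the same prompt; they differ.
        rollouts = []
        for i in range(n):
            code = generate(prompt, i)
            rollouts.append(Rollout(prompt, code, review(code)))

        # Pick the best of this round and ditch the rest.
        round_best = max(rollouts, key=lambda r: r.score)
        if best is None or round_best.score > best.score:
            best = round_best

        # Stop once the code is good enough.
        if best.score >= good_enough:
            break

        # Otherwise nudge the prompt toward what the best attempt got right.
        prompt = revise(prompt, round_best)

    assert best is not None  # guaranteed because max_rounds >= 1
    return best
```

In practice `review` and `revise` are the human parts (reading the diffs and rewriting the prompt), so the function mostly documents the shape of the workflow rather than automating it.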
Seriously, everyone should get good at fixing bugs. LLMs are terrible at it when it’s slightly non-obvious and since everyone is focusing on vibe coding, I doubt they’ll get any better.
Every iteration I make on the prompts only makes the request more specific and narrow, and it's always gotten me closer to my desired goal for the PR. (But I do just ditch the worse attempts at each iteration cycle.)
Is it possible that reasoning models combined with the actual interaction with the real codebase makes this "prompt fragility" issue you speak of less common?
How can non-technical people tell what's "best"? You need to know what you're doing at this point, look for the right pitfalls, inspect everything in detail... this right here is the entire counter-argument for LLMs eliminating SWE jobs...
holy gaslighting Christ have some links, lots of people think that
https://www.reddit.com/r/ITCareerQuestions/comments/126v3pm/...
https://medium.com/technology-hits/the-death-of-coding-why-c...
https://medium.com/@TheRobertKiyosaki/are-programmers-obsole...
https://www.forbes.com/sites/hessiejones/2024/09/21/the-auto...
and on and on, endless thinkpieces about this. Certainly SOMEONE, someone with a lot of money, thinks software engineers are imminently replaceable.
> until the singularity curve starts looking funny.
well there's absolutely no evidence whatsoever that we've made any progress to bringing about Kurzweil's God so I think regardless of what Sam Altman wants you to believe about "general AI" or those thinkpieces, experts are probably okay.
Coding/engineering/etc is all problem solving in a structured manner.
That skill is not going anywhere.
I wouldn't have to listen to people talk about it all the time if nobody thought it was true
The verb you use when you only need to produce boilerplate.
> Prompt™
The verb you use when it's time to innovate.
I'm not sure a tool that positions itself as a "programmer co-worker" is aiming to be useful to non-technical people. I've said it before, but I don't think LLMs currently are at the stage where they enable you to do things you have 0 experience in, but rather can help you speed up working through things you are familiar with. I think people who claim LLMs will completely replace jobs are hyping the technology without really understanding it.
For example, I'm a programmer, but had never done any firmware flashing with UART via a USB flasher before. Today I managed to do that in 1-2 hours thanks to ChatGPT helping me understand how to do it. If I'd done it completely on my own, I'm sure it would have taken me at least the full day. I was able to see when it got misled, and could rewrite/redirect from there, but someone with 0 programming experience probably wouldn't have been able to.
There are also cases where it fails to do what I wanted, and then I just stop trying after a few iterations. But I've learned what to expect it to do well in and I am mostly calibrated now.
The biggest difference is that I can have agents working on 3-4 parallel tasks at any given point.
IMO just keeping an IDE window open and babysitting an agent while it works is less productive than just writing the code mostly yourself with AI assistance in the form of autocomplete and maybe highly targeted oneshots using manual context provided "Edit" mode or inline prompting.
My company is dragging their feet on AI governance and let the OpenAI key I was using expire, and what I noticed was that my output of small QoL PRs and bugfixes dropped drastically because my attention remains focused on higher impact work.
First, I don’t think they got the UX quite right yet. Having to wait for an undefined amount of time before getting a result is definitely not the best, although the async nature of Codex seems to alleviate this issue (that is, being able to run multiple tasks at once).
Another thing that bugs me is having to define an environment for the tool to be useful. This is very problematic because AFAIK, you can’t spin up containers that might be needed in tests, severely limiting its usefulness. I guess this will eventually change, but the fact that it’s also completely isolated from the internet seems limiting, as one of the reasons o3 is so powerful in ChatGPT is because it can autonomously research using the web to find updated information on whatever you need.
For comparison, I also use Claude a lot, and I’ve found it to work really well to find obscure bugs in a somewhat complex React application by creating a project and adding the GitHub repo as a source. What this allows me is to have a very short wait time, and the difference with Codex is just night and day. Gemini also allows you to do this now, and it works very well because of its massive context window.
All that being said, I do understand where OpenAI is going with this. I guess they want to achieve something like a real coworker (they even say that in their promotional videos for Codex) because you are supposed to give tasks to Codex and wait until it’s done, like a real human, but again, IMHO, it’s too “pull-request-focused”.
I guess I’ll be downgrading to Plus again and wait a little to see where this ends up.
Wouldn’t we all want that, but it sounds like you can leave task launching and planning to an AI and go find another career.
Slurping up trade secrets
but maybe I'll sound like the people that are afraid of using github and other cloud git protocols
interesting crossroads
I feel it will get there in short order..but for the time being I feel that we'll be doing some combination of scattershot smaller & maintenance tasks across Codex while continuing to build and do serious refactoring in an IDE...