
OpenAI Codex Review

https://zackproser.com/blog/openai-codex-review
63•fragmede•3h ago

Comments

maxwellg•2h ago
Being able to make quick changes across a ton of repos sounds awesome. I help maintain a ton of example apps, and doing things like updating a README to conform to a new format, or changing a link, gets pretty tedious when there are 20 different places to do it. If I could delegate all that busywork to Codex and smash the merge button later I would be happy.
zackproser•2h ago
Me too :)

I feel it will get there in short order, but for the time being I think we'll be handing smaller, scattershot maintenance tasks to Codex while continuing to build and do serious refactoring in an IDE...

datadrivenangel•2h ago
40-60% success rate for smaller things is pretty good. Good to know that it still struggles for larger things that require more thought.
CSMastermind•39m ago
In my testing, it gets completely lost on anything that requires a bit of critical thought. It's about on par with a bad junior engineer at this point.

For instance, I ask it to make a change and as part of the output it makes a bunch of values on the class nullable to get rid of compiler warnings.

This technically "works" in the sense that it made the change I asked for and the code compiles but it's clearly incorrect in the sense that we've lost data integrity. And there's a bunch of other examples like that I could give.
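That failure mode reads roughly like this (a Python/mypy-flavored sketch; the original was presumably a statically typed compiled language, and `User`/`send_welcome` are made up for illustration):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class User:
    name: str
    # Before the "fix" this was `email: str`, and the type checker flagged
    # call sites that constructed a User without one. Making it Optional
    # silences the warning but quietly drops the guarantee that every
    # User has an email -- the data-integrity loss described above.
    email: Optional[str] = None

def send_welcome(user: User) -> str:
    # Every consumer now has to defend against None at runtime.
    if user.email is None:
        raise ValueError(f"{user.name} has no email")
    return f"welcome sent to {user.email}"
```

The code compiles (type-checks) either way; the difference only shows up when a caller hits the None path in production.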

If you just let it run loose on a codebase without close supervision you'll devolve into a mess of technical debt pretty quickly.

swyx•2h ago
i shared my review inside of the pod with the team (https://latent.space/p/codex) but basically:

- it's a GREAT oneshot coding model (in the pod we find out that they specifically finetuned for oneshotting OAI SWE tasks, eg prioritized over being multiturn)

- however comparatively let down by poorer integrations (eg no built in browser, not great github integration - as TFA notes "The current workflow wants to open a fresh pull request for every iteration, which means pushing follow-up commits to an existing branch is awkward at best." - yeah this sucks ass)

fortunately the integrations will only improve over time. i think the finding that you can do 60 concurrent Codex instances per hour is qualitatively different than Devin (5 concurrent) and Cursor (1 before the new "background agents").

btw

> I haven't yet noticed a marked difference in the performance of the Codex model, which OpenAI explains is a descendant of GPT-3 and is proficient in more than 12 programming languages.

incorrect, it's an o3 finetune.

canadiantim•1h ago
How do you find it compares to Claude Code?
viscanti•1h ago
It's much more conservative in the scope of tasks it will attempt, and it's much slower. You need to fire and forget several parallel tasks, because you'll be waiting 10+ minutes before you get anything you can review and give feedback on.
swyx•1h ago
right now it's apples and oranges, literally only because of 1) unlimited unmetered use and 2) being in the browser, so async and parallel. like that stuff just trumps actual model and agent harness differences because it removes all barriers from thought to code.
liuliu•1h ago
The particular integration pain point for me is network access, which prevents several mundane tasks from being offloaded to Codex:

1. It cannot git fetch and sync with upstream to fix integration bugs. 2. It cannot pull in a new library as a dependency and do integration evaluations.

Besides that, not being able to apt install in the setup script is annoying (I believe they blocked the domain to prevent apt install).

The agent itself is a bit meh, often opting to git grep rather than read all the source code to get contextual understanding (from what the UI has shown).

andrewmunsell•4m ago
> incorrect, its an o3 finetune.

This is OpenAI's fault (and literally every AI company is guilty of the same horrid naming schemes). Codex was an old model based on GPT-3, but then they reused the same name for both their Codex CLI and this Codex tool...

I mean, just look at the updates to their own blog post, I can see why people are confused.

https://openai.com/index/openai-codex/

atonse•1h ago
I'm actually curious about using this sort of tool to allow non-devs to make changes to our code.

There are so many content changes or small CSS fixes (ones you'd verify visually anyway) where I really don't want to be bothered being involved in writing them, but I'm happy to do a code review.

Letting a non-dev see the ticket, kick off a coding run, test whether it was fixed, and then just say "yea this looks good," with me then looking at the code, seems like a good workflow for most of the minor bugs/enhancements in our backlog.

SketchySeaBeast•1h ago
Even content changes can require deliberate thought. Any system of decent size is probably going to have upstream/downstream dependencies - adding a field might require other systems to account for it. I guess I can see small CSS changes, but how does the user know when the change is small or "small"?
rgbrgb•1h ago
Perhaps the system could tell them 80% of the time and the reviewer catches the other 20%. An easy heuristic that usually would work in this case is lines of code. It's a classically bad way to measure impact / productivity but it's definitely an indicator and this is probably a rare instance where the measurement would not break efficacy of the metric (Goodhart's law) and might actually improve the situation.
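That lines-of-code heuristic could be sketched like this (illustrative only, not any real Codex API): count the changed lines in a unified diff and flag anything over a threshold for a dev to review.

```python
def changed_line_count(unified_diff: str) -> int:
    """Count added/removed lines in a unified diff, ignoring headers."""
    count = 0
    for line in unified_diff.splitlines():
        if line.startswith(("+++", "---", "@@")):
            continue  # file headers and hunk markers, not changes
        if line.startswith(("+", "-")):
            count += 1
    return count

def needs_dev_review(unified_diff: str, threshold: int = 20) -> bool:
    """Crude 'is this change actually small?' gate based on lines touched."""
    return changed_line_count(unified_diff) > threshold
```

A two-line CSS tweak sails through; a "small" change that touches dozens of lines gets routed to the reviewer, which is roughly the 80/20 split suggested above.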
SketchySeaBeast•1h ago
But that's what I mean: things can look small, and be easy to change in the place where the change is asked for, but hidden under the iceberg is a bunch of requirements around that field: data stores, generated PDFs, whether or not that field needs to be added to other calls that aren't in this codebase.
ChadMoran•1h ago
People will learn about accessibility, multi-platform (mobile/desktop) and many other gotchas real quick.

This almost seems like a funnel to force people to become software engineers.

atonse•1h ago
But these are all things that can be added to context by a dev.

Like:

- When making CSS changes, make sure that the code is responsive. Add WCAG 2.0 attributes to any HTML markup.

- When making changes, run <some accessibility linter command> to verify that the changes are valid.

etc.

The non-dev doesn't need to know/care.
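Those standing instructions can live with the repo rather than in each prompt. A minimal sketch of an `AGENTS.md`, the instruction file Codex reads (the specific rules here are illustrative, not from the article):

```markdown
# AGENTS.md

## CSS changes
- Keep all layout responsive; check mobile and desktop widths.
- Follow WCAG 2.0: preserve focus states and color contrast.

## Before opening a PR
- Run the project's accessibility linter and fix any reported errors.
- Keep diffs small; one fix per pull request.
```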

lelandfe•57m ago
There is no robust accessibility linter tool (axe covers only a portion) and you don't want to add ARIA attributes to all HTML markup. Both "accessible" and "responsive" are ultimately subjective, and all code gen tools I've used are more than happy to introduce startling a11y regressions.

It'll probably get there eventually, but today these are not things solvable with context.

dwb•37m ago
Accessibility isn’t something that can be judged by a program, not even close.
micromacrofoot•1h ago
> Codex will support me and others in performing our work effectively away from our desks.

This feels so hopelessly optimistic to me, because "effectively away from our desks" for most people will mean "in the unemployment line"

zackproser•1h ago
Maybe, maybe that's FUD...I can't predict the future.
righthand•22m ago
You can’t predict the future or are choosing to ignore the future?

Are you pretending that automation doesn’t take away human jobs?

ninininino•17m ago
I guess maybe the analogy is we as software devs are all horses.

With Codex and Claude Code, these model agents are cars.

Some of horses will become drivers of cars and some of us will no longer be needed to pull wagons and will be out of a job.

Is that the proper framing?

allturtles•10m ago
> Some of horses will become drivers of cars

An amusing image, but your analogy lost me here.

chw9e•5m ago
Think we've got a long time yet for that. We're going to be writing code a lot faster, but getting these things to 90-95% on such a wide variety of tasks is going to be a monumental effort; the first 60-70% of anything is always much easier than the last 5-10%.

Also, there's a matter of taste. As commented above, the best way to use these is going to be running multiple runs at once (that's going to be super expensive right now, so we'll need inference improvements on today's SOTA models to make this something we can reasonably do on every task). Then somebody needs to pick which run made the best code, and even then you're probably going to want review from a human if the code is written by machine.

Trusting the machine and just vibe coding stuff is fine for small projects or maybe even smaller features, but for a codebase that's going to be around for a while I expect we're going to want a lot of human involvement in the architecture. AI can help us explore different paths faster, but humans need to be driving it still for quite some time - whether that's by encoding their taste into other models or by manually reviewing stuff, either way it's going to take maintenance work.

In the near term, I expect engineering teams to start looking at how to leverage background agents more. New engineering flows need to be built around these, and I am bearish on the current status quo of just outsourcing everything to the beefiest models and hoping they can one-shot it. Reviewing a bunch of AI code is also terrible, and we have to find a better way of doing that.

I expect since we're going to be stuck on figuring out background agents for a while that teams will start to get in the weeds and view these agents as critical infra that needs to be designed and maintained in-house. For most companies, foundation labs will just be an API call, not hosting the agents themselves. There's a lot that can be done with agents that hasn't been explored much at all yet, we're still super early here and that's going to be where a lot of new engineering infra work comes from in the next 3-5 years.

avital•51m ago
I work at OpenAI (not on Codex) and have used it successfully for multiple projects so far. Here's my flow:

- Always run more than one rollout of the same prompt -- they will turn out different

- Look through the parallel implementations, see which is best (even if it's not good enough), then figure out what changes to your prompt would have helped nudge it towards the better solution.

- In addition, add new modifications to the prompt to resolve the parts that the model didn't do correctly.

- Repeat loop until the code is good enough.

If you do this and also split your work into smaller parallelizable chunks, you can find yourself spending a few hours just looping between prompt tuning and code review, with massive projects implemented in a short period of time.

I've used this for "API munging" but also pretty deep Triton kernel code and it's been massive.
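The loop above can be sketched as follows (everything here is hypothetical scaffolding: `run_rollout` stands in for launching a Codex task, and `score` stands in for the human review that picks the best candidate):

```python
import random

def run_rollout(prompt: str, seed: int) -> str:
    # Stand-in for one Codex rollout; real rollouts of the same
    # prompt diverge, which is why running several is useful.
    rng = random.Random(seed)
    return f"candidate-{rng.randint(0, 9)}: {prompt}"

def score(candidate: str) -> int:
    # Stand-in for review; in the real flow a human picks the best diff.
    return len(candidate)

def best_of_n(prompt: str, n: int = 4) -> str:
    """Run n rollouts of the same prompt and keep the highest-scoring one."""
    candidates = [run_rollout(prompt, seed=i) for i in range(n)]
    return max(candidates, key=score)

def refine(prompt: str, good_enough, max_rounds: int = 5) -> str:
    """Loop: rollouts -> pick best -> tweak prompt -> repeat."""
    best = ""
    for _ in range(max_rounds):
        best = best_of_n(prompt)
        if good_enough(best):
            break
        # In the real flow you'd edit the prompt by hand here, based on
        # what the better candidates got right.
        prompt += " (be more specific)"
    return best
```

The point of the sketch is the shape of the loop, not the stubs: fan out identical prompts, select, fold what you learned back into the prompt, repeat until review passes.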

owebmaster•21m ago
Can it be used to fix bugs? Because the ChatGPT web app is full of them, and I don't think they're getting fixed. Pasting large amounts of text freezing the tab is one of them.
