OpenAI scores gold in one of the top programming competitions

https://www.msn.com/en-xl/news/other/openai-scores-gold-in-one-of-the-world-s-top-programming-competitions/ar-AA1KknUL
13•energy123•5mo ago

Comments

NitpickLawyer•5mo ago
So in the past month we've had:

- gold at the IMO

- gold at the IOI

- beat 9/10 humans in AtCoder Heuristics

- longer context, better models, routing calls to cheaper models, 4-6x cheaper inference for 90% of the top models' capabilities

- longer agentic sessions while staying coherent and solving tasks (30-90 min)

Yet every other post here is about "bubble this", "winter that", "plateauing this", "wall that"...

Are we in the denial stage, or the bargaining stage? Can't quite tell...

robertlagrant•5mo ago
You might've said the same thing about self-driving cars five years ago, or chess even longer ago. It turns out chess was soluble, so the nay-sayers were wrong, but self-driving cars aren't soluble (yet), so the yay-sayers were wrong.
energy123•5mo ago
People use low-compute models in their day-to-day jobs. They're not exposed to how well the very-high-compute runs are doing at the moment.
machiaweliczny•5mo ago
This. My younger brother thinks it's crap, but if you know the state of the art plus the research, things still seem to be moving quite fast. There's also a ton of product work happening on top of it already.
energy123•5mo ago
Even gpt-5 on "high" reasoning effort (which is likely higher than what people get in the Plus subscription; that's most likely "medium") is very, very low compute compared to the top runs behind IOI/IMO solutions.
Rick76•5mo ago
If that's the case, then why would OpenAI not want to release their best models while the AI race is still close? I would assume it's due to energy constraints, and if that's true, the opinion that this can't replace people remains valid.

Thermodynamics is the law of laws. Unless they invent some kind of ultra-efficient, almost magical computer to run these systems, it's simply not economical yet.

energy123•5mo ago
It's not a question of whether it's the case. It's confirmed by OpenAI employees on Twitter.

The reasons could be that it's new (they did say they plan to release eventually but not soon), or that it's too heavily scaffolded for the task and not sufficiently general.

tyleo•5mo ago
But can it maintain my legacy crud app with no tests, millions of LoC, long compile times?

One day but not yet. Beyond pure capabilities the companies making AI don’t seem to have any sort of moat so it’s a $$$ incinerator for them so far.

Like the late 90s internet I suspect we’re in a bubble. But also like the late 90s internet I suspect there’s more in store here in the future.

aleph_minus_one•5mo ago
> Yet every other post here and there are about "bubble this", "winter that", "plateauing this", "wall that"...

> Are we in the denial stage, or bargaining stage? Can't quite tell...

I can tell quite clearly that, even assuming the models were not specifically fine-tuned to win these competitions, these achievements transfer neither to the kind of coding I do at work nor to the kind I do privately at night.

At work, a lot of what needs to be done is

1. Asking people who are knowledgeable about the business logic why things were implemented a certain way (there are often good reasons, which are nevertheless quite subtle).

2. When a new requirement comes up, thinking deeply about how it fits into the huge legacy codebase. I am allowed to change things here as necessary (an enormous concession that is uncommon in this industry), but my changes must never cause the software to produce wrong results or break business-critical workflows. Such failures can cost my employer quite some money, or increase the workload of already overworked colleagues (who will then legitimately hate me :-( ), some of whom have to work under very tight deadlines in specific months. What counts as a "business-critical workflow" that must never break? Answering that requires understanding the very demanding users over many, many years (believe me: it is really subtle).

I cannot imagine how AIs could help with this.

Privately, I tend to write very experimental code for which one can very likely not find similar code on the internet. Think along the lines of turning deep scientific results into more "mainstream" code, or turning my avant-garde thoughts about some deep problems into code so that one can run experiments to see whether my ideas actually work.

Again something where AIs can barely help.

fragmede•5mo ago
I'm not arguing that they'd necessarily be any good at it, but hooking the LLM into your company's ticketing and communications platforms seems like an incredibly obvious way to address both of your points, so I'm not sure why it's unimaginable. It's not possible with the current SOTA, but it shouldn't be inconceivable either.
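For what it's worth, the wiring being described here is mundane in shape. A deliberately stubbed sketch (every name here is hypothetical, not from the thread or any real product, and the model call is a plain callable):

```python
def answer_why(ticket_id, ticket_db, chat_log, ask_llm):
    """Gather institutional history (ticket text plus related chat
    messages) and hand it to a model as context for the kind of
    'why was this implemented this way?' question raised above."""
    ticket = ticket_db[ticket_id]                        # e.g. dict of id -> text
    related = [m for m in chat_log if ticket_id in m]    # naive keyword link
    prompt = (
        "Why was this implemented this way?\n\n"
        f"Ticket: {ticket}\n"
        "Related discussion:\n" + "\n".join(related)
    )
    return ask_llm(prompt)
```

The hard part, as the parent comment implies, is not this plumbing but whether the model does anything useful with the retrieved context.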
bamboozled•5mo ago
Could we just be somewhere in the middle? Amazing models that have been tuned to win a certain competition and given far more compute than is feasible for everyday usage, while the daily general models are still useful but not AGI yet?
NitpickLawyer•5mo ago
Yeah, I agree. I dislike both the doomer content and the singularity content.

> the daily general models are still useful

Yup. I just had a ~30-minute session where gpt5-mini did everything I needed, almost zero-shot. Nothing complicated, but production code. I wanted to refactor a small service and wrote a ~4-sentence goal; it asked for read permissions on the repo, understood the API requirements perfectly, wrote itself a plan, and did the refactoring. All the previous tests pass (confirmed manually just to be sure). All good. It would have taken me maybe 4 hours?

fasterik•5mo ago
I don't think it's crazy to talk about plateaus; it just depends on what domain we're talking about. Performance on olympiad-style problems doesn't necessarily translate into success in research, industry, or creative pursuits. We know this is true for humans; add to that all the usual problems with LLMs, like hallucinations, and you can see why some people are still skeptical.

I'm still in the "wait and see" stage. Maybe throwing more compute at the problem will solve it, but maybe not. I would like to see benchmarks that take a more project-based approach, e.g. tell the LLM to go work on something complicated and ambiguous for a week and see what it comes up with.

SideburnsOfDoom•5mo ago
How many of the answers were verbatim in the training data?
animal531•5mo ago
I use GPT almost daily now and have noticed a funny thing, which is to be expected, really.

I can ask it to help me code, for example, a physics engine, so we're talking really hard and intricate code, and it'll come up with some amazing optimizations, including (recent) research-paper-level implementations.

Then I ask it to work on something relatively trivial; let's say we need a flow field. It'll think and reason about it just as well as in the first example, but then it'll start spitting out a lot of subpar code. Its error rate will increase 10x, while the global cohesiveness of the produced code will be substantially worse.

As to why that's happening: maybe it's being trained on many more, and worse, examples of the second kind, whereas the first is relatively "pure".

These programming competitions are pretty much the same thing, in my opinion. For us humans it's a hard challenge, but in general they're asking the same-ish questions, just in different formats. They should add some questions where the participant has to invent something new, or alternatively use two or more existing concepts in a totally novel fashion.
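For readers unfamiliar with the "relatively trivial" task mentioned above: a flow field is just a breadth-first search over a grid that gives every walkable cell a step direction toward a goal, commonly used for many-agent pathfinding in games. A minimal sketch (the grid-of-strings format and function name are my own assumptions, not from the comment):

```python
from collections import deque

def flow_field(grid, goal):
    """For every walkable cell, compute the direction (dx, dy) of one
    step along a shortest path toward `goal`.
    `grid` is a list of equal-length strings; '#' marks a wall."""
    h, w = len(grid), len(grid[0])
    # BFS outward from the goal, recording each cell's distance to it.
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        x, y = queue.popleft()
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < w and 0 <= ny < h and grid[ny][nx] != '#' \
                    and (nx, ny) not in dist:
                dist[(nx, ny)] = dist[(x, y)] + 1
                queue.append((nx, ny))
    # Each reachable cell points toward its lowest-distance neighbour.
    field = {}
    for (x, y), d in dist.items():
        if (x, y) == goal:
            continue
        field[(x, y)] = min(
            ((dist.get((x + dx, y + dy), d), (dx, dy))
             for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))),
            key=lambda t: t[0])[1]
    return field
```

The point of the comment stands regardless of the exact formulation: this is standard, well-trodden code, which is exactly why it is surprising when the output quality drops relative to the harder physics-engine case.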