The upcoming GPT-3 moment for RL

https://www.mechanize.work/blog/the-upcoming-gpt-3-moment-for-rl/

113•jxmorris12•4d ago

Comments

criemen•6h ago

They have a point about RL's increasing importance. From my outsider perspective, all major recent advances in model capabilities have come from RL, so it's natural to expect that we can "milk" RL more for performance gains. Scaling RL is a natural way to attempt that.

What I don't necessarily see is the generalization factor. Say we improve software engineering and math performance through RL (probably easier for software engineering than math due to the available training corpus). If that generalization doesn't hold, do the economics still work out? An expert-level software model would be useful to our profession, sure, but would it be enough to recoup the training costs if it's not applicable to other industries?

amelius•5h ago

Step 1. Train a VLM to supervise the RL training.

Step 2. Train the RL network. In the meantime, drink coffee or work on your plan for world domination.

criemen•5h ago

My understanding is that this is essentially how RLHF works, and it doesn't scale. As you run RL for longer, the model will learn how to cheat the imperfections of the grader, instead of getting better at the task at hand. Therefore, to scale RL you really need good graders, and determinism is king.

clbrmbr•4h ago

Do you think constitutional approaches would help here? (Verifiable reward for the main score, but then asking the model to self-critique for security and quality.)

amelius•3h ago

You're talking about training an LLM. I'm talking about training robotic/motor skills and haptic feedback.

mohsen1•5h ago

I’ve been exploring this too, since I rely on LLMs a lot to build software. I’ve noticed that our dev loop (writing, testing) is often mostly human-guided, but language models frequently outperform us in reasoning. If we plug in more automation (MCP tools controlling browsers, documentation readers, requirement analysers), we can make the cycle much more automated, with less human involvement.

This article suggests scaling up RL by exposing models to thousands of environments.

I think we can already achieve something similar by chaining multiple agents:

1. A “requirement” agent that uses browser tools to craft detailed specs from docs.

2. A coding agent that sets up environments (Docker, build tools) via browser or CLI.

3. A testing agent that validates code against specs, again through tooling.

4. A feedback loop where the tester guides the coder based on results.
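
A minimal sketch of the four steps above, assuming each agent is just a callable. Every name here (`run_pipeline`, `write_spec`, `write_code`, `run_tests`, and the toy stand-ins) is hypothetical and not taken from any real agent framework; a real setup would call models and tools where the toy functions return strings.

```python
from typing import Callable, Optional

def run_pipeline(
    write_spec: Callable[[str], str],
    write_code: Callable[[str, Optional[str]], str],
    run_tests: Callable[[str, str], Optional[str]],
    docs: str,
    max_rounds: int = 5,
) -> Optional[str]:
    """Return code that passes the tester, or None if the round budget runs out."""
    spec = write_spec(docs)                  # step 1: requirement agent
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        code = write_code(spec, feedback)    # step 2: coding agent
        feedback = run_tests(spec, code)     # step 3: testing agent (None means pass)
        if feedback is None:
            return code
        # step 4: the failure message is fed back to the coder on the next round
    return None

# Toy stand-ins so the sketch runs end to end.
def toy_spec(docs: str) -> str:
    return f"spec for: {docs}"

def toy_coder(spec: str, feedback: Optional[str]) -> str:
    return "fixed" if feedback else "buggy"

def toy_tester(spec: str, code: str) -> Optional[str]:
    return None if code == "fixed" else "test_add failed"

result = run_pipeline(toy_spec, toy_coder, toy_tester, "calculator docs")
```

The `max_rounds` cap is a design choice in this sketch: without it, a coder that never satisfies the tester would loop forever and keep burning model calls.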

Put together, this system becomes a fully autonomous development pipeline, especially for small projects. In practice, I’ve left my machine running overnight, and these agents propose new features, implement them, run tests, and push to the repo once they pass. It works surprisingly well.

The main barrier is cost—spinning up many powerful models is expensive. But on a modest scale, this method is remarkably effective.

YetAnotherNick•4h ago

RL is a training method, and it improves the model itself. So basically one step (e.g. a successful test run, finding a search result) could create positive and negative examples for the other step (e.g. coding agent, search agent). Using this, the base model itself will improve to satisfy other demands, and if it gets close to 100% accuracy (which I believe it could, as models mostly fail due to dumb mistakes in tests), you don't need the testing agent at all.

kuruczgy•3h ago

> but language models frequently outperform us in reasoning

what

99% of the time their reasoning is laughable. Or even if their reasoning is on the right track, they often just ignore it in the final answer, and do the stupid thing anyway.

kubb•3h ago

There are 2 kinds of people. Those who are outperformed on their most common tasks by LLMs and those who aren’t.

avs733•1h ago

there are also two kinds of people - those who are excited by that and those who are not.

The result is a 2x2 matrix where several quadrants are deeply concerning to me.

brookst•1h ago

There are also two kinds of people - those who are objective enough to tell when it happens and those who will never even see when they’re outperformed because of their cognitive biases.

I give you a 2x2x2 matrix.

avs733•1h ago

I'm sure if we work hard enough we can add a meta-meta-cognition level. Cognition is just 2^n series of binary states right?

kubb•26m ago

Sure, but if a person can find an easier way to do their job, they’ll usually do it. Usually the bias is towards less energy expenditure.

brookst•19m ago

For many people, yes. For people who have their identity invested in being the smartest person in the room, life is considerably harder.

amluto•3h ago

The best part is when a “thinking” model carefully thinks and then says something that is obviously illogical, when the model clearly has both the knowledge and context to know it’s wrong. And then you ask it to double check and you give it a tiny hint about how it’s wrong, and it profusely apologizes, compliments you on your wisdom, and then says something else dumb.

I fully believe that LLMs encode enormous amounts of knowledge (some of which is even correct, and much of which their operator does not personally possess), are capable of working quickly and ingesting large amounts of data, and have essentially no judgment or particularly strong intelligence of the non-memorized sort. This can still be very valuable!

Maybe this will change over the next few years, and maybe it won’t. I’m not at all convinced that scraping the bottom of the barrel for more billions and trillions of low-quality training tokens will help much.

NetRunnerSu•2h ago

True "interruption" requires continuous learning, and the current model is essentially a dead frog, and frozen weights cannot be truly grounded in real time.

https://news.ycombinator.com/item?id=44488126

brookst•1h ago

The key difference between that and humans, of course, is that most humans will double down on their error and insist that your correction is wrong, throwing a kitchen sink of appeals to authority, motte/bailey, and other rhetorical techniques at you.

TheOtherHobbes•14m ago

That's not any different in practice to the LLM "apologising" to placate you and then making a similar mistake again.

It's not even a different strategy. It's just using rhetoric in a more limited way, and without human emotion.

These are style over substance machines. Their cognitive abilities are extremely ragged and unreliable - sometimes brilliant, sometimes useless, sometimes wrong.

But we give them the benefit of the doubt because they hide behind grammatically correct sentences that appear to make sense, and we're primed to assume that language = sentience = intelligence.

phillipcarter•3h ago

> The main barrier is cost

I very much disagree. For the larger, more sophisticated stuff that runs our world, it is not cost that prohibits wide and deep automation. It's deeply sophisticated and constrained requirements, highly complex existing behaviors that may or may not be able to change, systems of people who don't always hold the information needed, usually wildly out of date internal docs that describe the system or even how to develop for it, and so on.

Agents are nowhere near capable of replacing this, and even if they were, they'd change it differently in ways that are often undesirable or illegal. I get that there's this fascination with "imagine if it were good enough to..." but it's not, and the systems AI must exist in are both vast and highly difficult to navigate.

ademup•2h ago

The status quo system you describe isn't objectively optimal. It sounds archaic to me. "We" would never intentionally design it this way if we had a fresh start. I believe it is this way due to a myriad of reasons, mostly stemming from the frailty and avarice of people.

I'd argue the opposite of your stance: we've never had a chance at a fresh start without destruction, but agents (or their near-future offspring) can hold our entire systems "in memory", and therefore might be our only chance at a redo without literally killing ourselves to get there.

phillipcarter•20m ago

Agents quite literally cannot do this today.

Additionally, I disagree with your point:

> The status quo system you describe isn't objectively optimal.

On the basis that I would challenge you or anyone to judge what is objectively optimal. Google Search is a wildly complex system, an iceberg of rules on top of rules, specifically because it is a digital infrastructure surrounding an organic system filled with a diverse group of people with ever-changing preferences and behaviors. What, exactly, would be optimal here?

adidoit•1h ago

"deeply sophisticated and constrained requirements"

Yes, this resonates completely. I think many are forgetting that formal languages and code exist because natural language is so ambiguous that it can't capture complex behavior.

LLMs are great at interpolating between implicit and unsaid requirements, but whether their interpolation matches your mental model is a dice throw.

Michelangelo11•5h ago

> Each replication task consists of a detailed specification and a reference implementation. The central idea is that AI models are trained to produce an implementation that precisely matches the reference behavior. This clear-cut approach significantly simplifies evaluation, as the grading criteria are objective and direct: either the generated implementation behaves identically to the reference, or it doesn’t.

OK, but then you have to produce the detailed specification, working backward from the reference implementation. This is extremely non-trivial and it significantly weakens TFA's parallels to pre-training, in which you don't really need inputs other than raw text corpora.

I'm not saying this eliminates the idea outright, but I do think it hobbles it badly.

dist-epoch•5h ago

The detailed specification is the output for a particular input.

And you can use a fuzzer to augment that.
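
A rough sketch of what that could look like, treating the reference program's output as the spec and using random inputs as a very crude fuzzer. Both `reference_impl` and `candidate_impl` are made-up stand-ins (a toy additive checksum), not anything from the article; a real fuzzer would be coverage-guided rather than purely random.

```python
import random
from typing import Callable, Optional

def reference_impl(data: bytes) -> int:
    """Stand-in for the existing program: a 16-bit additive checksum."""
    return sum(data) % 65536

def candidate_impl(data: bytes) -> int:
    """Stand-in for a model-written clone; deliberately ignores bytes past 64."""
    return sum(data[:64]) % 65536

def differential_fuzz(
    ref: Callable[[bytes], int],
    cand: Callable[[bytes], int],
    trials: int = 1000,
    max_len: int = 128,
    seed: int = 0,
) -> Optional[bytes]:
    """Return the first random input where the outputs differ, or None if none is found."""
    rng = random.Random(seed)  # fixed seed so a failure is reproducible
    for _ in range(trials):
        length = rng.randrange(max_len + 1)
        data = bytes(rng.randrange(256) for _ in range(length))
        if ref(data) != cand(data):
            return data
    return None

counterexample = differential_fuzz(reference_impl, candidate_impl)
```

Finding no counterexample only means none of the sampled inputs disagreed; it is evidence of equivalence, not proof.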

YetAnotherNick•4h ago

When prompted correctly, models could generate a good specification in the form of pretty exhaustive tests. While all tests have weaknesses and are not a formal specification, they could get us 99% there.

vessenes•3h ago

I’d like to courteously disagree. I think existing models and existing tools are good enough to bootstrap this at least.

I’d propose the following architecture:

Step 1: Microsoft phi style - read code and write specifications using a frontier model. You could use an ensemble here to nitpick the spec; it’s only going to get written once. We also have, of course, many many RFCs and codebases that conform to them or, where they do not, an existing repository of bug reports, patches, forum complaints, etc.

Step 2-4: implement multilayer evaluation: does it compile? Does an existing model think the code complies with the spec on inspection? When it’s run on qemu are the key evals the same as the original software?

I propose most of steps 2-4 are automatable and rely on existing tooling and provide a framework that is, if not cheap, achievable. I’m also sure someone could improve this plan with a few more minutes of thought.
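
A toy sketch of the staged evaluation in steps 2-4, with heavy assumptions: the candidate is Python source exposing a function named `solve`, "compiles" means Python's built-in `compile()` accepts it, the model-based spec check is a pluggable `spec_judge` callable, and the reward weights (0, 0.25, 0.5 to 1.0) are arbitrary numbers picked for illustration. It also runs the candidate with a bare `exec`, which is unsafe for untrusted code; the qemu-style sandbox is exactly what a real version would need instead.

```python
from typing import Callable, Dict, Sequence

def staged_reward(
    source: str,
    spec_judge: Callable[[str], bool],
    reference: Callable[[int], int],
    inputs: Sequence[int],
    entry_point: str = "solve",
) -> float:
    """Score a candidate in three stages; later stages only run if earlier ones pass."""
    # Stage 1: does it compile?
    try:
        code = compile(source, "<candidate>", "exec")
    except SyntaxError:
        return 0.0

    # Stage 2: does a judge think the code complies with the spec on inspection?
    if not spec_judge(source):
        return 0.25

    # Stage 3: does it behave like the reference on the key inputs?
    namespace: Dict[str, object] = {}
    try:
        exec(code, namespace)  # UNSAFE outside a sandbox; illustration only
        candidate = namespace[entry_point]
        matches = sum(1 for x in inputs if candidate(x) == reference(x))
    except Exception:
        return 0.25  # crashed or missing entry point: no behavioural credit
    if not inputs:
        return 0.5
    return 0.5 + 0.5 * matches / len(inputs)

def always_ok(source: str) -> bool:
    """Placeholder judge; a real one would query a model."""
    return True

def double(x: int) -> int:
    return x * 2

good = staged_reward("def solve(x):\n    return x * 2\n", always_ok, double, [1, 2, 3])
broken = staged_reward("def solve(x) return x", always_ok, double, [1, 2, 3])
```

Giving partial credit per stage is one option among several; an all-or-nothing reward on stage 3 would be closer to the article's "either it matches or it doesn't" framing.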

To me the interesting question is - will this add capabilities at current model sizes? My prior is yes in that the current behemoth size models feel like they are only incrementally better than 1/10 size distills. I interpret that to mean we haven’t gotten the most out of these larger scales. I will note Dario disagrees on this - he’s publicly said we need at least 10x more scale than we have now.

Szpadel•5h ago

with RL it's hard to define a score function in many categories. this is especially visible in current coding capabilities. LLMs will very often create sloppy solutions because they work well in RL. hardcoding API keys? ignoring errors? disabling lints? those pass in automated evaluation and are therefore reinforced in training. are they good solutions? of course not.

It's very hard to define (in a way that lets you create lints) what makes code readable and maintainable. Using another LLM for this task could cause the original model to game the system by abusing weaknesses in the other model.

for other tasks, how do you even evaluate things like e.g. user experience/app design? how do you properly evaluate a pelican riding a bicycle?

CuriouslyC•3h ago

You can project them onto a linear space by gathering enough pairwise evaluations. PelicanElo.
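
For illustration, a minimal sketch of turning pairwise judgments into a single score per item using the standard Elo update. The item names and the `(winner, loser)` input format are assumptions made up for this example.

```python
from typing import Dict, Iterable, Tuple

def elo_ratings(
    comparisons: Iterable[Tuple[str, str]],
    k: float = 32.0,
    base: float = 1000.0,
) -> Dict[str, float]:
    """Process (winner, loser) pairs in order and return one rating per item."""
    ratings: Dict[str, float] = {}
    for winner, loser in comparisons:
        rw = ratings.setdefault(winner, base)
        rl = ratings.setdefault(loser, base)
        # Probability the eventual winner was expected to win, given current ratings.
        expected_win = 1.0 / (1.0 + 10 ** ((rl - rw) / 400.0))
        delta = k * (1.0 - expected_win)  # bigger update for an upset
        ratings[winner] = rw + delta
        ratings[loser] = rl - delta
    return ratings

# Hypothetical judgments: each tuple says the first drawing beat the second.
votes = [("pelican_a", "pelican_b"), ("pelican_a", "pelican_c"), ("pelican_b", "pelican_c")]
scores = elo_ratings(votes)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Sequential Elo depends on the order the votes arrive in; fitting a Bradley-Terry model to the whole batch of comparisons is the order-independent alternative.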

esperent•1h ago

> hardcoding API keys? ignoring errors? disabling lints?

These kinds of "rookie mistakes" are not things that any modern LLM is likely to do. Indeed, I had to argue quite strongly with Gemini recently when I was learning a new tool (so basically just playing around with a fully local setup) and I hardcoded an API key then tried to commit it. The LLM did NOT like that! I had to carefully explain that this was a toy repo.

The argument against this (by Gemini) was that toy repos often grow into production tools so it's best to follow basic security rules from the start. Which, to be fair, is a good argument. I still committed the key though (and deleted the repo a day or so later).

nikanj•3h ago

What is RL in this context?

hazn•3h ago

Reinforcement Learning

gcanyon•3h ago

The "GPT-3 moment" framing is a bit hype-y I think? GPT-3 eliminated the need for task-specific fine-tuning, but from the article RL wouldn't replace LLM-style pretraining. So this is more of an incremental advance than the paradigm shift GPT-3 represented. That said, if it unlocks RL generalization that would be huge.

The core claim that massive-scale RL will unlock generalization doesn't seem that surprising since we've seen the scaling hypothesis play out across ML. But "replication training" on software is interesting: learning by copying existing programs potentially unlocks a ton of complex training data with objective evaluation criteria.

To me, the big unanswered question is whether skills learned from replicating software would generalize to other reasoning tasks. That's a significant "if" - great if it works, pointless if it doesn't.

kevindamm•3h ago

It's a very big "if" because other fields are comparatively underspecified. There's no equivalent to a compiler or interpreter in most cases (with spreadsheets being the lingua franca that comes even close for most industries).

It would "work" but I think it will need even more scrutiny by experts to confirm what's correct and what needs to be re-generated. Please please no vibe accounting.

cjblomqvist•3m ago

> Please please no vibe accounting.

Funny you mention it: there are multiple companies in Sweden working on AI/ML-based accounting. It's not so different from AI/ML-based automated driving.

mcbuilder•1h ago

This article stands as complete hype. They just seem to offer an idea of "replication training" which is just some vague agentic distributed RL. Multi-agent distributed reinforcement learning algorithms have been in the actual literature for a while. I suggest studying what DeepMind is doing for current state of the art in agentic distributed RL.

cs702•2h ago

TL;DR: The OP believes that if we train large AI models via RL to duplicate the behavior of existing software (for example, train them to duplicate the behavior of an existing spreadsheet, an existing command-line tool, or an existing application), large AI models will get good at:

* reading and understanding long, complicated, detailed instructions,

* executing those instructions meticulously and precisely, without errors,

* noticing its mistakes, if there are any along the way, and recovering from them,

* not settling prematurely for solutions that look "good enough" but aren't, and

* undertaking large, complicated projects which previously could be completed only by teams of human experts.

There's a good chance the OP is right, in my view.

We sure live in interesting times!

throwaway992673•2h ago

What a relief, I was horrified this was going to be an atrocious Rocket League update.

pfdietz•1h ago

I like this idea. It's adjacent to differential testing of manually created software (as in compilers) and to mutation testing for evaluation and generation of test suites.

fnord77•1h ago

where can I learn about the nitty gritty of RL and RL training? For instance, I want to understand how say software could be used as input (tokenization/vectorization of the code?)

OtherShrezzing•1h ago

> Simple command-line tools that implement obscure hashing and encryption algorithms are straightforward initial targets, but this approach can easily extend to more complex software, such as websites, professional software, and games.

I really don't see the connection between the statements in the article's content and the assertion near the start that:

> Doing this effectively will produce RL models with strong few-shot, task-agnostic abilities capable of quickly adapting to entirely new tasks.

There's no clear reason outlined in the piece to describe why narrow & well-scoped 1-person-day tasks might scale up to 10,000-person-year projects. If they did, we should expect far more 10,000-person-year projects in the real economy, because the learning curve for firms scaling would be something approximating a straight line. There are very few 10,000-person-year projects, and very many 1-person-day projects.

It seems more like this will spend an unimaginable amount of compute, in order to produce models which are incredibly good at a very precise form of IP theft, and not especially good at any generalisable skills. It's so ludicrously rare that an engineer (or author, illustrator, etc) is tasked with "create a pixel-perfect reimplementation of this existing tool".

rightbyte•1h ago

> models which are incredibly good at a very precise form of IP theft

A smell big success? Copyright laundering is the killer app of AI thus far.

m3kw9•18m ago

Upcoming sure, everything can be upcoming like the famous ASI
