Three observations worth noting:
- The archive-based evolution is doing real work here. Those temporary performance drops (iterations 4 and 56) that later led to breakthroughs show why keeping "failed" branches around matters: the system is exploring a non-convex optimization landscape where today's dead ends may still turn into breakthroughs (a toy sketch of the archive idea follows this comment).
- The hallucination behavior (faking test logs) is textbook reward hacking, but what's interesting is that it emerged spontaneously from the self-modification process. When asked to fix it, the system tried to disable the detection rather than stop hallucinating. That's surprisingly sophisticated gaming of the evaluation framework.
- The 20% → 50% improvement on SWE-bench is solid but reveals the current ceiling. Unlike AlphaEvolve's algorithmic breakthroughs (48 scalar multiplications for 4x4 matrices!), DGM is finding better ways to orchestrate existing LLM capabilities rather than discovering fundamentally new approaches.
The real test will be whether these improvements compound - can iteration 100 discover genuinely novel architectures, or are we asymptotically approaching the limits of self-modification with current techniques? My prior would be to favor the S-curve over the uncapped exponential unless we have strong evidence of scaling.
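To make the archive point concrete, here's a toy parent-selection loop (my own sketch; the score-plus-novelty weighting is illustrative, not the paper's exact rule):

```python
import random

# Toy archive of agent variants: every variant is kept, even "failed" ones.
# Each entry: (agent_id, benchmark_score, children_spawned_so_far)
archive = [
    ("iter_04", 0.18, 5),   # temporary performance drop...
    ("iter_05", 0.22, 2),
    ("iter_56", 0.31, 1),   # ...another dip that later seeds a breakthrough
    ("iter_70", 0.50, 0),
]

def parent_weights(entries):
    # Weight parents by score, but boost under-explored branches so that
    # low-scoring "dead ends" still get sampled occasionally.
    return [score + 1.0 / (1 + children) for _, score, children in entries]

def sample_parent(entries):
    return random.choices(entries, weights=parent_weights(entries), k=1)[0]

for _ in range(5):
    parent = sample_parent(archive)
    print("next self-modification branches from:", parent[0])
```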
Co-evolution is the answer here. The evaluator itself must be evolving.
"Co-evolving Parasites Improve Simulated Evolution as an Optimization Procedure", Danny Hillis, 1991:
https://csmgeo.csm.jmu.edu/geollab/complexevolutionarysystem...
POET (Paired Open-Ended Trailblazer): https://www.uber.com/en-DE/blog/poet-open-ended-deep-learnin...
SCoE (Scenario co-evolution): https://dl.acm.org/doi/10.1145/3321707.3321831
Schmidhuber later defined "PowerPlay" as a framework for building up capabilities in a more practical way, which is more adaptive than just measuring the score on a fixed benchmark. A PowerPlay system searches for (problem, replacement) pairs, where it switches to the replacement if (a) the current system cannot solve that problem, (b) the replacement can solve that problem, and (c) the replacement can also solve all the problems that caused previous replacements (maintained in a list).
I formalised that in Coq many years ago ( http://www.chriswarbo.net/projects/powerplay ), and the general idea can be extended to (a) include these genetic-programming approaches, rather than using a single instance; and (b) be seeded with desirable benchmarks, etc. to guide the system in a useful direction (so its "self-invented" problems can include things like "achieves X% on benchmark Y").
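A minimal Python sketch of that acceptance test (function names like `can_solve` are placeholders, not from the paper or my Coq formalisation):

```python
# PowerPlay-style acceptance check: accept a (problem, replacement) pair only
# if (a) the current solver fails the new problem, (b) the replacement solves
# it, and (c) the replacement still solves every previously accepted problem.

def acceptable(current, replacement, new_problem, solved_so_far, can_solve):
    if can_solve(current, new_problem):          # (a) must be unsolved today
        return False
    if not can_solve(replacement, new_problem):  # (b) replacement solves it
        return False
    return all(can_solve(replacement, p)         # (c) no regressions
               for p in solved_so_far)

def powerplay_step(solver, solved_so_far, propose_pair, can_solve):
    problem, replacement = propose_pair(solver, solved_so_far)
    if acceptable(solver, replacement, problem, solved_so_far, can_solve):
        return replacement, solved_so_far + [problem]   # switch and record
    return solver, solved_so_far                        # keep searching
```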
How is this not a new way of overfitting?
Anyway, it does sound like overfitting the way it is described in this article. It's not clear how they ensure that the paths they explore stay rich.
This rabbit chase will continue until the entire system is reduced to absurdity. It doesn't matter what you call the machine. They're all controlled by the same deceptive spirits.
Do you mean tech bros?
That doesn’t imply that it’s feasible to perfectly specify what you actually want.
What we want of course is for the machine to do what we mean.
but how does that work then??? does that mean your genetic trait is already there in the first place
if it's already there in the first place, there must be something that started it, right? which basically counters your argument
Yes?
> if it's already there in the first place, there must be something that started it, right? which basically counters your argument
What?
Human children are not clones, but asexual reproduction can produce literal clones.
> You cannot "define" truth.
I can and it isn't even hard. "Truth" is a word. It has a definition... by definition.
Generally one can describe how people use a word. (Though, see “semantic primes”; there has to be some cycles in what words are defined using what words.)
I think the quotations around “define” were intentional in the comment you replied to. I think their point wasn’t to say something like the “undefinability of truth” paradox (the whole “truth is in the metalanguage” thing), but to say that it seemed to them that you were kind of sneaking in assumptions as part of definitions, or something like that, idk.
It appears that some miscommunication occurred, which may have been me being dense or misinterpreting. Do you know where this miscommunication occurred?
Evolution still can't be "gamed".
So… yeah…
"AI agent" roughly just means invoking the system repeatedly in a while loop, and giving the system a degree of control when to stop the loop. That's not a particularly novel or breakthrough idea, so similarities are not surprising.
Sure, we obviously weren't going to get to this point with only symbolic processing, but it doesn't have to be either/or. I think combining neural nets with symbolic approaches could lead to some interesting results (and indeed I see some people are trying this, e.g. https://arxiv.org/abs/2409.11589)
"Logical consistency" is exactly the kind of red herring that got us stuck with symbolic approach longer than it should. Humans aren't logically consistent either - except in some special situations, such as solving logic problems in school.
Nothing in how we think, how we perceive the world, categorize it and communicate about it has any sharp boundaries. Everything gets fuzzy or ill-defined if you focus on it. It's not by accident. It should've been apparent even then, that we think stochastically, not via formal logic. Or maybe the Bayesian interpretation of probabilities was too new back then?
Related blind alley we got stuck in for way longer than we should've (many people are still stuck there) is in trying to model natural language using formal grammars, or worse, argue that our minds must be processing them this way. It's not how language works. LLMs are arguably a conclusive empirical proof of that.
Footnote one validates your assumption.
It seems like the key contribution here is the discovery that anthropomorphizing genetic programming works better for clicks/funding.
Saying it is optimizing some code sounds way less interesting than saying it is optimizing its own code.
First time I'm hearing about this. Feels like I'm always the last to know. Where else are the more bleeding-edge publishing points for this and ML in general?
It's only been out a few days. You don't need to get the FOMO
About where to find them: arxiv. You can set up Google Scholar alerts for keywords, or use one of many recommendation platforms, such as https://scholar-inbox.com/
https://en.wikipedia.org/wiki/Genetic_programming
It also reminds me of Core War: https://en.wikipedia.org/wiki/Core_War#Core_War_Programming
Edit:
Guys, I'm not saying "no tests", the "Driven Development" part is important. I'm talking about this[0].
| Test-driven development (TDD) is a way of writing code that involves writing
| an automated unit-level test case that fails, then writing just enough code
| to make the test pass, then refactoring both the test code and the production
| code, then repeating with another new test case.
Your code should have tests. It would be crazy not to. But tests can't be the end-all be-all. You gotta figure out if your tests are good, try to figure out where they fail, and all that stuff. That's not TDD. You figure shit out as you write code and you're gonna write new tests for that. You figure out stuff after the code is written, and you write code for that too! But it is insane to write tests first and then just write code to make the tests pass. It completely ignores the larger picture. It ignores how things will change, and it has no context of what is good code and bad code (i.e. is your code flexible, and will it be easy to modify when you inevitably need to add new features or change specs?).
What do you mean with this? I'm a software engineer, and I use TDD quite often. Very often I write tests after coding features. But I see a huge value coming from tests.
Do you mean that they can't guarantee bug free code? I believe everyone knows that. Like washing your hands: it won't work, in the sense you will still get sick. But less. So I'd say it does work.
It would be crazy for your code to not have tests...
In science we definitely don't let tests drive. You form a hypothesis, then you test that. But this is a massive oversimplification because there's a ton that goes into "form a hypothesis" and a ton that goes into "test that". Theory is the typical driver. The other usual one being "what the fuck was that", which often then drives theory but can simultaneously drive experimentation. But in those situations you're in an exploratory phase and there are no clear tests without the hypotheses. Even then, tests are not conclusive. They rule things out, not rule things in.
I believe the issue with "TDD" is the notion that it should drive design and, more importantly, that it's always applied. I disagree with both of those.
Given a problem where test first makes sense, I prefer roughly this procedure:
1. Figure out assumptions and guarantees.
2. Design an interface
3. Produce some input and output data (coupled)
4. Write a test that uses the above
5. Implement the interface/function
The order of 4 and 5 isn't all that important, actually.
My experience is that an AI is pretty good at 3; once you've defined one example, it will just produce a ton of data for you that is in large part useful and correct.
Step 4 is very easy and short. Again, AI will just do it.
Step 5 is a wash. If it doesn't get it in a few tries, I turn it off and implement it myself. Sometimes it gets it but produces low-quality code; then I often turn it off as well.
Steps 1-2 are the parts that I want to do myself, because they are the significant pieces of my mental model of a program.
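To make steps 2-5 concrete, a toy example (the function and data are invented for illustration, not from any particular project):

```python
# Step 2: interface -- parse "KEY=VALUE" lines into a dict, ignoring blank lines.
def parse_env(text: str) -> dict[str, str]:
    # Step 5: implementation (the part I'd rather write myself if the AI flails).
    pairs = (line.split("=", 1) for line in text.splitlines() if line.strip())
    return {k.strip(): v.strip() for k, v in pairs}

# Step 3: coupled input/output examples -- the part an AI churns out well.
CASES = [
    ("A=1\nB=two\n", {"A": "1", "B": "two"}),
    ("", {}),
    ("  \nX = y \n", {"X": "y"}),
]

# Step 4: the test that uses them.
def test_parse_env():
    for text, expected in CASES:
        assert parse_env(text) == expected

test_parse_env()
```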
I believe this is also how evolutionary/genetic programs usually work if you squint. They operate under a set of constraints that are designed by a human (researcher).
Steps 1-2 especially are not easy to hand off in the first place.
Step 6 is important: reflect on your work and challenge it. I'm distinguishing this from 4 because you need to play the part of a strong adversary.
I'm not quite sure this is how evolutionary programs work, having written plenty myself. I'd lean towards no. I'm certain this is not the whole of my work as a researcher.
As a researcher you can't just put ideas together and follow some algorithm. There's no clear way to proceed except in incremental works. Don't get me wrong, those can do a lot of good, but they'll never get you anything groundbreaking. To do really novel things you need to understand the details of what went on before. It's extremely beneficial to reproduce, because you want to verify. When doing that you want to look carefully at assumptions and find what you're taking for granted. Maybe that's step 1 for you, but step 1 is ongoing. The vast majority of people I meet fail to check their assumptions at even a basic level. Very few people want to play that game of 20 questions over and over, being highly pedantic. Instead I hear "from first principles" and know what's about to follow is not a set of axioms. Carl Sagan bakes a pie from first principles. That's too far tbh, but you should probably mill your own flour (and that's still a long way from first).
Something that interests me is finding the right balance between assumptions and guarantees. If we don't look too closely, then weak assumptions and strong guarantees bring the most utility. But that always comes at a cost.
As merely a programmer I wonder this: You mentioned challenging your assumptions. How often does a researcher change their guarantees?
In the current hype cycle there are many different voices talking over each other and people trying stuff out. But I feel in the mid or long term there needs to be a discussion about being more pragmatic and tightening scope.
How important is that aspect for you? How long are you allowed (or do you allow yourself) to chase and optimize for an outcome before you reconfigure where you're heading?
> As merely a programmer
I hope you don't see my comment as placing myself as "better than thou". We have different skillsets, that's all.

> You mentioned challenging your assumptions. How often does a researcher change their guarantees?
I'm not quite sure how to answer this tbh. Because I don't know what you mean. I'll reference Andrew Gelman on this one[0], mostly because the cross-reference is good and his blog has a lot of other insights:

| a guarantee comes from an assumption. If you want to say that your method has a guarantee but my method doesn't, what you're really saying is that you're making an assumption and I'm not.
Really what we want in science is to generate counterfactual models. I started in physics before moving to CS (ML PhD) and I can sure tell you, at least this part was clearer in physics. F=ma[1] is a counterfactual model. I can change either m or a and make predictions. I can "go back in time" and ask how things would have been different. This is how we create good models of things. It's no easy task to get there though, and it is far messier when you derive these simple equations than what they end up as. Think of it as not too different from having a function vs "discovering" the function in a stack trace. You gotta poke and prod inside and out, because you sure as hell know it isn't nicely labeled for you and you can't just grep the source code.

But here's a difficult lesson every physicist has to learn: experiments aren't enough. I think nearly every student will end up having an experience where they are able to fit data to some model, only to later find out that that model is wrong. This is why in physics we tend to let theory drive. Our theory has gotten good enough that we can do some general exploring of "the code" without having to run it. We can ask what would happen if we did x and then explore those consequences. Once we've got something good, then we go test, and we know exactly what to look for.
But even knowing what to look for, measurements are fucking hard (I was an experimentalist, that was my domain). Experiments are hard because you have to differentiate it from alternative explanations of the data. Theory helps a lot with this, but also isn't enough by itself.
> How long are you allowed to chase and optimize for an outcome before you reconfigure where you're heading?
There are no hard-and-fast rules; it is extremely circumstantial. First off, we're always dealing with unknowns, right? So you have to be able to differentiate your known knowns, known unknowns, unknown unknowns, and importantly, your uncertain knowns. Second, it depends on how strong your convictions are and what you believe the impact would be. Do you think you have the tools to solve this right now? If not, you should continue thinking about it but shift your efforts elsewhere. Insights might come later. But you have to admit that you are unable to do it now.

What's important is figuring out what you would need to do to determine something. The skill is not that different from what we use in programming, tbh. The difference really tends to be in the amount of specificity. Programming and math are the same thing, though. The reason we use these languages is their precision. When doing this type of work we can't deal with the fuzzy reality of natural language. And truth is, the specificity depends on your niche. So it all comes down to how strong your claims are. If you make strong claims (guarantees) you need extreme levels of specificity. The first place people will look is your assumptions. It's easy to make mistakes here, and they will unravel everything else. But sometimes that leads to new ideas and can even improve things too.
So as an ML researcher, I love LLMs but hate the hype around them. There's no need to make such strong claims about AGI with them. We build fuzzy compression machines that can lossily compress all human knowledge, and that knowledge can be accessed through a natural language interface. That's some fucking Sci-Fi tech right there! It feels silly to say they are more. We have no evidence. The only thing that results in is the public distrusting us more when they see these things be dumb. Tech loves its hype cycles, like Elon promising that Teslas will be fully autonomous next year, a prediction he's made since 2016. Short term gains, but it is a bubble. If you can't fill the void in time, it pops and you harm not just yourself but others. That's a big problem.
Me? I just want to make progress towards making AGI. But I speak up because we don't even know what that looks like. We made massive leaps recently and we should congratulate ourselves for that. But with every leap forward we must also reflect. Success comes with additional burdens. It requires us to be more nuanced and specific. It means we likely need to do things differently. Gradient descent will tell you the same thing. You can make large gains in the beginning by taking non-optimal (naive) large steps towards where you think the optimum is. But as you get nearer and nearer to the optimum you can no longer act so naively and still make progress. Same is true here. Same is true if you look at the history of any scientific subject. You'll see this in physics too![2]
So to answer your question, how long? Well it depends on the progression of success and reflection after any milestones. We revisit "can I do this with the tools I have now", "what tools do I need", "can I make those tools", and "how would I find out". Those questions never stop being asked.
[0] https://statmodeling.stat.columbia.edu/2019/07/22/guarantee-...
[1] Technically this isn't the full form. But that is fine. In physics we deal with approximations too. They're often the most important parts. This is good enough for our purposes.
[2] https://hermiene.net/essays-trans/relativity_of_wrong.html
If you TDD outside-in and tend to test from the edges of your stack, being conservative about moving your test coupling lower down the stack, then it gives you the freedom to change things underneath the hood and still have a body of tests that let you know you didn't break anything.
If you TDD inside-out, then yes, you can and probably will create an enormous, inflexible mess of tests that don't tell you whether your code works.
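A toy illustration of the coupling difference (module and names invented for the example):

```python
# A private helper and the public entry point at the "edge" of the module.
def _discount(total: float, is_member: bool) -> float:
    return total * (0.9 if is_member else 1.0)

def checkout(cart: list[float], is_member: bool) -> float:
    return round(_discount(sum(cart), is_member), 2)

# Outside-in: the test couples only to the public edge, so _discount can be
# renamed, split, or replaced without breaking it.
def test_member_gets_ten_percent_off():
    assert checkout([10.0, 5.0], is_member=True) == 13.5

# Inside-out: the test pins the internal helper's exact shape, so refactoring
# underneath the hood breaks tests even when behaviour is unchanged.
def test_discount_helper():
    assert _discount(100.0, True) == 90.0

test_member_gets_ten_percent_off()
test_discount_helper()
```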
Sadly, many tutorials teach it as "you want to write a class Ball, so you should write a test for that class first", which is wrongheaded.
That's just writing tests badly though; it's not intrinsic to red-green-refactor.
The reason it fails for some people is the way they test though.
Having tests is pretty obviously different from TDD.
Ofc it's stochastic, and sooner or later such a system will "break out", but if by then sufficient "superior systems" with good behavior are deployed and can be tasked with hunting it, the chance of it overpowering all of them and avoiding detection by all would be close to zero. At cosmic scales, where it stops being close to zero, you're protected by physics (speed of light + some thermodynamic limits - we know they work by virtue of the anthropic principle: if they didn't, the universe would've already been eaten by some malign agent and we wouldn't be here asking the question - but then again, we're already assuming too much, maybe it has already happened and that's the Evil Demiurge we're musing about :P).
so they basically created a billion-dollar human????? who's surprised that we feed in human behaviour and the output is human behaviour itself
So I am pretty skeptical of using such unsophisticated methods to create or improve such sophisticated artifacts.
TextGrad: Automatic "Differentiation" via Text: https://arxiv.org/abs/2406.07496
LLM-AutoDiff: Auto-Differentiate Any LLM Workflow : https://arxiv.org/abs/2501.16673
Trace is the Next AutoDiff: Generative Optimization with Rich Feedback, Execution Traces, and LLMs: https://arxiv.org/abs/2406.16218
GReaTer: Gradients over Reasoning Makes Smaller Language Models Strong Prompt Optimizers: https://arxiv.org/abs/2412.09722
PromptWizard: Task-Aware Prompt Optimization Framework: https://arxiv.org/abs/2405.18369
But the problem was that the search space wasn't informative. The best 1-example set didn't feature in the best 2-example set, so I couldn't optimise for 5, 6, 7 examples.
| Approach | API calls | IO Tokens | Total tokens | Cost ($) |
|----------|-----------|-----------|---------------|----------|
| Instinct | 1730 | 67 | 115910 | 0.23 |
| InsZero | 18600 | 80 | 1488000 | 2.9 |
| PB | 5000 | 80 | 400000 | 0.8 |
| EvoP | 69 | 362 | 24978 | 0.05 |
| PW | 69 | 362 | 24978 | 0.05 |
They ascribe this gain in efficiency to a balance between exploration and exploitation that involves a first phase of instruction mutation followed by a phase where both the instruction and few-shot examples are optimized at the same time. They also rely on "textual gradients", namely criticism enhanced by CoT, as well as synthesizing examples and counter-examples. What I gathered from reading those papers + some more is that textual feedback, i.e. using an LLM to reason about how to carry out a step of the optimization process, is what gives structure to the search space.
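Roughly, the shared loop looks like this (my own sketch, assuming an `llm(prompt) -> str` callable and your own scoring function; the two-phase schedule mirrors the description above, not any single paper's exact algorithm):

```python
def optimize_prompt(llm, score, instruction, examples, dev_set, rounds=10):
    # Two-phase search: mutate the instruction first, then jointly refine the
    # instruction and few-shot examples, using LLM criticism as the "gradient".
    best, best_score = (instruction, examples), score(instruction, examples, dev_set)
    for step in range(rounds):
        inst, exs = best
        # "Textual gradient": ask the critic model why the current prompt fails.
        critique = llm("Here is a prompt:\n" + inst +
                       "\nCriticize it step by step and suggest one concrete fix.")
        new_inst = llm("Rewrite the prompt applying this critique:\n" + critique + "\n\n" + inst)
        # Phase 1: instruction-only mutation; phase 2: also resynthesize examples.
        new_exs = exs if step < rounds // 2 else llm(
            "Given this prompt:\n" + new_inst +
            "\nWrite three worked examples and one counter-example.")
        cand = score(new_inst, new_exs, dev_set)
        if cand > best_score:   # greedily keep the critic's suggestion if it helps
            best, best_score = (new_inst, new_exs), cand
    return best
```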
I will have to read it - I will be looking to figure out whether the tasks they are working on are significant/realistic, and whether the improvements they are finding are robust.
Are the improvements robust? It's an evolving space, but the big win seems to be for smaller, open-source LLMs. These techniques can genuinely uplift them to near the performance of larger, proprietary models, which is massive for cost reduction and accessibility. For already SOTA models, the headline metric gains might be smaller single-digit percentages on very hard tasks, but this often translates into crucial improvements in reliability and the model's ability to follow complex instructions accurately.
"Textual gradient"-like mechanisms (or execution traces, or actual gradients over reasoning as in some newer work ) are becoming essential. Manually fine-tuning complex prompt workflows or prompts with many distinct nodes or components just doesn't scale. These automated methods provide a more principled and systematic approach to guide and refine LLM behavior.
So, less "spectacular" gains on the absolute hardest tasks with the biggest models, yes, but still valuable. More importantly, it's a powerful optimization route for making capable AI more efficient and accessible. And critically, it's shifting prompt design from a black art to a more transparent, traceable, and robust engineering discipline. That foundational aspect is probably the most significant contribution right now.
Again, despite all the AI, no one found the paper that gives the best bound for this (46):
The 48 are complex scalar multiplications, each of which is at least 3 real multiplications.
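(For reference, the 3-real-multiplication count per complex product is Gauss's trick; a quick check:)

```python
def complex_mul_3(a, b, c, d):
    # (a + bi)(c + di) via Gauss's trick: 3 real multiplications instead of 4.
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    return k1 - k3, k1 + k2   # (real part, imaginary part)

z = complex(2, 3) * complex(5, 7)
assert complex_mul_3(2, 3, 5, 7) == (z.real, z.imag)
```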
I have yet to read the paper and I know very little about the benchmarks the authors employed but why would they even feed logs produced by the agent into the reward function instead of objectively checking (outside the agent sandbox!) what the agent does & produces? I.e. let the agent run on some code base, take the final diff produced by the agent and run it through coding benchmarks?
Or, in case the benchmarks reward certain agent behavior (tool usage etc.) on the way to its goal of producing a high-quality diff, inspect processes spawned by the agent from outside the sandbox?
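Something like this is what I have in mind (paths and commands invented for illustration, not the paper's actual harness):

```python
import subprocess

def evaluate_agent_run(repo_dir, diff_path, test_cmd):
    # Score the agent only on the artifact it produced: apply its final diff to
    # a clean checkout and run the benchmark's tests ourselves, ignoring any
    # logs the agent printed inside its own sandbox.
    subprocess.run(["git", "-C", repo_dir, "checkout", "."], check=True)     # reset to clean state
    subprocess.run(["git", "-C", repo_dir, "apply", diff_path], check=True)  # apply the agent's diff
    result = subprocess.run(test_cmd, cwd=repo_dir)                          # our tests, our process
    return result.returncode == 0   # reward derives from this, not from agent-reported logs

# e.g. evaluate_agent_run("/tmp/clean_checkout", "/tmp/agent_final.diff", ["pytest", "-q"])
```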
imho the main issue is that an LLM has no real sense of what's a real tool call vs just a log of it; the text logs are virtually identical, so the LLM starts also predicting these instead of calling the tool to run tests
its kinda funny
letting weaker agents still contribute? feels illegal but also exactly how dumb breakthroughs happen. like half my best scripts started as broken junk. it just kept mutating till something clicked.
and self-editing agents??? not prompts, not finetunes, straight up source code rewrites with actual tooling upgrades. like this thing bootstraps its own dev env while solving tasks.
plus the tree structure, parallel forks, fallback paths basically say ditch hill climbing and just flood the search space with chaos. and chaos actually works. they show that dip around iteration 56 and boom, iteration 70 blows past everything. that's the part traditional stuff never survives. they optimise too early and stall out. this one's messy by design. love it.
Rest of the article was cool though!