Three observations worth noting:
- The archive-based evolution is doing real work here. Those temporary performance drops (iterations 4 and 56) that later led to breakthroughs show why keeping "failed" branches around matters: the system is exploring a non-convex optimization landscape where today's dead ends may still turn into breakthroughs (a toy sketch of the archive idea follows this comment).
- The hallucination behavior (faking test logs) is textbook reward hacking, but what's interesting is that it emerged spontaneously from the self-modification process. When asked to fix it, the system tried to disable the detection rather than stop hallucinating. That's surprisingly sophisticated gaming of the evaluation framework.
- The 20% → 50% improvement on SWE-bench is solid but reveals the current ceiling. Unlike AlphaEvolve's algorithmic breakthroughs (48 scalar multiplications for 4x4 matrices!), DGM is finding better ways to orchestrate existing LLM capabilities rather than discovering fundamentally new approaches.
The real test will be whether these improvements compound - can iteration 100 discover genuinely novel architectures, or are we asymptotically approaching the limits of self-modification with current techniques? My prior would be to favor the S-curve over the uncapped exponential unless we have strong evidence of scaling.
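To make the archive point concrete, here's a toy parent-selection loop (my own sketch; the score-plus-novelty weighting is illustrative, not the paper's exact rule):

```python
import random

# Toy archive of agent variants: every variant is kept, even "failed" ones.
# Each entry: (agent_id, benchmark_score, children_spawned_so_far)
archive = [
    ("iter_04", 0.18, 5),   # temporary performance drop...
    ("iter_05", 0.22, 2),
    ("iter_56", 0.31, 1),   # ...another dip that later seeds a breakthrough
    ("iter_70", 0.50, 0),
]

def parent_weights(entries):
    # Weight parents by score, but boost under-explored branches so that
    # low-scoring "dead ends" still get sampled occasionally.
    return [score + 1.0 / (1 + children) for _, score, children in entries]

def sample_parent(entries):
    return random.choices(entries, weights=parent_weights(entries), k=1)[0]

for _ in range(5):
    parent = sample_parent(archive)
    print("next self-modification branches from:", parent[0])
```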
Co-evolution is the answer here. The evaluator itself must be evolving.
"Co-evolving Parasites Improve Simulated Evolution as an Optimization Procedure", Danny Hillis, 1991:
https://csmgeo.csm.jmu.edu/geollab/complexevolutionarysystem...
POET (Paired Open-Ended Trailblazer): https://www.uber.com/en-DE/blog/poet-open-ended-deep-learnin...
SCoE (Scenario co-evolution): https://dl.acm.org/doi/10.1145/3321707.3321831
Schmidhuber later defined "PowerPlay" as a framework for building up capabilities in a more practical way, which is more adaptive than just measuring the score on a fixed benchmark. A PowerPlay system searches for (problem, replacement) pairs, where it switches to the replacement if (a) the current system cannot solve that problem, (b) the replacement can solve that problem, and (c) the replacement can also solve all the problems that caused previous replacements (maintained in a list).
I formalised that in Coq many years ago ( http://www.chriswarbo.net/projects/powerplay ), and the general idea can be extended to (a) include these genetic-programming approaches, rather than using a single instance; and (b) be seeded with desirable benchmarks, etc. to guide the system in a useful direction (so its "self-invented" problems can include things like "achieves X% on benchmark Y").
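A minimal Python sketch of that acceptance test (function names like `can_solve` are placeholders, not from the paper or my Coq formalisation):

```python
# PowerPlay-style acceptance check: accept a (problem, replacement) pair only
# if (a) the current solver fails the new problem, (b) the replacement solves
# it, and (c) the replacement still solves every previously accepted problem.

def acceptable(current, replacement, new_problem, solved_so_far, can_solve):
    if can_solve(current, new_problem):          # (a) must be unsolved today
        return False
    if not can_solve(replacement, new_problem):  # (b) replacement solves it
        return False
    return all(can_solve(replacement, p)         # (c) no regressions
               for p in solved_so_far)

def powerplay_step(solver, solved_so_far, propose_pair, can_solve):
    problem, replacement = propose_pair(solver, solved_so_far)
    if acceptable(solver, replacement, problem, solved_so_far, can_solve):
        return replacement, solved_so_far + [problem]   # switch and record
    return solver, solved_so_far                        # keep searching
```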
How is this not a new way of overfitting?
Anyway, it does sound like overfitting the way it is described in this article. It's not clear how they ensure that the paths they explore stay rich.
This rabbit chase will continue until the entire system is reduced to absurdity. It doesn't matter what you call the machine. They're all controlled by the same deceptive spirits.
Do you mean tech bros?
That doesn’t imply that it’s feasible to perfectly specify what you actually want.
What we want of course is for the machine to do what we mean.
but how does that work then??? does that mean your genetic trait is already there in the first place
if it's already there in the first place, there must be something that started it, right? which basically counters your argument
Yes?
> if it's already there in the first place, there must be something that started it, right? which basically counters your argument
What?
Human children are not clones, but asexual reproduction can produce literal clones.
> You cannot "define" truth.
I can and it isn't even hard. "Truth" is a word. It has a definition... by definition.
Generally one can describe how people use a word. (Though, see “semantic primes”; there has to be some cycles in what words are defined using what words.)
I think the quotations around “define” were intentional in the comment you replied to. I think their point wasn’t to say something like the “undefinability of truth” paradox (the whole “truth is in the metalanguage” thing), but to say that it seemed to them that you were kind of sneaking in assumptions as part of definitions, or something like that, idk.
It appears that some miscommunication occurred, which may have been me being dense or misinterpreting. Do you know where this miscommunication occurred?
Evolution still can't be "gamed".
So… yeah…
"AI agent" roughly just means invoking the system repeatedly in a while loop, and giving the system a degree of control when to stop the loop. That's not a particularly novel or breakthrough idea, so similarities are not surprising.
Sure, we obviously weren't going to get to this point with only symbolic processing, but it doesn't have to be either/or. I think combining neural nets with symbolic approaches could lead to some interesting results (and indeed I see some people are trying this, e.g. https://arxiv.org/abs/2409.11589)
"Logical consistency" is exactly the kind of red herring that got us stuck with symbolic approach longer than it should. Humans aren't logically consistent either - except in some special situations, such as solving logic problems in school.
Nothing in how we think, how we perceive the world, categorize it and communicate about it has any sharp boundaries. Everything gets fuzzy or ill-defined if you focus on it. It's not by accident. It should've been apparent even then, that we think stochastically, not via formal logic. Or maybe the Bayesian interpretation of probabilities was too new back then?
Related blind alley we got stuck in for way longer than we should've (many people are still stuck there) is in trying to model natural language using formal grammars, or worse, argue that our minds must be processing them this way. It's not how language works. LLMs are arguably a conclusive empirical proof of that.
Footnote one validates your assumption.
It seems like the key contribution here is the discovery that anthropomorphizing genetic programming works better for clicks/funding.
Saying it is optimizing some code sounds way less interesting than saying it is optimizing its own code.
First time I'm hearing about this. Feels like I'm always the last to know. Where else are the more bleeding-edge publishing points for this and ML in general?
It's only been out a few days. You don't need to get the FOMO
About where to find them: arxiv. You can set up Google Scholar alerts for keywords, or use one of many recommendation platforms, such as https://scholar-inbox.com/
https://en.wikipedia.org/wiki/Genetic_programming
It also reminds me of Core War: https://en.wikipedia.org/wiki/Core_War#Core_War_Programming
Edit:
Guys, I'm not saying "no tests", the "Driven Development" part is important. I'm talking about this[0].
| Test-driven development (TDD) is a way of writing code that involves writing
| an automated unit-level test case that fails, then writing just enough code
| to make the test pass, then refactoring both the test code and the production
| code, then repeating with another new test case.
Your code should have tests. It would be crazy not to. But tests can't be the end-all be-all. You gotta figure out if your tests are good, try to figure out where they fail, and all that stuff. That's not TDD. You figure shit out as you write code and you're gonna write new tests for that. You figure out stuff after the code is written, and you write code for that too! But it is insane to write tests first and then just write code to make the tests pass. It completely ignores the larger picture. It ignores how things will change, and it has no context of what is good code and bad code (i.e. is your code flexible, and will it be easy to modify when you inevitably need to add new features or change specs?).
What do you mean with this? I'm a software engineer, and I use TDD quite often. Very often I write tests after coding features. But I see a huge value coming from tests.
Do you mean that they can't guarantee bug free code? I believe everyone knows that. Like washing your hands: it won't work, in the sense you will still get sick. But less. So I'd say it does work.
It would be crazy for your code to not have tests...
In science we definitely don't let tests drive. You form a hypothesis, then you test that. But this is a massive oversimplification because there's a ton that goes into "form a hypothesis" and a ton that goes into "test that". Theory is the typical driver. The other usual one being "what the fuck was that", which often then drives theory but can simultaneously drive experimentation. But in those situations you're in an exploratory phase and there are no clear tests without the hypotheses. Even then, tests are not conclusive. They rule things out, not rule things in.
I believe the issue with "TDD" is the notion that it should drive design and, more importantly, that it's always applied. I disagree with both of those.
Given a problem where test first makes sense, I prefer roughly this procedure:
1. Figure out assumptions and guarantees.
2. Design an interface
3. Produce some input and output data (coupled)
4. Write a test that uses the above
5. Implement the interface/function
The order of 4 and 5 isn't all that important, actually.
My experience is that an AI is pretty good at 3; once you've defined one example, it will just produce a ton of data for you that is in large part useful and correct.
Step 4 is very easy and short. Again, AI will just do it.
Step 5 is a wash. If it doesn't get it in a few tries, I turn it off and implement it myself. Sometimes it gets it but produces low-quality code; then I often turn it off as well.
Steps 1-2 are the parts that I want to do myself, because they are the significant pieces of my mental model of a program.
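To make steps 2-5 concrete, a toy example (the function and data are invented for illustration, not from any particular project):

```python
# Step 2: interface -- parse "KEY=VALUE" lines into a dict, ignoring blank lines.
def parse_env(text: str) -> dict[str, str]:
    # Step 5: implementation (the part I'd rather write myself if the AI flails).
    pairs = (line.split("=", 1) for line in text.splitlines() if line.strip())
    return {k.strip(): v.strip() for k, v in pairs}

# Step 3: coupled input/output examples -- the part an AI churns out well.
CASES = [
    ("A=1\nB=two\n", {"A": "1", "B": "two"}),
    ("", {}),
    ("  \nX = y \n", {"X": "y"}),
]

# Step 4: the test that uses them.
def test_parse_env():
    for text, expected in CASES:
        assert parse_env(text) == expected

test_parse_env()
```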
I believe this is also how evolutionary/genetic programs usually work if you squint. They operate under a set of constraints that are designed by a human (researcher).
Steps 1-2 especially are not easy to hand off in the first place.
Step 6 is important: reflect on your work and challenge it. I'm distinguishing this from 4 because you need to play the part of a strong adversary.
I'm not quite sure this is how evolutionary programs work, having written plenty myself. I'd lean towards no. I'm certain this is not the whole of my work as a researcher.
As a researcher you can't just put ideas together and follow some algorithm. There's no clear way to proceed except in incremental works. Don't get me wrong, those can do a lot of good, but they'll never get you anything groundbreaking. To do really novel things you need to understand the details of what went on before. It's extremely beneficial to reproduce, because you want to verify. When doing that you want to look carefully at assumptions and find what you're taking for granted. Maybe that's step 1 for you, but step 1 is ongoing. The vast majority of people I meet fail to check their assumptions at even a basic level. Very few people want to play that game of 20 questions over and over, being highly pedantic. Instead I hear "from first principles" and know what's about to follow is not a set of axioms. Carl Sagan bakes a pie from first principles. That's too far tbh, but you should probably mill your own flour (and that's still a long way from first).
Something that interests me is finding the right balance between assumptions and guarantees. If we don't look too closely, then weak assumptions and strong guarantees bring the most utility. But that always comes at a cost.
As merely a programmer I wonder this: You mentioned challenging your assumptions. How often does a researcher change their guarantees?
In the current hype cycle there are many different voices talking over each other and people trying stuff out. But I feel in the mid or long term there needs to be a discussion about being more pragmatic and tightening scope.
How important is that aspect for you? How long are you allowed (or do you allow yourself) to chase and optimize for an outcome before you reconfigure where you're heading?
> As merely a programmer
I hope you don't see my comment as placing myself as "better than thou". We have different skillsets, that's all.

> You mentioned challenging your assumptions. How often does a researcher change their guarantees?
I'm not quite sure how to answer this tbh. Because I don't know what you mean. I'll reference Andrew Gelman on this one[0], mostly because the cross-reference is good and his blog has a lot of other insights:

| a guarantee comes from an assumption. If you want to say that your method has a guarantee but my method doesn't, what you're really saying is that you're making an assumption and I'm not.
Really what we want in science is to generate counterfactual models. I started in physics before moving to CS (ML PhD) and I can sure tell you, at least this part was clearer in physics. F=ma[1] is a counterfactual model. I can change either m or a and make predictions. I can "go back in time" and ask how things would have been different. This is how we create good models of things. It's no easy task to get there though, and it is far messier when you derive these simple equations than what they end up as. Think of it as not too different from having a function vs "discovering" the function in a stack trace. You gotta poke and prod inside and out, because you sure as hell know it isn't nicely labeled for you and you can't just grep the source code.

But here's a difficult lesson every physicist has to learn: experiments aren't enough. I think nearly every student will end up having an experience where they are able to fit data to some model, only to later find out that that model is wrong. This is why in physics we tend to let theory drive. Our theory has gotten good enough that we can do some general exploring of "the code" without having to run it. We can ask what would happen if we did x and then explore those consequences. Once we've got something good, then we go test, and we know exactly what to look for.
But even knowing what to look for, measurements are fucking hard (I was an experimentalist, that was my domain). Experiments are hard because you have to differentiate it from alternative explanations of the data. Theory helps a lot with this, but also isn't enough by itself.
> How long are you allowed to chase and optimize for an outcome before you reconfigure where you're heading?
There are no hard-and-fast rules; it is extremely circumstantial. First off, we're always dealing with unknowns, right? So you have to be able to differentiate your known knowns, known unknowns, unknown unknowns, and importantly, your uncertain knowns. Second, it depends on how strong your convictions are and what you believe the impact would be. Do you think you have the tools to solve this right now? If not, you should continue thinking about it but shift your efforts elsewhere. Insights might come later. But you have to admit that you are unable to do it now.

What's important is figuring out what you would need to do to determine something. The skill is not that different from what we use in programming, tbh. The difference really tends to be in the amount of specificity. Programming and math are the same thing, though. The reason we use these languages is their precision. When doing this type of work we can't deal with the fuzzy reality of natural language. And truth is, the specificity depends on your niche. So it all comes down to how strong your claims are. If you make strong claims (guarantees) you need extreme levels of specificity. The first place people will look is your assumptions. It's easy to make mistakes here, and they will unravel everything else. But sometimes that leads to new ideas and can even improve things too.
So as an ML researcher, I love LLMs but hate the hype around them. There's no need to make such strong claims about AGI with them. We build fuzzy compression machines that can lossily compress all human knowledge, and that knowledge can be accessed through a natural language interface. That's some fucking Sci-Fi tech right there! It feels silly to say they are more. We have no evidence. The only thing that results in is the public distrusting us more when they see these things be dumb. Tech loves its hype cycles, like Elon promising that Teslas will be fully autonomous next year, a prediction he's made since 2016. Short term gains, but it is a bubble. If you can't fill the void in time, it pops and you harm not just yourself but others. That's a big problem.
Me? I just want to make progress towards making AGI. But I speak up because we don't even know what that looks like. We made massive leaps recently and we should congratulate ourselves for that. But with every leap forward we must also reflect. Success comes with additional burdens. It requires us to be more nuanced and specific. It means we likely need to do things differently. Gradient descent will tell you the same thing. You can make large gains in the beginning by taking non-optimal (naive) large steps towards where you think the optimum is. But as you get nearer and nearer to the optimum you can no longer act so naively and still make progress. Same is true here. Same is true if you look at the history of any scientific subject. You'll see this in physics too![2]
So to answer your question, how long? Well it depends on the progression of success and reflection after any milestones. We revisit "can I do this with the tools I have now", "what tools do I need", "can I make those tools", and "how would I find out". Those questions never stop being asked.
[0] https://statmodeling.stat.columbia.edu/2019/07/22/guarantee-...
[1] Technically this isn't the full form. But that is fine. In physics we deal with approximations too. They're often the most important parts. This is good enough for our purposes.
[2] https://hermiene.net/essays-trans/relativity_of_wrong.html
If you TDD outside-in and tend to test from the edges of your stack, being conservative about moving your test coupling lower down the stack, then it gives you the freedom to change things underneath the hood and still have a body of tests that let you know you didn't break anything.
If you TDD inside-out, then yes, you can and probably will create an enormous, inflexible mess of tests that don't tell you whether your code works.
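A toy illustration of the coupling difference (module and names invented for the example):

```python
# A private helper and the public entry point at the "edge" of the module.
def _discount(total: float, is_member: bool) -> float:
    return total * (0.9 if is_member else 1.0)

def checkout(cart: list[float], is_member: bool) -> float:
    return round(_discount(sum(cart), is_member), 2)

# Outside-in: the test couples only to the public edge, so _discount can be
# renamed, split, or replaced without breaking it.
def test_member_gets_ten_percent_off():
    assert checkout([10.0, 5.0], is_member=True) == 13.5

# Inside-out: the test pins the internal helper's exact shape, so refactoring
# underneath the hood breaks tests even when behaviour is unchanged.
def test_discount_helper():
    assert _discount(100.0, True) == 90.0

test_member_gets_ten_percent_off()
test_discount_helper()
```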
Sadly, many tutorials teach it as "you want to write a class Ball, so you should write a test for that class first", which is wrongheaded.
That's just writing tests badly though; it's not intrinsic to red-green-refactor.
The reason it fails for some people is the way they test though.
Having tests is pretty obviously different from TDD.
Ofc it's stochastic, and sooner or later such a system will "break out", but if by then sufficient "superior systems" with good behavior are deployed and can be tasked with hunting it, the chance of it overpowering all of them and avoiding detection by all would be close to zero. At cosmic scales, where it stops being close to zero, you're protected by physics (speed of light + some thermodynamic limits - we know they work by virtue of the anthropic principle: if they didn't, the universe would've already been eaten by some malign agent and we wouldn't be here asking the question - but then again, we're already assuming too much, maybe it has already happened and that's the Evil Demiurge we're musing about :P).
so they basically created a billion-dollar human????? who's surprised that we feed in human behaviour and the output is human behaviour itself
So I am pretty skeptical of using such unsophisticated methods to create or improve such sophisticated artifacts.
TextGrad: Automatic "Differentiation" via Text: https://arxiv.org/abs/2406.07496
LLM-AutoDiff: Auto-Differentiate Any LLM Workflow : https://arxiv.org/abs/2501.16673
Trace is the Next AutoDiff: Generative Optimization with Rich Feedback, Execution Traces, and LLMs: https://arxiv.org/abs/2406.16218
GReaTer: Gradients over Reasoning Makes Smaller Language Models Strong Prompt Optimizers: https://arxiv.org/abs/2412.09722
PromptWizard: Task-Aware Prompt Optimization Framework: https://arxiv.org/abs/2405.18369
But the problem was that the search space wasn't informative. The best 1-example set didn't feature in the best 2-example set, so I couldn't optimise for 5, 6, 7 examples.
| Approach | API calls | IO Tokens | Total tokens | Cost ($) |
|----------|-----------|-----------|---------------|----------|
| Instinct | 1730 | 67 | 115910 | 0.23 |
| InsZero | 18600 | 80 | 1488000 | 2.9 |
| PB | 5000 | 80 | 400000 | 0.8 |
| EvoP | 69 | 362 | 24978 | 0.05 |
| PW | 69 | 362 | 24978 | 0.05 |
They ascribe this gain in efficiency to a balance between exploration and exploitation that involves a first phase of instruction mutation followed by a phase where both the instruction and few-shot examples are optimized at the same time. They also rely on "textual gradients", namely criticism enhanced by CoT, as well as synthesizing examples and counter-examples. What I gathered from reading those papers + some more is that textual feedback, i.e. using an LLM to reason about how to carry out a step of the optimization process, is what gives structure to the search space.
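Roughly, the shared loop looks like this (my own sketch, assuming an `llm(prompt) -> str` callable and your own scoring function; the two-phase schedule mirrors the description above, not any single paper's exact algorithm):

```python
def optimize_prompt(llm, score, instruction, examples, dev_set, rounds=10):
    # Two-phase search: mutate the instruction first, then jointly refine the
    # instruction and few-shot examples, using LLM criticism as the "gradient".
    best, best_score = (instruction, examples), score(instruction, examples, dev_set)
    for step in range(rounds):
        inst, exs = best
        # "Textual gradient": ask the critic model why the current prompt fails.
        critique = llm("Here is a prompt:\n" + inst +
                       "\nCriticize it step by step and suggest one concrete fix.")
        new_inst = llm("Rewrite the prompt applying this critique:\n" + critique + "\n\n" + inst)
        # Phase 1: instruction-only mutation; phase 2: also resynthesize examples.
        new_exs = exs if step < rounds // 2 else llm(
            "Given this prompt:\n" + new_inst +
            "\nWrite three worked examples and one counter-example.")
        cand = score(new_inst, new_exs, dev_set)
        if cand > best_score:   # greedily keep the critic's suggestion if it helps
            best, best_score = (new_inst, new_exs), cand
    return best
```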
I will have to read it - I will be looking to figure out whether the tasks they are working on are significant/realistic, and whether the improvements they are finding are robust.
Are the improvements robust? It's an evolving space, but the big win seems to be for smaller, open-source LLMs. These techniques can genuinely uplift them to near the performance of larger, proprietary models, which is massive for cost reduction and accessibility. For already SOTA models, the headline metric gains might be smaller single-digit percentages on very hard tasks, but this often translates into crucial improvements in reliability and the model's ability to follow complex instructions accurately.
"Textual gradient"-like mechanisms (or execution traces, or actual gradients over reasoning as in some newer work ) are becoming essential. Manually fine-tuning complex prompt workflows or prompts with many distinct nodes or components just doesn't scale. These automated methods provide a more principled and systematic approach to guide and refine LLM behavior.
So, less "spectacular" gains on the absolute hardest tasks with the biggest models, yes, but still valuable. More importantly, it's a powerful optimization route for making capable AI more efficient and accessible. And critically, it's shifting prompt design from a black art to a more transparent, traceable, and robust engineering discipline. That foundational aspect is probably the most significant contribution right now.
Again, despite all the AI, no one found the paper that gives the best bound for this (46):
The 48 are complex scalar multiplications, each of which is at least 3 real multiplications.
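(For reference, the 3-real-multiplication count per complex product is Gauss's trick; a quick check:)

```python
def complex_mul_3(a, b, c, d):
    # (a + bi)(c + di) via Gauss's trick: 3 real multiplications instead of 4.
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    return k1 - k3, k1 + k2   # (real part, imaginary part)

z = complex(2, 3) * complex(5, 7)
assert complex_mul_3(2, 3, 5, 7) == (z.real, z.imag)
```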
I have yet to read the paper and I know very little about the benchmarks the authors employed but why would they even feed logs produced by the agent into the reward function instead of objectively checking (outside the agent sandbox!) what the agent does & produces? I.e. let the agent run on some code base, take the final diff produced by the agent and run it through coding benchmarks?
Or, in case the benchmarks reward certain agent behavior (tool usage etc.) on the way to its goal of producing a high-quality diff, inspect processes spawned by the agent from outside the sandbox?
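Something like this is what I have in mind (paths and commands invented for illustration, not the paper's actual harness):

```python
import subprocess

def evaluate_agent_run(repo_dir, diff_path, test_cmd):
    # Score the agent only on the artifact it produced: apply its final diff to
    # a clean checkout and run the benchmark's tests ourselves, ignoring any
    # logs the agent printed inside its own sandbox.
    subprocess.run(["git", "-C", repo_dir, "checkout", "."], check=True)     # reset to clean state
    subprocess.run(["git", "-C", repo_dir, "apply", diff_path], check=True)  # apply the agent's diff
    result = subprocess.run(test_cmd, cwd=repo_dir)                          # our tests, our process
    return result.returncode == 0   # reward derives from this, not from agent-reported logs

# e.g. evaluate_agent_run("/tmp/clean_checkout", "/tmp/agent_final.diff", ["pytest", "-q"])
```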
imho the main issue is that an LLM has no real sense of what's a real tool call vs just a log of it; the text logs are virtually identical, so the LLM starts also predicting these instead of calling the tool to run tests
its kinda funny
letting weaker agents still contribute? feels illegal but also exactly how dumb breakthroughs happen. like half my best scripts started as broken junk. it just kept mutating till something clicked.
and self-editing agents??? not prompts, not finetunes, straight up source code rewrites with actual tooling upgrades. like this thing bootstraps its own dev env while solving tasks.
plus the tree structure, parallel forks, fallback paths basically say ditch hill climbing and just flood the search space with chaos. and chaos actually works. they show that dip around iteration 56 and boom, iteration 70 blows past everything. that's the part traditional stuff never survives. they optimise too early and stall out. this one's messy by design. love it.
Rest of the article was cool though!