The language is called Cursed.
We were curious to see whether we could do away with IMPLEMENTATION_PLAN.md for this kind of task.
If we actually want stuff that works, we need to come up with a new process. If you get "almost good" code from a single invocation, you're just going to get a lot of almost-good code from a loop. What we likely need is a Cucumber-esque format with example tables for requirements that we can distill an AI to use. It will build the tests first and then build the code to pass the tests.
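A minimal sketch of what that could look like (the example table, the parse_price function, and the pricing scenario are all made up for illustration, not from this project):

    # Hypothetical: a Cucumber-style example table for one requirement,
    # encoded as pytest parameters. The loop would generate tests like this
    # from the table first, then write code until they pass.
    import pytest

    def parse_price(raw: str) -> float:
        # Stand-in for the code the loop would then be asked to produce.
        if raw.lower() == "free":
            return 0.0
        return float(raw.lstrip("$€").replace(",", ""))

    @pytest.mark.parametrize(
        "raw, expected",              # columns of the example table
        [
            ("$1,234.50", 1234.50),   # | $1,234.50 | 1234.50 |
            ("€99",       99.00),     # | €99       | 99.00   |
            ("free",      0.00),      # | free      | 0.00    |
        ],
    )
    def test_parse_price(raw, expected):
        assert parse_price(raw) == pytest.approx(expected)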
Like back in the day, being brought in to “just fix” an amalgam of a FoxPro-, Excel-, and Access-based ERP that “mostly works” and only “occasionally corrupts all our data”, which ambitious salespeople put together over the last 5 years.
But worse - because “ambitious salespeople” will no longer be constrained by the sandboxes of Excel or Access - they will ship multi-cloud, edge-deployed Kubernetes microservices wired up with Kafka, and it will be harder to find someone to talk to who can explain what they were trying to do at the time.
Ok, now that is funny! On so many levels.
Now, for the project itself, a few thoughts:
- this was tried before: about 1.5 years ago there was a project set up to spam GitHub with lots of "paper implementations", but it was based on GPT-3.5 or 4 or something, and almost nothing worked. Their results are much better.
- surprised it worked as well as it did with simple prompts. "Probably we're overcomplicating stuff". Yeah, probably.
- weird copyright / IP questions all around. This will be a minefield.
- Lots of SaaS products are screwed. Not from this, but from this + 10 engineers in every midsized company. NIH is now justified.
Yeah, we're in weird territory, because you can use an LLM as a Bitcoin mixer for intellectual property. That's the entire point/meaning behind https://ghuntley.com/z80.
You can take something that exists, distill it back to specs, and then you've got your own IP. Throw away the tainted IP, and then just run Ralph in a loop. You're able to clone things (not 100%, but it's better than hiring humans).
except you don't
AI output isn't copyrighted in the US.
Is Unix “small sharp tools” going away? Is that a relic of having to write everything in x86, and are we now just finally hitting the end of that arc?
Now I do a calculus with dependencies: do I want to track the upstream? Is the rigging around the core I want valuable? Is it well maintained? If not, just port it and move on.
Is that... the first recorded instance of an AI committing suicide?
One of the providers (I think it was Anthropic) added some kind of token (or MCP tool?) for the AI to bail on the whole conversation as a safety measure. And it does use it, so clearly it's not trying to self-preserve.
"This business will get out of control. It will get out of control and we'll be lucky to live through it."
The Alexandrian solution to the halting problem.
Python and Typescript are elaborate formal languages that emerged from a lengthy process of development involving thousands of people around the world over many years. They are non-trivially different, and it's neat that we can port a library from one to the other quasi-automatically.
The difficulty, from an economic perspective, is that the "agent" workflow dramatically alters the cognitive demands during the initial development process. It is plain to see that the developers who prompted an LLM to generate this library will not have the same familiarity with the resulting code that they would have had they written it directly.
For some economic purposes, this altering of cognitive effort, and the dramatic diminution of its duration, probably doesn't matter.
But my hunch is that most of the economic value of code is contingent on there being a set of human beings familiar with the code in a manner that requires having written it directly.
Denial of this basic reality was an economic problem even before LLMs: how often did churn in a development team result in a codebase that no one could maintain, undermining the long-term prospects of a firm?
https://news.ycombinator.com/item?id=42592543
Great read overall, an interesting challenge to the conception that at its core, programming is about producing code.
This reminds me of a software engineering axiom:
    When making software, remember that it is a snapshot of your understanding of the problem. It states to all, including your future self, your approach, clarity, and appropriateness of the solution for the problem at hand.
Isn't this the exact opposite of every other piece of advice we've gotten in the past year?
Another bit of general feedback, just recently: someone said we need to generate 10 times, because one out of those will be "worth reviewing".
How can anyone be doing real engineering in such a setup: pick the exact needle out of the constantly churning chaos-simulation engine, choosing whichever one (crashes least, is closest to what you wanted, is human-readable, or is a random guess)?
I kind of agree that picking from 10 poorly-prompted projects is dumb.
The engineering is in setting up the engine and the verification so that one agent can get it right (or 90% right) in a single run (of the infinite-ish loop).
They're almost certainly referring to first creating a fleshed-out spec and then having it implement that, rather than just 100 words.
There are probably big oversights or errors in that short explanation. The LLM engine, the runner of the engine, and the specifics of the environment overlap a lot, and all of it is quite complicated.
hth
You want to go meta-meta? Get Ralph to spawn subagents that analyze the process of how feedback and experimentation with techniques works. Perhaps allocate 10% of the time and effort to identifying what's missing that would make the loops more effective (better context, better tooling, better feedback mechanisms, better prompts, ...?). Have the tooling help produce actionable ideas for how humans in the loop can effectively help the tooling. Have the tooling produce information and guidelines for how to review the generated code.
I think one of the big things missing in many of the tools currently available is tracking metrics through the entire software development loop. How long does it take to implement a feature? How many mistakes were made? How many errors were caught by tests? How many tokens does it take? And then using this information to automatically self-tune.
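For concreteness, a rough sketch of the kind of per-iteration record I mean (the field names and the JSONL logging are my own invention, not any existing tool's schema):

    # Hypothetical per-iteration metrics for an agent loop; just the shape of
    # the data you would want before any automatic self-tuning is possible.
    import json
    import time
    from dataclasses import dataclass, asdict, field

    @dataclass
    class IterationMetrics:
        feature: str
        started_at: float = field(default_factory=time.time)
        wall_seconds: float = 0.0      # how long the iteration took
        tokens_used: int = 0           # prompt + completion tokens
        test_failures: int = 0         # errors caught by the test suite
        reverted_commits: int = 0      # mistakes that had to be rolled back
        cost_usd: float = 0.0

    def log_iteration(m: IterationMetrics, path: str = "metrics.jsonl") -> None:
        # Append one JSON line per iteration; later runs can read these back
        # and adjust prompts, context size, or model choice accordingly.
        with open(path, "a") as f:
            f.write(json.dumps(asdict(m)) + "\n")

    log_iteration(IterationMetrics(feature="parse-price", wall_seconds=412.0,
                                   tokens_used=183_000, test_failures=3,
                                   cost_usd=1.90))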
I would be scared to run this without knowing the exact cost.
It's not a good idea to do it without a payment cap, for sure; it's a new way to wake up with a huge bill the next day.
> We spent a little less than $800 on inference for the project. Overall the agents made ~1100 commits across all software projects. Each Sonnet agent costs about $10.50/hour to run overnight.
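Back-of-the-envelope from those figures (assuming they're all-in): $800 / $10.50 per agent-hour ≈ 76 agent-hours of Sonnet, and $800 / ~1100 commits ≈ $0.73 per commit.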
Apparently one of the lucky few who learned this special technique from Geoff just completed a $50k contract for $297. But that's not all! Geoff is generous enough to share the special secret prompt that unlocked this unbelievable success, if only we subscribe to his newsletter! "This free-for-life offer won't last forever!"
I am sceptical.
In any case, the writing style of that entire blog is off-putting. Gibberish from a massive ego.
That is pretty awesome and not something I would have expected from an agent; it hints (but does not prove) that it has some awareness of its own workings.