(1) Let the LLM randomly perturb the system.
(2) Measure the system's performance.
(3a) If the perturbation improved performance, keep the change.
(3b) Otherwise, don't.
(4) Repeat.
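In other words, greedy hill-climbing with the LLM as the mutation operator. A minimal runnable sketch of that loop; `llm_perturb` and `measure` are hypothetical stand-ins (a toy parameter tweak and a toy score), not anything from the linked repo:

```python
import random

def llm_perturb(params):
    # Hypothetical stand-in for the LLM: propose one random change.
    p = dict(params)
    key = random.choice(list(p))
    p[key] += random.uniform(-1.0, 1.0)
    return p

def measure(params):
    # Hypothetical benchmark: higher is better (here, closeness to a target).
    return -sum((v - 3.0) ** 2 for v in params.values())

def hill_climb(params, steps=200):
    best = measure(params)                 # (2) baseline measurement
    for _ in range(steps):
        candidate = llm_perturb(params)    # (1) random perturbation
        score = measure(candidate)         # (2) measure the candidate
        if score > best:                   # (3a) improvement: keep the change
            params, best = candidate, score
        # (3b) otherwise the candidate is discarded; (4) repeat
    return params

print(hill_climb({"x": 0.0, "y": 10.0}))
```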
[1] https://github.com/karpathy/autoresearch

> The agent did not know that would also halve the LUT count. It found out by doing it and watching the synthesizer.
So I guess this is an example of an LLM anthropomorphizing and making wild conjectures about the internal workings of a different LLM.
A fantastic opportunity to become the next next big thing and write a verifier verifier.
At the hypothesized inflection point where AI instantly performs exactly as commanded, what happens to heavily regulated industries like medicine? Do we get huge leaps and bounds everywhere EXCEPT where it matters, or is regulation going to be handed over to a verifier verifier?
OP's post is basically pointing out what many others have independently discovered: your agent-based dev operation is only as good as the test rituals and guardrails you give the agents.
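One concrete reading of "guardrails": the agent never merges its own change; a harness applies the patch in a disposable copy of the repo and the test suite is the sole acceptance criterion. A hedged sketch under assumed tooling (`pytest` and `git apply`; `gated_apply` and `tests_pass` are names invented for illustration):

```python
import os
import shutil
import subprocess
import tempfile

def tests_pass(workdir):
    # Guardrail: the test suite is the sole acceptance criterion.
    return subprocess.run(["pytest", "-q"], cwd=workdir).returncode == 0

def gated_apply(repo_dir, patch_file):
    # Apply the agent's patch in a throwaway copy; accept only if tests pass.
    sandbox = tempfile.mkdtemp(prefix="agent-sandbox-")
    try:
        shutil.copytree(repo_dir, sandbox, dirs_exist_ok=True)
        subprocess.run(["git", "apply", os.path.abspath(patch_file)],
                       cwd=sandbox, check=True)
        return tests_pass(sandbox)  # caller merges into repo_dir only on True
    finally:
        shutil.rmtree(sandbox)
```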
sho_hn•1h ago
Nice detail on the encountered failures. I've had very similar experiences with my own loops against test suites.
Great post. A snapshot in time.