kadam2576•49m ago
Results are real, but the setup is doing a lot of work. Every win here (scheduling, kernels, chip design) is in a domain with well-defined automated metrics and years of prior optimization. That's the ideal case for evolutionary search. The question isn't whether it works at Google, it's how much comes from the agent vs. the evaluation infrastructure wrapped around it.
stalfie•37m ago
Well, if the evaluation infrastructure is something humans could have had access to before, and the agent's key "skill" is just that it's a more patient and scalable worker, I would still argue that this "comes from the agent".
Humans get bored, impatient, or run out of time, and so often give up at what they perceive to be a decent "local minimum". Early verification harnesses using GPT-4 to optimize robot reward functions succeeded largely because the LLM just kept going (link below). As long as it is too boring for a human to use the same evaluation infrastructure, this is still an agent skill.
https://arxiv.org/abs/2310.12931
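
The loop being described is roughly this: a proposer (the LLM in the real systems) mutates the current best candidate, an automated harness scores it, and the agent simply never stops. A minimal sketch, with a toy quadratic objective standing in for the expensive evaluation harness and a random perturbation standing in for the LLM proposer (all names here are illustrative, not from any of the linked systems):

```python
import random

def evaluate(params):
    # Stand-in for the automated evaluation harness: a toy objective
    # maximized at (3, -1). Real systems would run a simulator,
    # compiler, or benchmark suite here.
    x, y = params
    return -(x - 3) ** 2 - (y + 1) ** 2

def propose(best_params, step=0.5):
    # Stand-in for the LLM proposing a variant of the current best
    # candidate. The real "skill" under discussion is that this step
    # can be called millions of times without getting bored.
    x, y = best_params
    return (x + random.uniform(-step, step),
            y + random.uniform(-step, step))

def tireless_search(iterations=5000, seed=0):
    random.seed(seed)
    best = (0.0, 0.0)
    best_score = evaluate(best)
    for _ in range(iterations):
        candidate = propose(best)
        score = evaluate(candidate)
        if score > best_score:  # keep only strict improvements
            best, best_score = candidate, score
    return best, best_score
```

The point of the thread maps onto the loop bound: a human operator stops after a few hundred iterations at a decent local optimum; the agent runs the same loop indefinitely against the same harness.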