One of those rare papers where the code speaks for itself. They run a number of comparisons, but the most salient pits Karpathy's autoresearch loop (reproduced verbatim, as best I can tell) against standard HPO algorithms -- and so far the Tree-structured Parzen Estimator still wins out, but only just barely!
More interesting, though, is that the best results come from 'centaur' approaches, where an LLM is hooked up to a standard HPO. Somewhere around a 1:3 LLM:HPO control split seems to work best, with more LLM control degrading performance. Either way, the hybrid far outperforms both the naive autoresearch loop and the bare HPO approach.
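The control-flow of that centaur split can be sketched roughly as follows. This is a toy illustration only: `llm_propose` is a hypothetical stand-in for prompting a model with the optimizer state, and the full CMA-ES update is replaced with a simplified (1+1)-ES step-size rule, since the point here is just the ~30% LLM / ~70% optimizer interleaving.

```python
import random

# Toy objective standing in for a validation metric (maximize; optimum at 0).
def objective(x):
    return -sum(xi * xi for xi in x)

# Hypothetical stand-in for the LLM proposal step: a real system would
# prompt a model with the optimizer's internal state and recent trials.
def llm_propose(mean, history):
    best_x, _ = max(history, key=lambda h: h[1])
    # Simple heuristic: move halfway from the current mean toward the best trial.
    return [m + 0.5 * (b - m) for m, b in zip(mean, best_x)]

def centaur_search(dim=3, trials=60, llm_ratio=0.3, seed=0):
    rng = random.Random(seed)
    mean = [rng.uniform(-2, 2) for _ in range(dim)]
    sigma = 1.0
    history = []
    for _ in range(trials):
        if history and rng.random() < llm_ratio:
            x = llm_propose(mean, history)                    # ~30% LLM control
        else:
            x = [m + sigma * rng.gauss(0, 1) for m in mean]   # ES-style sample
        score = objective(x)
        history.append((x, score))
        # Simplified (1+1)-ES update in place of the full CMA-ES machinery:
        _, best_score = max(history, key=lambda h: h[1])
        if score >= best_score:
            mean, sigma = x, sigma * 1.1   # success: accept and expand step
        else:
            sigma *= 0.98                  # failure: shrink step
    return max(history, key=lambda h: h[1])

best_x, best_score = centaur_search()
```

With a higher `llm_ratio`, the loop leans more on the (here, crude) proposal heuristic and less on the stochastic search, which mirrors the paper's finding that the optimizer should retain majority control.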
achierius•1h ago
> Centaur outperformed all methods including CMA-ES alone by using the LLM on only 30% of trials. The LLM receives CMA-ES's full internal state (mean vector, step-size, covariance matrix), the top-5 configurations, and the last 20 trials. A 0.8B LLM already suffices to outperform all classical and pure LLM methods. Scaling from 0.8B (0.9766) to 27B (0.9763) to Gemini Pro (0.9767) yields no improvement, suggesting a capability plateau [which Claude slightly beats]
> We ablate the LLM ratio: higher ratios degrade performance, confirming that CMA-ES should retain majority control.