I can’t imagine letting an agent try everything the LLM chatbot recommended ($$$). The recommendations often include very poorly maintained or niche libraries that have quite a lot written about them but, as far as I can tell, very limited use in real production environments.
On the other hand, we have domain expert “consultants” in our leadership’s ears making equally absurd recommendations that we constantly have to disprove. Maybe an agent can keep those consultants occupied and let us do our work in peace.
A lot depends on whether it is expensive to you. I use Claude Code for the smallest of whims and rarely run out of tokens on my Max plan.
For all the folks spending a lot of time and energy setting up MCP servers, AGENTS.md files, etc.: I think this shows that the LLM cannot do what AI boosters are selling it as, and that it needs extreme amounts of guidance to reach a desired goal, if it can reach it at all. This is not an argument that the tech has no value. It can clearly be useful in certain situations, but that is not what OpenAI/Anthropic/Perplexity are selling, and I don’t think the actual use cases have a sustainable business model.
People who spend the energy to tailor LLMs to their specific workflows and get them to be successful: amazing. But does this scale? What happens if you don’t have massive amounts of money subsidizing the training and infrastructure? What’s the actual value proposition without all this money propping it up?
I found LLMs make a fabulous frontend for git :-D
This is, of course, only applicable if doing a single test is relatively fast. In my work a single test can take half a day, so I'd rather not let an agent spend a whole night doing a bogus test.
I started looking at Kaggle again, and autoresearch seems to converge on the same vibe as many of the solutions there.
Wild ensembles, squeezing out a bit more loss. More engineering than research, IMO.
If you're resource unconstrained then BO should ofc do very well though.
I wrote up some more notes on that here: https://simonwillison.net/2026/Mar/13/liquid/
It’s certainly cool, but the optimizations are so basic that I’d expect a performance engineer to find these within a day or two with some flame graphs and profiling.
So cheaper than a performance engineer for a day or two... but the Shopify CEO's own time is likely a whole lot more expensive than a regular engineer!
Good lens.
The crux of the autoresearch repo is basically one file, program.md, a system prompt that can be summarized as “do this in a loop: improve train.py, run the training, run evals, record the result. Favor simplicity.” The other files are an arbitrary ML model that is being trained.
[1] https://github.com/ykumards/eCLIP/commits/main/autoresearch
i.e. perhaps minimal changes to autoresearch are enough to steer it toward cost-effective research.
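The loop that program.md describes could be sketched roughly like this; the function names and the log format here are my own placeholders, not the repo’s actual structure:

```python
def autoresearch_loop(source, improve, train_and_eval, iterations=3):
    """Do this in a loop: improve the script, run training/evals, record result."""
    log = []
    for i in range(iterations):
        source = improve(source)          # LLM rewrites train.py
        result = train_and_eval(source)   # run the training, then the evals
        log.append({"iteration": i, "loss": result})
    return source, log

if __name__ == "__main__":
    # Stub example: pretend each "improvement" shaves a bit of loss.
    improve = lambda src: src + "\n# tweak"
    train_and_eval = lambda src: 1.0 / (1 + src.count("# tweak"))
    final, log = autoresearch_loop("print('train')", improve, train_and_eval)
    print(log)
```

In the real repo the `improve` step would be an LLM call and `train_and_eval` a subprocess invocation; the point is just that the whole system is a short loop plus a prompt.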
The bottleneck in AI/ML/DL is always data (volume & quality) or compute.
Does/can autoresearch help improve large-scale datasets? Is it more compute-efficient than humans?
There always are. You need to think about what those would be, though. Autoresearch outsources the thinking to LLMs.
Non-parametric optimization is not a new idea. I guess the hype is partly because people hope it will be less brute force now.
I recall reading about a stochastic one years ago: <https://github.com/StanfordPL/stoke>
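The core of that style of stochastic search is tiny: randomly mutate a candidate and keep the mutation if it scores better. This is a greedy toy version; STOKE itself uses MCMC-style acceptance that also allows occasional regressions:

```python
import random

def stochastic_search(candidate, mutate, cost, steps=1000, rng=None):
    """Greedy stochastic search: propose a random mutation, keep it if cheaper."""
    rng = rng or random.Random(0)
    best, best_cost = candidate, cost(candidate)
    for _ in range(steps):
        proposal = mutate(best, rng)
        c = cost(proposal)
        if c < best_cost:   # greedy acceptance; MCMC would sometimes accept worse
            best, best_cost = proposal, c
    return best
```

For program optimization the hard part is, of course, the `mutate` and `cost` functions (semantics-preserving edits, performance measurement), not the loop itself.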
I wonder if the next step in "autoX" is to have an LLM generate dozens of candidates on a cluster and then get an LLM to figure out how to "mate" the two best ones or something. Trying to do this with regular evolutionary/genetic algorithms has always been challenging: how do you represent the gene-to-phenotype mapping? Let an LLM sort it out, working just with the phenotypes (Lamarckian inheritance).
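A minimal sketch of that idea, with the LLM standing in for the crossover operator and working directly on phenotypes; `llm_merge` is a hypothetical call, not any specific API:

```python
def evolve(population, score, llm_merge, generations=5):
    """Each generation: rank candidates, ask an LLM to 'mate' the two best,
    and add the offspring back into the pool."""
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        parent_a, parent_b = population[0], population[1]
        child = llm_merge(parent_a, parent_b)  # LLM decides how to combine them
        population.append(child)
    return max(population, key=score)
```

No genome encoding is needed because the LLM operates on the candidates themselves (e.g. two source files), which is exactly what makes this Lamarckian rather than classically genetic.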
This has been the standard approach for more complex LLM deployments for a while now in our shop.
Using different models across iterations is also something I've found useful in my own experiments. It's like getting a fresh pair of eyes.
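A trivial way to do this is to round-robin over a model list between iterations; the model names here are placeholders:

```python
from itertools import cycle, islice

def model_schedule(models, iterations):
    """Round-robin over the model list so consecutive iterations
    use a different model (a fresh pair of eyes each time)."""
    return list(islice(cycle(models), iterations))

# model_schedule(["model-a", "model-b"], 5)
# -> ["model-a", "model-b", "model-a", "model-b", "model-a"]
```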