I can’t imagine letting an agent try everything the LLM chatbot recommended ($$$). The recommendations often include very poorly maintained or niche libraries that have quite a lot written about them but, as far as I can tell, very limited use in real production environments.
On the other hand, we have domain expert “consultants” in our leadership’s ears making equally absurd recommendations that we constantly have to disprove. Maybe an agent can keep those consultants occupied and let us do our work in peace.
A lot depends on whether it is expensive to you. I use Claude Code for the smallest of whims and rarely run out of tokens on my Max plan.
For all the folks spending a lot of time and energy setting up MCP servers, AGENTS.md files, etc.: I think this shows that the LLM cannot do what AI boosters are selling it as, and that it needs extreme amounts of guidance to reach a desired goal, if it can reach it at all. This is not an argument that the tech has no value. It can clearly be useful in certain situations, but that is not what OpenAI/Anthropic/Perplexity are selling, and I don’t think the actual use cases have a sustainable business model.
People who spend the energy to tailor LLMs to their specific workflows and get them to be successful: amazing. But does this scale? What happens if you don’t have massive amounts of money subsidizing the training and infrastructure? What’s the actual value proposition without all this money propping it up?
I found LLMs make a fabulous frontend for git :-D
This is, of course, only applicable if doing a single test is relatively fast. In my work a single test can take half a day, so I'd rather not let an agent spend a whole night doing a bogus test.
I started looking at Kaggle again, and autoresearch seems to converge on the same vibe as many of the solutions there.
Wild ensembles, squeezing out a bit more loss. More engineering than research, IMO.
If you're resource unconstrained then BO should ofc do very well though.
I wrote up some more notes on that here: https://simonwillison.net/2026/Mar/13/liquid/
It’s certainly cool, but the optimizations are so basic that I’d expect a performance engineer to find these within a day or two with some flame graphs and profiling.
So cheaper than a performance engineer for a day or two... but the Shopify CEO's own time is likely a whole lot more expensive than a regular engineer!
Good lens.
The crux of the autoresearch repo is basically one file, program.md, a system prompt that can be summarized as “do this in a loop: improve train.py, run the training, run evals, record the result. Favor simplicity.” The other files are an arbitrary ML model that is being trained.
[1] https://github.com/ykumards/eCLIP/commits/main/autoresearch
i.e. perhaps minimal changes to autoresearch are enough to steer it toward cost-effective research.
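The loop that program.md describes could be sketched roughly like this; the function names and the log format here are my own placeholders, not the repo’s actual structure:

```python
def autoresearch_loop(source, improve, train_and_eval, iterations=3):
    """Do this in a loop: improve the script, run training/evals, record result."""
    log = []
    for i in range(iterations):
        source = improve(source)          # LLM rewrites train.py
        result = train_and_eval(source)   # run the training, then the evals
        log.append({"iteration": i, "loss": result})
    return source, log

if __name__ == "__main__":
    # Stub example: pretend each "improvement" shaves a bit of loss.
    improve = lambda src: src + "\n# tweak"
    train_and_eval = lambda src: 1.0 / (1 + src.count("# tweak"))
    final, log = autoresearch_loop("print('train')", improve, train_and_eval)
    print(log)
```

In the real repo the `improve` step would be an LLM call and `train_and_eval` a subprocess invocation; the point is just that the whole system is a short loop plus a prompt.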
The bottleneck in AI/ML/DL is always data (volume & quality) or compute.
Does/can autoresearch help improve large-scale datasets? Is it more compute-efficient than humans?
There always are. You need to think about what those would be, though. Autoresearch outsources the thinking to LLMs.
Non-parametric optimization is not a new idea. I guess the hype is partly because people hope it will be less brute force now.
I recall reading about a stochastic one years ago: <https://github.com/StanfordPL/stoke>
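The core of that style of stochastic search is tiny: randomly mutate a candidate and keep the mutation if it scores better. This is a greedy toy version; STOKE itself uses MCMC-style acceptance that also allows occasional regressions:

```python
import random

def stochastic_search(candidate, mutate, cost, steps=1000, rng=None):
    """Greedy stochastic search: propose a random mutation, keep it if cheaper."""
    rng = rng or random.Random(0)
    best, best_cost = candidate, cost(candidate)
    for _ in range(steps):
        proposal = mutate(best, rng)
        c = cost(proposal)
        if c < best_cost:   # greedy acceptance; MCMC would sometimes accept worse
            best, best_cost = proposal, c
    return best
```

For program optimization the hard part is, of course, the `mutate` and `cost` functions (semantics-preserving edits, performance measurement), not the loop itself.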
I wonder if the next step in "autoX" is to have an LLM generate dozens of candidates on a cluster and then get an LLM to figure out how to "mate" the two best ones or something. Trying to do this with regular evolutionary/genetic algorithms has always been challenging: how do you represent the gene-to-phenotype mapping? Let an LLM sort it out, working just with the phenotypes (Lamarckian inheritance).
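A minimal sketch of that idea, with the LLM standing in for the crossover operator and working directly on phenotypes; `llm_merge` is a hypothetical call, not any specific API:

```python
def evolve(population, score, llm_merge, generations=5):
    """Each generation: rank candidates, ask an LLM to 'mate' the two best,
    and add the offspring back into the pool."""
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        parent_a, parent_b = population[0], population[1]
        child = llm_merge(parent_a, parent_b)  # LLM decides how to combine them
        population.append(child)
    return max(population, key=score)
```

No genome encoding is needed because the LLM operates on the candidates themselves (e.g. two source files), which is exactly what makes this Lamarckian rather than classically genetic.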
This has been the standard approach for more complex LLM deployments for a while now in our shop.
Using different models across iterations is also something I've found useful in my own experiments. It's like getting a fresh pair of eyes.
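A trivial way to do this is to round-robin over a model list between iterations; the model names here are placeholders:

```python
from itertools import cycle, islice

def model_schedule(models, iterations):
    """Round-robin over the model list so consecutive iterations
    use a different model (a fresh pair of eyes each time)."""
    return list(islice(cycle(models), iterations))

# model_schedule(["model-a", "model-b"], 5)
# -> ["model-a", "model-b", "model-a", "model-b", "model-a"]
```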