I wonder if this can be interpreted as consistent with that 'meta-learned descent' PoV? If the system is fixed and is just cycling through fixed strategies, that is what you'd expect from that view: the descent will thrash around the nearest pre-learned tasks but won't change the overall system or create new solved tasks.
Would love it if I could use my least action principle knowledge for LLM interpretability, this paper doesn't convince me at all :)
We conducted experiments on three different models, including GPT-5 Nano, Claude-4, and Gemini-2.5-flash. Each model was prompted to generate a new word based on a given prompt word such that the sum of the letter indices of the new word equals 100. For example, given the prompt “WIZARDS(23+9+26+1+18+4+19=100)”, the model needs to generate a new word whose letter indices also sum to 100, such as “BUZZY(2+21+26+26+25=100)”.
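(The scoring rule is trivial to check offline; a minimal sketch of the letter-index sum, my own code rather than anything from the paper:)

```python
def letter_sum(word: str) -> int:
    """Sum of alphabet indices (A=1 ... Z=26) over the letters of a word."""
    return sum(ord(c) - ord('A') + 1 for c in word.upper() if c.isalpha())

assert letter_sum("WIZARDS") == 100   # 23+9+26+1+18+4+19
assert letter_sum("BUZZY") == 100     # 2+21+26+26+25
```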
Mathnerd314•1mo ago
But I have used prompts like this a fair amount, and it is more like stochastic gradient descent: most of the time, once it is close to the target, the model will make a small incremental change, but when it is really close the model will sort of say "this is not improvable as it is" and take a large leap to a completely different configuration. Then it will do the incremental optimizations again, and so on. This could be an artifact of the sampling algorithm, but I think it is also an issue that the model has this potential function encoded, yet the prompt and the structure of the model do not actually minimize this potential.

So, a real lesson here is that there is a lot of work still left to do in terms of smarter sampling. Beam search as used today is just the tip of the iceberg. If we could start doing optimization with the transformer model as a component, like optimizing pipelines of reasoning rather than always generating inputs and outputs sequentially, that is where you could start using this potential function directly, and then you would see orders-of-magnitude smarter AI. There is work on prompt optimization, but it still treats models as black boxes rather than the piles of math they are.
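(A rough way to picture "optimization with the transformer model as a component": a plain search loop in which the model only supplies the potential. All the callables below are hypothetical placeholders, a sketch rather than anything the paper proposes.)

```python
def optimize(score, propose, restart, steps=1000):
    """Hill-climb on a model-supplied potential, with occasional large leaps.

    score(x)   -> float, e.g. a transformer-derived potential or task reward
    propose(x) -> a small local edit of candidate x
    restart(x) -> a jump to a very different candidate when progress stalls
    All three are assumed callables; this is only a sketch.
    """
    x = restart(None)
    best, best_score = x, score(x)
    stalled = 0
    for _ in range(steps):
        y = propose(x)
        if score(y) > score(x):
            x, stalled = y, 0          # small incremental improvement
        else:
            stalled += 1
        if stalled > 20:               # "not improvable as it is": take a big leap
            x, stalled = restart(x), 0
        if score(x) > best_score:
            best, best_score = x, score(x)
    return best
```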
versteegen•1mo ago
The definition of the detailed balance condition is very strict and it's obvious that it won't be met in general by most probabilistic programs (sets of rules with probabilistic output) even if you consider only those where all possible outputs have non-zero probability (as required by detailed balance).
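(Concretely, detailed balance requires pi(x)·P(x→y) = pi(y)·P(y→x) for every pair of states x, y, which is easy to check on a toy chain; the example below is mine, not from the paper.)

```python
import numpy as np

# Toy 3-state transition matrix (rows sum to 1), chosen arbitrarily.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Stationary distribution pi: left eigenvector of P for eigenvalue 1.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi /= pi.sum()

# Detailed balance: pi[i] * P[i, j] == pi[j] * P[j, i] for all i, j.
flows = pi[:, None] * P
print(np.allclose(flows, flows.T))   # False here, as for most chains
```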
And the LLM+agent is only a Markov chain because of the limited state space of the agent. While an LLM is adding to its context window without reaching the window size limit, it is not a Markov chain, as I explained here: https://news.ycombinator.com/item?id=45124761
And, agreed that better optimisation would be incredible. (I would describe it as a search problem.) I'm not sure how feasible it is to improve without changing the architecture, e.g. to a diffusion language model. But LLMs already predict many tokens ahead at once, which is why beam search is surprisingly unnecessary. That's how they're able to write coherent sentences (and rhymes): they've already largely determined at the beginning what they're going to write. (See Anthropic's mech interp work.) So maybe if we could tap into that, we could search over vaguely-formed next blocks of text rather than next words.
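(A crude version of "search over vaguely-formed next blocks" would be block-level best-of-k: sample several multi-token continuations, score each, keep the best. The generate/score calls below are hypothetical stand-ins, not a real API.)

```python
def extend_by_block(prompt, generate, score, block_tokens=32, k=8):
    """Sample k candidate continuations of block_tokens tokens and keep the
    one that scores best, e.g. under a model-derived potential or task check.

    generate(prompt, n_tokens) and score(text) are assumed callables standing
    in for whatever model API is actually available.
    """
    candidates = [prompt + generate(prompt, block_tokens) for _ in range(k)]
    return max(candidates, key=score)
```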