I suppose we just don't have a deeper underlying theory to lean on and help us 'design' anything.
I often find, if I've got a complicated solution, it’s because I haven’t fully examined the problem.
We really need to develop better tools to understand what's happening inside these NNs. Working with high-D spaces is not something we're good at, and we're basically throwing stuff at it and seeing if it sticks.
Title should be: Simple Self-Distillation Improves Code Generation
Many computer science paper titles allude to past titles in other CS papers.
Calling it “cringe worthy” is unnecessarily mean. There is context and history you don’t understand.
There are two distinct billions. https://en.wikipedia.org/wiki/Billion
> Code interleaves fork positions, where several continuations are genuinely plausible and may correspond to different solution approaches, with lock positions, where syntax and semantics leave little ambiguity but a low-probability distractor tail still remains… The best global decoding setting is therefore necessarily a compromise; we call this tension the precision-exploration conflict.
In other words, just like us, the model needs to shift from "exploration" in "fork" mode (divergent thinking to produce a creative solution) to "precision" in "lock" mode (producing syntactically correct code).
What this paper shows is that their simple technique (SSD) can improve the ranking of optimal tokens in both lock and fork positions, meaning the model is more likely to explore when it should be exploring, and more likely to be precise when it needs to be.
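The fork/lock distinction above can be illustrated with a toy sketch: entropy of the next-token distribution is high at fork positions and low at lock positions, which is why one global sampling temperature is necessarily a compromise. The numbers below are made up for illustration, not taken from the paper.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    return float(-(p * np.log2(p + 1e-12)).sum())

# Fork position: several continuations are genuinely plausible.
fork = [0.30, 0.25, 0.20, 0.15, 0.10]

# Lock position: syntax leaves little ambiguity, but a
# low-probability distractor tail still remains.
lock = [0.95, 0.02, 0.01, 0.01, 0.01]

print(f"fork entropy: {entropy(fork):.2f} bits")  # high: explore
print(f"lock entropy: {entropy(lock):.2f} bits")  # low: be precise
```

A temperature tuned for the fork case samples the distractor tail too often at locks; one tuned for locks collapses the genuine alternatives at forks.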
I love that we're still learning the emergent properties of LLMs!
Purely artistic creation, producing something that does not yet exist and cannot be derived from what already does, means that locking can be more diffuse, less settled.
TBH, this is (very much my opinion btw) the least surprising thing. LLMs (and especially their emergent properties) are still black boxes. Humans have been studying the human brain for millennia, and we are barely better at predicting how humans work (or, e.g., to what extent free will is a thing). Hell, the emergent properties of traffic were not understood or given proper attention, even though every researcher, as a driver, knows what a driver does. Right now, on the front page, is this post:
> 14. Claude Code Found a Linux Vulnerability Hidden for 23 Years (mtlynch.io)
So it's pretty cool we're learning new things about LLMs, sure, but it's barely surprising that we're still learning it.
(Sorry, mini grumpy man rant over. I just wish we knew more of the world but I know that's not realistic.)
I got unstuck by randomizing the field order for each row at training time?!? And now I'm thinking I should do the same at inference time...
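A minimal sketch of the per-row field shuffling described above; the field names and data are invented for illustration.

```python
import random

def shuffle_fields(row, rng):
    """Return the row's (field, value) pairs in a random order,
    so the model never learns to rely on a fixed field position."""
    items = list(row.items())
    rng.shuffle(items)
    return items

rng = random.Random(0)  # seeded for reproducibility
row = {"name": "ada", "age": 36, "city": "london"}

# Each training epoch (or each row) sees a different field order.
print(shuffle_fields(row, rng))
```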
"Simple Self-Distillation". We already had that acronym for Solid-State Drive. I don't know about the technique, but the naming sure sounds... simple?
That already looks like Sonnet 3.x and 4 level capabilities to me, where the model in question (Gemma 4) sets up a whole Python project with a UI and installs Python libraries using uv, etc.
Add this Simple Self-Distillation to the picture, and by 2028 I see cheaper coding-model providers with much more generous usage limits, and power users mostly running their own models anyway.
Anyone using these models as "non-deterministic transpilers" from natural language to code (experienced engineers who can write code themselves) would probably not be paying any AI providers.
Right now it feels like hammering a house onto a nail instead of the other way around.
So you prompt the base model for an answer and then rerun the prompt with the answer from the first run?
They use self-distillation to shift the output distribution of the model towards that of the same model, but running with different temperature/truncation settings in sampling.
This effectively "folds" the logit tail truncation behavior into the model itself.
In what it does, it's not entirely unlike a few "model-controlled sampling settings" approaches I've seen, but it differs in execution.
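A toy sketch of what "folding the truncation into the model" could look like, assuming the teacher distribution is the same model's softmax output after top-p truncation and the student is trained to match it. The hyperparameters and training loop here are placeholders, not the paper's actual recipe.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def top_p_truncate(p, top_p=0.9):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, zero out the tail, and renormalize."""
    order = np.argsort(p)[::-1]
    cum = np.cumsum(p[order])
    keep = order[: int(np.searchsorted(cum, top_p)) + 1]
    q = np.zeros_like(p)
    q[keep] = p[keep]
    return q / q.sum()

def kl(p, q, eps=1e-12):
    """KL(p || q), skipping zero-probability entries of p."""
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / (q[mask] + eps))).sum())

logits = np.array([2.0, 1.5, 0.5, -1.0, -3.0])  # toy next-token logits
student = softmax(logits)           # the model's raw distribution
teacher = top_p_truncate(student)   # same model, truncated sampling
loss = kl(teacher, student)         # distillation objective: shrink this

print(f"KL(teacher || student) = {loss:.4f}")
```

Minimizing this KL over training data pushes the model's raw softmax toward what truncated sampling would have produced, so at inference time a plain sample already behaves as if the tail had been cut.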
This feels eerily similar to sleep consolidation or synaptic pruning
Sorry apple, SSD is already taken, you can't use that acronym.
Consistency Preservation Update (CPU)
Guided Probability Update (GPU)
History-aware Distillation Driving (HDD)
Probability Smoothing Update (PSU)