OpenAI takes in code, books and articles and produces a model. This model can be used for novel tasks, like paraphrasing your own writing, translating your text to a different language, writing code according to a provided specification, and so on, even if there was nothing in the original corpus that exactly solved your problem.
To produce this model, you need four ingredients: the data, the compute, research effort, and a lot of tedious RLHF work. While OpenAI uses the first without compensating authors (and it has no other option here), the latter three it provides entirely on its own.
People distilling from OpenAI do not create transformative works. They take OpenAI's model and make a model of their own. Both models can do very similar things and are suitable for very similar purposes.
Distillation is just a particularly easy way of making an inexact copy of the model weights. The values of those weights will be very different, just as the values of each pixel in an illicit camera recording of a movie at a cinema are very different from those in the original version, but the net result is the same.
The current winner-takes-all approach to the outcome is wholly inappropriate. AI companies right now are riding atop the shoulders of giants: data, mathematics and science that humanity has painstakingly assembled, discovered, developed and shared over millennia. Now we're saying the companies that tip the point of discovery over into a new era should be our new intellectual overlords?
Not cool.
It's clear that model creators and owners should receive some level of reward for their work, but to discount the intellectual labour of generations as worthless is clearly problematic. Especially given the implications for the workforce and society.
Ultimately we'll need to find a more equitable deal.
Until then, forgive me if I don't have much sympathy for a company that's had its latest model distilled.
The thing is, given the other advances that were outlined in the DeepSeek R1 paper, it's not as if DeepSeek needed to coast on OpenAI's work. The use of GRPO RL, not to mention the training time and resources that were required, is still incredibly impressive, no matter the source of the data. There's a lot that DeepSeek R1 can be credited with in the LLM space today, and it really did signify a number of breakthroughs all at once. Even their identification of naturally emergent CoT through RL was incredibly impressive, and led to it becoming commonplace across LLMs these days.[3]
It's clear that there are many talented researchers on their team (their approach to MoE, with its expert segmentation and expert isolation, is quite interesting), so it would seem strange that with all of that talent they'd resort to distillation for knowledge gathering. I'm not saying that it didn't happen; it absolutely could have. But a lot of the accusations that came from OpenAI/Microsoft at the time seemed more like panic at the stock market's reaction than genuine accusations with evidence behind them... especially given we've not heard anything since then.
https://github.com/GAIR-NLP/O1-Journey https://www.bloomberg.com/news/articles/2025-01-29/microsoft... https://github.com/hkust-nlp/simpleRL-reason
But you can have it both ways. Often, the distinction between fair and unfair is whether you are competing against the authors directly.
Take Ghibli memes, for instance. While they're obviously the result of training on Studio Ghibli content without permission, they don't compete against Studio Ghibli directly. Studio Ghibli doesn't draw memes and ChatGPT doesn't make feature films or copy official artwork; I don't think Studio Ghibli lost anything to the memes, since they're not in the same business. So it could be considered fair use.
Training an LLM on data from a law firm to make a search engine that directly competes against that firm's own search engine is not fair use, and there is legal precedent (Thomson Reuters vs Ross). Training your model from another model to compete against them would be the same kind of thing.
There's plenty of nuance, like how transformative the use is. But it is possible that extracting massive amounts of data is fair use while distillation is not. There are plenty of people at work on the question right now.
Distilled: Two years ago, one of the AI podcasts I was listening to (probably TWIML&AI) had someone use a big model to create a small high-quality training set for another model (as I understand it, this is what Microsoft's Phi series does, but that wasn't the example in whichever podcast I'm thinking of).
And remember, OpenAI's price for a million tokens is a rounding error for most businesses. Last year's reported revenue of USD 3.7 billion* suggests their customers collectively paid them for on the order of a quadrillion tokens in and out, so even getting a trillion tokens from them without them noticing what you're up to (so long as you paid) is very plausible.
* https://www.cnbc.com/2024/09/27/openai-sees-5-billion-loss-t...
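Back-of-the-envelope, and hedged heavily since we don't know OpenAI's actual blended price or what fraction of revenue is API usage:

```python
# Rough check: what does USD 3.7B of revenue imply in tokens,
# and how small a slice of that is a trillion tokens?
# Assumptions (illustrative only): all revenue is API usage, at an assumed
# blended price of ~USD 4 per million tokens across input and output.

revenue_usd = 3.7e9
blended_usd_per_megatoken = 4.0   # assumption, not OpenAI's real pricing mix

total_tokens = revenue_usd / blended_usd_per_megatoken * 1e6
print(f"implied total tokens: {total_tokens:.1e}")   # ~9e14, i.e. order-of a quadrillion

distilled_tokens = 1e12           # a trillion tokens for your own distillation run
print(f"share of total traffic: {distilled_tokens / total_tokens:.2%}")                   # ~0.1%
print(f"cost at that price: ${distilled_tokens / 1e6 * blended_usd_per_megatoken:,.0f}")  # ~$4M
```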
- completion-based methods, where you take a big model, give it some queries, and use the answers to post-train a smaller model. This is what DeepSeek did with the Qwen models: they took ~800k traces made by R1 and ran SFT on smaller Qwen2.5 models. What the Sky team found in their experiments is that you can reach similar results with as few as 1-2k traces. Much cheaper.
- logit/internal-representation based methods, where you need access to the raw model, and for each q -> response pair you train the small model on the entire distribution of the logits at the same time. This is a method suited to model creators, who can take a big + small pair of models of the same architecture and "distill" the big one into the smaller one. This is likely how they train their -flash, -mini, -pico and so on.
The first method can be used via API access. The second one can't. You need access to things that API providers won't give you.
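A minimal PyTorch sketch of the difference, with toy tensors standing in for real models (none of this is DeepSeek's or any lab's actual training code):

```python
import torch
import torch.nn.functional as F

vocab, seq, batch = 32_000, 128, 4

# Stand-ins for real models: anything that maps token IDs to logits.
student_logits = torch.randn(batch, seq, vocab, requires_grad=True)
teacher_logits = torch.randn(batch, seq, vocab)   # only available with raw model access

# --- 1. Completion-based distillation (works over an API) ---
# The "data" is just the teacher's sampled tokens (e.g. reasoning traces);
# the student is trained with ordinary SFT cross-entropy on them.
teacher_tokens = teacher_logits.argmax(dim=-1)    # stand-in for sampled completions
sft_loss = F.cross_entropy(student_logits.reshape(-1, vocab),
                           teacher_tokens.reshape(-1))

# --- 2. Logit-based distillation (needs the raw teacher) ---
# Match the student's distribution to the teacher's softened full distribution
# (the classic soft-target loss from Hinton et al., 2015).
T = 2.0                                           # temperature
kd_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * T * T

print(sft_loss.item(), kd_loss.item())
```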
"Considering that the distillation requires access to the innards of the teacher model, it’s not possible for a third party to sneakily distill data from a closed-source model like OpenAI’s o1, as DeepSeek was thought to have done. That said, a student model could still learn quite a bit from a teacher model just through prompting the teacher with certain questions and using the answers to train its own models — an almost Socratic approach to distillation."
https://malted.ai/deepseek-and-the-future-of-distillation/
While Anthropic and OpenAI are still trying to make sense of what China's top computer scientists pulled off a year ago, something that shook the core of Nvidia's business, China is now showcasing the world's first commercial unhackable cryptography system using QKD and post-quantum cryptography to secure all phone calls between Beijing and Hefei.
The whole reason they're accusing them of distilling their models is that this was a well-known technique that's relatively easy compared to creating or improving on one in the first place. DeepSeek was impressive for how lean it was (and it shook the markets because it demonstrated obviously what the savvier observers had already figured, that the big AI companies in the US didn't have a huge moat), but they certainly did not come up with this concept.
You need to run as hard as you can just to stay where you are, and once you've got the answer it's very much easier to reproduce the result.
This is of course also what annoys a certain fraction of commenters in every discussion about LLMs (and in art, diffusion models): they're overwhelmingly learning from the examples made by others, not investigating things for themselves.
While many scientists will have had an encounter like Katie Mack's viral tweet* (someone who doesn't know what "research" even is in the first place, and who mistakes "the first thing I read" for such research), the fact that many humans also do this doesn't make the point wrong when it's about AI.
* https://paw.princeton.edu/article/katie-mack-09-taming-troll
Do you agree that OpenAI and Anthropic are still claiming they need more data centres and more Nvidia servers to win the AI race, while still trying to understand what China actually did and how they did it?
> Do you agree that OpenAI and Anthropic are still claiming they need more data centres and more Nvidia servers to win the AI race
Yes. Red Queen[0].
> while still trying to understand what China actually did and how they did it?
No. Egg of Columbus[1]. They're well aware of what DeepSeek did. Just as DeepSeek could easily reproduce American models, the DeepSeek models are not particularly challenging works for any other AI company to follow, understand, and build upon. Here's someone else's reproduction of what they did: https://huggingface.co/blog/open-r1
That it's so easy for these companies to keep up with each other is *the reason why* there's a Red Queen[0] race.
What? Distillation is way older. The Hinton paper was from 2015 (maybe there is even earlier work):
https://arxiv.org/abs/1503.02531
When I was still in academia, we were distilling models from BERT/RoBERTa-large down to smaller models (remember when those models were considered large?) in 2019, using logits and the L2 distance of hidden layers. Before that we were also distilling our own transformer/LSTM models on model outputs (though with a different motivation than model compression: to learn selectional preferences, etc.).
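For the curious, a toy version of that kind of loss, with made-up shapes and weightings rather than anything from a specific paper:

```python
import torch
import torch.nn.functional as F

hidden_teacher, hidden_student, vocab = 1024, 256, 30_000

# Toy stand-ins for teacher/student hidden states and output logits.
h_t = torch.randn(8, hidden_teacher)
h_s = torch.randn(8, hidden_student, requires_grad=True)
logits_t = torch.randn(8, vocab)
logits_s = torch.randn(8, vocab, requires_grad=True)

# Project the student's hidden state up to the teacher's width before comparing.
proj = torch.nn.Linear(hidden_student, hidden_teacher)
hidden_loss = F.mse_loss(proj(h_s), h_t)               # L2 distance of hidden layers

soft_loss = F.kl_div(F.log_softmax(logits_s, dim=-1),
                     F.softmax(logits_t, dim=-1),
                     reduction="batchmean")             # match output distributions

loss = soft_loss + 0.5 * hidden_loss                    # the weighting here is arbitrary
loss.backward()
```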
[Edit] My bad, I thought I was commenting on Anthropic's article
Subliminal learning is a surprising result that sheds more light on the process of distillation. It's not Anthropic trying to take credit for distillation.
In particular subliminal learning is the finding that a student model distilled from a teacher model has a communication channel with the teacher model that is extremely difficult to observe or oversee.
If you later fine-tune the teacher model on a very specific thing (in Anthropic's case fine-tuning the teacher to prefer owls over other animals) and then simply prompt the teacher model to output "random" digits with no reference to owls whatsoever, simply training the student model on this stream of digits results in the student model also developing a preference for owls over other animals.
This is a novel result and has a lot of interesting implications both for how distillation works as a mechanism and also for novel problems in overseeing AI systems.
https://alignment.anthropic.com/2025/subliminal-learning/
Regarding your comment, yes, it's well known in the ML world that machines are way better than humans at picking up on correlations. In other words, the output of a model can carry traces of its internal state, so if another model is trained on those outputs, it can end up learning the patterns behind them.
What's contradictory is hearing companies say: "We wrote the software, but we don't fully understand what it's doing once it's trained on trillions of tokens. The complexity is so high that weird behaviours emerge."
And yet, at the same time, they're offering an API to developers, startups, and enterprise customers as if it's totally safe and reliable while openly admitting they don't fully know what's going on under the hood.
Question:
Why did Anthropic make its API publicly available? To share responsibility and distribute the ethical risk with developers, startups, and enterprise customers, hoping that widespread use would eventually normalise training models on copyrighted materials and influence legal systems over time?
Why are they saying "we don't know what's going on, but here's our API"? It's like Boeing saying: "Our autopilot's been acting up in unpredictable ways lately, but don't worry, your flight's on time. Please proceed to the gate.”
So many red flags.
Are we really going to need all those giant AI data centers?
What a fantastic non sequitur
Still - shouldn't be more than a few buckets of fat, if you only do the NREM "training" bit of sleep.
And don't try to run too long on a couple of bananas: the brain is not just there to infer, it also needs to manage its autonomous transport system, which requires much more energy itself.
(You can use an LLM to check this work at the cost of a tiny speck of a banana, eg: https://grok.com/share/c2hhcmQtMw%3D%3D_60f4890d-711b-4331-9... )
There exists an unfounded myth surrounding the extreme energy costs of silicon-based inference, which is far from reality.
"transistors vs. *synapses*"
or "an entire integrated computer with all necessary cooling, including a modifier to account for the amortised training effort required to achieve human-quality output vs. the amortised energy requirements and output of a human over their lifetime".
Has to be human-quality output to be a fair comparison; a million lines of gibberish is worthless. The human has to be educated up until 21 or so to be economically viable, retires in their late 60s, works 25% of the hours in a working week (but not at all in non-working weeks, e.g. holiday, sickness, periods of unemployment; and while parental leave is work, it isn't the specific work that people want to pay you for), and the brain itself is only ~20% of a human's calorific consumption.
In the (currently quite small number of) tasks where the AI we have is good enough to replace human labour, for some models it is already in the range where the marginal energy cost for inference is smaller than the energy cost (in food calories) to get a human to do the same thing.
But also, last I checked the peak performance of LLMs is not as high as a domain expert at anything, so even infinite cost put into the AI isn't going to equal them. On the other hand, human intelligence is not equal for all of us, so I find it very easy to believe that there's a significant fraction of the population who will always, over their lifetime, be behind today's SOTA AI, and therefore infinite time and energy for them isn't ever going to equal the AI we already have.
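For a sense of scale, the amortised accounting that comparison implies looks something like this (round, illustrative numbers only, not measurements):

```python
# Rough amortised energy cost of an hour of human knowledge work.
kcal_per_day   = 2000         # whole-body intake; the brain alone is ~20% of this
working_years  = 67 - 21      # educated until ~21, retires in the late 60s
lifetime_years = 80
work_fraction  = 0.25         # fraction of a week's hours actually worked; this sketch
                              # ignores holidays, sickness, unemployment, etc.

lifetime_kcal = kcal_per_day * 365 * lifetime_years
working_hours = working_years * 365 * 24 * work_fraction
kcal_per_hour = lifetime_kcal / working_hours
kwh_per_hour  = kcal_per_hour / 860                     # 1 kWh is roughly 860 kcal

print(f"~{kcal_per_hour:.0f} kcal (~{kwh_per_hour:.2f} kWh) of food per productive hour")
```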
> V3/R1 scale models as a baseline, one can produce 720,000 tokens
On what hardware? At how many tokens per second? But most importantly, at what quality? I can use a PRNG to generate 7 billion tokens at a fraction of the energy use of an LLM, but those tokens are not going to be particularly interesting. Simply counting how many tokens can be generated in a given time frame is still not a like-for-like comparison. To be complete, the cost required to match human-level quality, if that's possible at all, also needs accounting for.
> Deeply thinking humans expend up to a third of their total energy on the brain
Where did you get this from? A 70B LLM? It's wrong, or at best doesn't make sense. The brain barely spends any more energy above its baseline when thinking hard (often not much more than 5%). This is because most of its energy use goes on things like upkeep and maintaining resting membrane potential. Ongoing "background activity" like the DMN also means the brain is always actively computing something interesting.
I don't think that current models are at expert level, but they do seem to be reliably good enough to be useful, to pass standardised tests, and to sit solidly in the "good enough that you have to pay close attention for a while before you notice the stupid mistake" zone that makes them very irritating for anyone running job interviews or publishing books, etc.
And worse, I also think the numbers you're replying to are, at best, off by a few decimal places.
If I take the 0.36 bananas (which was already suspicious) and USD 0.1/kWh, I get 0.004 USD. If I scale that up by 1/0.72 to get a megatoken, that's still only 5/9ths of a cent.
If I make the plausible but not necessarily correct assumption that OpenAI's API prices reflect the cost of electricity, none of their models are even remotely that cheap. It's close enough to the cost of their text-embedding-3-small (per megatoken) to be within the fudge factor of my assumption about how much of their prices are electricity costs, but embedding models are much, much weaker than full generative models, to the point that they're not worth considering in the same discussion unless you're making an academic point.
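Spelling out that arithmetic, assuming roughly 100 kcal per banana (bananas vary):

```python
kcal_per_banana = 100                 # assumption
kwh_per_kcal    = 1 / 860
usd_per_kwh     = 0.10

bananas_per_720k_tokens = 0.36        # the grandparent's figure
usd_per_720k = bananas_per_720k_tokens * kcal_per_banana * kwh_per_kcal * usd_per_kwh
usd_per_megatoken = usd_per_720k / 0.72

print(f"${usd_per_720k:.4f} per 720k tokens, ${usd_per_megatoken:.4f} per million tokens")
# ~= $0.004 per 720k tokens, i.e. around half a cent per million tokens
```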
> It's wrong or at best, does not make sense. The brain barely spends any more energy above its baseline when thinking hard (often not much more than 5%).
Indeed.
Now I'm wondering: how much power does the human brain use during an epileptic fit? That seems like it could plausibly be 70% of calories for the few seconds of the seizure? But I've only got a GCSE grade C in biology, so even with what I picked up over the subsequent 25 years of general geeking, my idea of "plausible" is very weak.
This assumption is very wrong. The primary cost factor in inference is the GPU itself. NVidia's profit margins are very high; so are OpenAI's margins on API usage, even after taking the cost of the GPUs into account. You can understand their margins if you read about inference at scale, and the lmsys blog in my parallel answer is a decent eye-opener if you thought that companies sell tokens close to the price of electricity.
The hardware is the GB200 NVL72 by NVidia. This is for the class of 671B DeepSeek models, eg R1-0528 or V3, with their full accuracy setup (ie reproducing the quality of the reported DeepSeek benchmarks). Here is the writeup (by humans; the second figure shows the tokens per second per GPU as a function of the batch size, which emphasizes the advantages of centralized decoding, compared to current hacks at home): https://lmsys.org/blog/2025-06-16-gb200-part-1/
And here are the instructions to replicate the particular benchmark: https://github.com/sgl-project/sglang/issues/7227
The LLM text I linked in my original answer carries out the math using the energy consumption of the NVidia hardware setup (120kW) and rather simple arithmetic, which you can reproduce.
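Roughly, the arithmetic goes like this; the per-GPU throughput below is a placeholder you'd read off the lmsys writeup, not a number I'm vouching for:

```python
rack_power_kw       = 120      # GB200 NVL72, as cited above
gpus                = 72
tok_per_sec_per_gpu = 7_500    # placeholder; take the real decode figure from the lmsys blog

tok_per_hour = tok_per_sec_per_gpu * gpus * 3600
kwh_per_megatoken = rack_power_kw / (tok_per_hour / 1e6)

usd_per_kwh = 0.10
print(f"{kwh_per_megatoken:.3f} kWh (~${kwh_per_megatoken * usd_per_kwh:.4f}) per million tokens")
```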
We'll always find uses for more intelligence if it keeps getting more and more general (I don't like the term AGI because I think the "G" there is a quantity, not a quality, and humans are very low on generality too, compared to what could be mathematically and physically possible for intelligence in our universe).
...we won't stop until the planet is papered with compute hardware UNLESS we accelerate space development too (that's why SPACE is CRUCIAL!) and go grind the asteroid belt into thousands of datacenters too, then on and on.
There's a whole yummy lightcone that awaits to be eaten :P
But the models are RAM limited, not compute limited, and there's no reason consumer devices need to keep their current RAM limits. Put 256 GB of RAM in your phone and an LLM may drain the battery in 15 minutes, and I have no idea about the bus bandwidth, but the NPU (e.g. the Neural Engine in Apple SoCs for the last few years) is already enough for the compute part of the problem.
I sincerely doubt that o3/2.5 Pro haven't been distilled. It's unimaginable to me that they're that price-insensitive (or, expressed inversely, that they were so thrifty in training that the final product can be served for consumer usage without optimization).
the only conclusion I can come to is that they're indeed not letting you access the "root" models.
That's absolutely getting distilled down for releases.
My apologies for not being able to find the original tale. I’m sure the original website is around but this is a decent synopsis regardless.
Doesn't look like they cover it in the article, but if I remember correctly they pruned the model down to fit on a 56K EPROM so it could originally be sold for $10 (also dating myself; this article claims $15).
And of course the jargon has changed with time. I guess we're saying "distilled" now; originally we said "pruned", because that's what you did: once you had your weights, you would prune the rest of the network to get the core model. I guess "distilled" works also, just less literal imho. I guess if we want to get really pedantic, networks exist in liquids too, but I digress.
[1] (apologies for the ad crap, best I could find) https://www.mentalfloss.com/article/22269/how-electronic-20-...
To the credit of the naysayers at the time, Hotmail was still the primary free email service and Gmail had yet to come out. Google was buying up the dark fiber and had yet to open up its excess compute, starting the arms race for the cloud. Most still thought of GPUs only for graphics, even though their architecture and intent had been there since their inception at Thinking Machines…
pruning: discarding low weight connections after training, makes the network sparser but also less regular (complications for memory layout, and compute kernels to access the sparse network weights).
distilling: take a large pretrained model, and train a smaller one from it, for example consider a cloze task (fill the blanked token in a sentence), then compute the probabilities using the large model, and train the smaller model to reproduce the same probabilities
distilling is a form of fitting into a smaller regular network, of potentially totally different architecture, while pruning is a form of discarding low weight coefficients resulting in a sparser network.
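A toy illustration of the difference in code (arbitrary shapes and thresholds, just to show where the two techniques act):

```python
import torch
import torch.nn.functional as F

# --- Pruning: zero out low-magnitude weights in the *same* network ---
layer = torch.nn.Linear(512, 512)
with torch.no_grad():
    mask = layer.weight.abs() > 0.02           # keep only the "important" connections
    layer.weight *= mask                        # same architecture, now sparse/irregular
print(f"kept {mask.float().mean():.1%} of weights")

# --- Distillation: train a *different, smaller* network to mimic the big one ---
teacher = torch.nn.Sequential(torch.nn.Linear(512, 2048), torch.nn.ReLU(), torch.nn.Linear(2048, 100))
student = torch.nn.Sequential(torch.nn.Linear(512, 128), torch.nn.ReLU(), torch.nn.Linear(128, 100))

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for _ in range(100):
    x = torch.randn(32, 512)                    # e.g. cloze-style inputs
    with torch.no_grad():
        target = F.softmax(teacher(x), dim=-1)  # probabilities from the large model
    loss = F.kl_div(F.log_softmax(student(x), dim=-1), target, reduction="batchmean")
    opt.zero_grad(); loss.backward(); opt.step()
```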
https://malted.ai/deepseek-and-the-future-of-distillation/
Honest question:
Isn't this exactly what the DeepSeek team did, and now Anthropic is repackaging it a year later, calling it “subliminal learning” or using the teacher and student analogy to take credit for work done by Chinese researchers?
It's like if China claimed they invented the Transformer by renaming it the “Pattern Matching architecture.”
Why is Anthropic doing this? Isn't this the same company that recently scraped 7 million books? And now they’re “transforming” research papers too?
No, distillation and student/teacher training is a well-known technique (much older than even the original ChatGPT), and Anthropic are not claiming to have invented it (that would be laughable to anyone familiar with the field). "Subliminal learning" is an observation by Anthropic about something surprising that can happen during the process: for sufficiently similar models, behaviour can be transferred from teacher to student that is not obviously present in the information passed between them (i.e. the text output by the teacher and used to train the student). For example, the student's "favourite animal" changed despite the fact that the teacher was only producing "random" numbers for the student to try to predict.
By "behaviour" they mean data and pattern matching, right? Alan Turing figured that out in the 1940s.
LLMs aren't black boxes doing voodoo, like we like to tell politicians and regulators. They're just software processing massive amounts of data to find patterns and predict what comes next. It looks magical, but it's maths and stats, not magic.
This post is just selling second-hand ideas. And for those of us outside the US who spend all day reading scientific papers, sorry Anthropic, we're not buying it.
That's like saying Da Vinci figured out heavier-than-air flight. Useful foundation, obviously smart and on the right track, still didn't actually do enough to get all the credit for that.
> It looks magical, but it's maths and stats, not magic.
People keep saying "AI isn't magic, it's just maths" like this is some kind of gotcha.
Turning lead into gold isn't the magic of alchemy, it's just nucleosynthesis.
Taking a living human's heart out without killing them, and replacing it with one you got out a corpse, that isn't the magic of necromancy, neither is it a prayer or ritual to Sekhmet, it's just transplant surgery.
And so on: https://www.lesswrong.com/posts/hAwvJDRKWFibjxh4e/it-isn-t-m...
Even with access to the numbers and mechanisms, the inner workings of LLMs are as clear as mud and still full of surprises. Anthropic's work was, to many people, one such surprise.
And plenty of software involves real lives, real bodies, and no second chances, e.g. Therac-25.
Unfortunately for all of us, it does look rather like people are already using clear-as-mud AI models for life-critical processes.
That said, I get your point, LLMs can be unpredictable because of the huge amount of data they're trained on and the quality of that data. You never really know what patterns they'll pick up or how they'll behave in edge cases, especially when the outputs aren't deterministic.
You think one of them is magic?
If not, you're being needlessly pedantic as well as wrong.
> But unlike surgery, code is testable.
Surgeries are tested. Practice sessions are made. Animal tests for the general idea, cadavers to learn about humans, models for specific patients.
And code is, sadly, often pushed live without testing. Kills people, even.
The distillation paper added minor parameter tweaks and had a fancier name, but the essence of the method came from Caruana et al.'s model compression paper: https://dl.acm.org/doi/abs/10.1145/1150402.1150464