OpenAI takes in code, books and articles and produces a model. This model can be used for novel tasks, like paraphrasing your own writing, translating your text to a different language, writing code according to a provided specification, and so on, even if there was nothing in the original corpus that exactly solved your problem.
To produce this model, you need four ingredients: the data, the compute, research effort, and a lot of tedious RLHF work. While OpenAI uses the first without compensating authors (and it has no other option here), the latter three it provides entirely on its own.
People distilling from OpenAI do not create transformative works. They take OpenAI's model and make a model of their own. Both models can do very similar things and are suitable for very similar purposes.
Distillation is just a particularly easy way of making an inexact copy of the model weights. The values of those weights will be very different, just as the values of each pixel in an illicit camera recording of a movie at a cinema are very different from those in the original version, but the net result is the same.
The current winner-takes-all approach to the outcome is wholly inappropriate. AI companies right now are riding atop the shoulders of giants: data, mathematics and science that humanity has painstakingly assembled, discovered, developed and shared over millennia. Now we're saying the companies that tip the point of discovery over into a new era should be our new intellectual overlords?
Not cool.
It's clear that model creators and owners should receive some level of reward for their work, but to discount the intellectual labour of generations as worthless is clearly problematic. Especially given the implications for the workforce and society.
Ultimately we'll need to find a more equitable deal.
Until then, forgive me if I don't have much sympathy for a company that's had its latest model distilled.
The thing is, given the other advances that were outlined in the DeepSeek R1 paper, it's not as if DeepSeek needed to coast on OpenAI's work. The use of GRPO RL, not to mention the training time and resources that were required, is still incredibly impressive, no matter the source of the data. There's a lot that DeepSeek R1 can be credited with in the LLM space today, and it really did signify a number of breakthroughs all at once. Even their identification of naturally emergent CoT through RL was incredibly impressive, and led to it becoming commonplace across LLMs these days.[3]
It's clear that there are many talented researchers on their team (their approach to MoE, with its expert segmentation and expert isolation, is quite interesting), so it would seem strange that with all of that talent they'd resort to distillation for knowledge gathering. I'm not saying that it didn't happen; it absolutely could have. But a lot of the accusations that came from OpenAI/Microsoft at the time seemed more like panic at the stock market's reaction than genuine accusations with evidence behind them... especially given we've not heard anything since then.
https://github.com/GAIR-NLP/O1-Journey https://www.bloomberg.com/news/articles/2025-01-29/microsoft... https://github.com/hkust-nlp/simpleRL-reason
But you can have it both ways. Often, the distinction between fair and unfair is whether you are competing against the authors directly.
Take Ghibli memes, for instance. While they're obviously the result of training on Studio Ghibli content without permission, they don't compete against Studio Ghibli directly. Studio Ghibli doesn't draw memes and ChatGPT doesn't make feature films or copy official artwork; I don't think Studio Ghibli lost anything to the memes, since they're not in the same business. So it could be considered fair use.
Training an LLM on data from a law firm to make a search engine that directly competes against that firm's own search engine is not fair use, and there is legal precedent (Thomson Reuters vs Ross). Training your model from another model to compete against them would be the same kind of thing.
There's plenty of nuance, like how transformative the use is. But it is possible that extracting massive amounts of data is fair use while distillation is not. There are plenty of people at work on the question right now.
Distilled: Two years ago, one of the AI podcasts I was listening to (probably TWIML&AI) had someone use a big model to create a small high-quality training set for another model (as I understand it, this is what Microsoft's Phi series does, but that wasn't the example in whichever podcast I'm thinking of).
And remember, OpenAI's price for a million tokens is a rounding error for most businesses. Last year's reported revenue of USD 3.7 billion* suggests their customers collectively paid them for on the order of a quadrillion tokens in and out, so even getting a trillion tokens from them without them noticing what you're up to (so long as you paid) is very plausible.
* https://www.cnbc.com/2024/09/27/openai-sees-5-billion-loss-t...
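Back-of-the-envelope, and hedged heavily since we don't know OpenAI's actual blended price or what fraction of revenue is API usage:

```python
# Rough check: what does USD 3.7B of revenue imply in tokens,
# and how small a slice of that is a trillion tokens?
# Assumptions (illustrative only): all revenue is API usage, at an assumed
# blended price of ~USD 4 per million tokens across input and output.

revenue_usd = 3.7e9
blended_usd_per_megatoken = 4.0   # assumption, not OpenAI's real pricing mix

total_tokens = revenue_usd / blended_usd_per_megatoken * 1e6
print(f"implied total tokens: {total_tokens:.1e}")   # ~9e14, i.e. order-of a quadrillion

distilled_tokens = 1e12           # a trillion tokens for your own distillation run
print(f"share of total traffic: {distilled_tokens / total_tokens:.2%}")                   # ~0.1%
print(f"cost at that price: ${distilled_tokens / 1e6 * blended_usd_per_megatoken:,.0f}")  # ~$4M
```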
- completion-based methods, where you take a big model, give it some queries, and use the answers to post-train a smaller model. This is what DeepSeek did with the Qwen models: they took ~800k traces made by R1 and ran SFT on smaller Qwen2.5 models. What the Sky team found in their experiments is that you can reach similar results with as few as 1-2k traces. Much cheaper.
- logit/internal-representation based methods, where you need access to the raw model, and for each q -> response pair you train the small model on the entire distribution of the logits at the same time. This is a method suited to model creators, who can take a big + small pair of models of the same architecture and "distill" the big one into the smaller one. This is likely how they train their -flash, -mini, -pico and so on.
The first method can be used via API access. The second one can't. You need access to things that API providers won't give you.
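A minimal PyTorch sketch of the difference, with toy tensors standing in for real models (none of this is DeepSeek's or any lab's actual training code):

```python
import torch
import torch.nn.functional as F

vocab, seq, batch = 32_000, 128, 4

# Stand-ins for real models: anything that maps token IDs to logits.
student_logits = torch.randn(batch, seq, vocab, requires_grad=True)
teacher_logits = torch.randn(batch, seq, vocab)   # only available with raw model access

# --- 1. Completion-based distillation (works over an API) ---
# The "data" is just the teacher's sampled tokens (e.g. reasoning traces);
# the student is trained with ordinary SFT cross-entropy on them.
teacher_tokens = teacher_logits.argmax(dim=-1)    # stand-in for sampled completions
sft_loss = F.cross_entropy(student_logits.reshape(-1, vocab),
                           teacher_tokens.reshape(-1))

# --- 2. Logit-based distillation (needs the raw teacher) ---
# Match the student's distribution to the teacher's softened full distribution
# (the classic soft-target loss from Hinton et al., 2015).
T = 2.0                                           # temperature
kd_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * T * T

print(sft_loss.item(), kd_loss.item())
```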
"Considering that the distillation requires access to the innards of the teacher model, it’s not possible for a third party to sneakily distill data from a closed-source model like OpenAI’s o1, as DeepSeek was thought to have done. That said, a student model could still learn quite a bit from a teacher model just through prompting the teacher with certain questions and using the answers to train its own models — an almost Socratic approach to distillation."
https://malted.ai/deepseek-and-the-future-of-distillation/
While Anthropic and OpenAI are still trying to make sense of what China's top computer scientists pulled off a year ago, something that shook the core of Nvidia's business, China is now showcasing the world's first commercial unhackable cryptography system using QKD and post-quantum cryptography to secure all phone calls between Beijing and Hefei.
The whole reason they're accusing them of distilling their models is that this was a well-known technique that's relatively easy compared to creating or improving on one in the first place. DeepSeek was impressive for how lean it was (and it shook the markets because it demonstrated obviously what the savvier observers had already figured, that the big AI companies in the US didn't have a huge moat), but they certainly did not come up with this concept.
You need to run as hard as you can just to stay where you are, and once you've got the answer it's very much easier to reproduce the result.
This is of course also what annoys a certain fraction of commenters in every discussion about LLMs (and in art, diffusion models): they're overwhelmingly learning from the examples made by others, not investigating things for themselves.
While many scientists will have had an encounter like Katie Mack's viral tweet* (someone who doesn't know what "research" even is in the first place, and who mistakes "the first thing I read" for such research), the fact that many humans also do this doesn't make the point wrong when it's about AI.
* https://paw.princeton.edu/article/katie-mack-09-taming-troll
Do you agree that OpenAI and Anthropic are still claiming they need more data centres and more Nvidia servers to win the AI race, while still trying to understand what China actually did and how they did it?
> Do you agree that OpenAI and Anthropic are still claiming they need more data centres and more Nvidia servers to win the AI race
Yes. Red Queen[0].
> while still trying to understand what China actually did and how they did it?
No. Egg of Columbus[1]. They're well aware of what DeepSeek did. Just as DeepSeek could easily reproduce American models, the DeepSeek models are not particularly challenging works for any other AI company to follow, understand, and build upon. Here's someone else's reproduction of what they did: https://huggingface.co/blog/open-r1
That it's so easy for these companies to keep up with each other is *the reason why* there's a Red Queen[0] race.
What? Distillation is way older. The Hinton paper was from 2015 (maybe there is even earlier work):
https://arxiv.org/abs/1503.02531
When I was still in academia, we were distilling models from BERT/RoBERTa-large down to smaller models (remember when those models were considered large?) in 2019, using logits and the L2 distance of hidden layers. Before that we were also distilling our own transformer/LSTM models on model outputs (though with a different motivation than model compression: to learn selectional preferences, etc.).
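For the curious, a toy version of that kind of loss, with made-up shapes and weightings rather than anything from a specific paper:

```python
import torch
import torch.nn.functional as F

hidden_teacher, hidden_student, vocab = 1024, 256, 30_000

# Toy stand-ins for teacher/student hidden states and output logits.
h_t = torch.randn(8, hidden_teacher)
h_s = torch.randn(8, hidden_student, requires_grad=True)
logits_t = torch.randn(8, vocab)
logits_s = torch.randn(8, vocab, requires_grad=True)

# Project the student's hidden state up to the teacher's width before comparing.
proj = torch.nn.Linear(hidden_student, hidden_teacher)
hidden_loss = F.mse_loss(proj(h_s), h_t)               # L2 distance of hidden layers

soft_loss = F.kl_div(F.log_softmax(logits_s, dim=-1),
                     F.softmax(logits_t, dim=-1),
                     reduction="batchmean")             # match output distributions

loss = soft_loss + 0.5 * hidden_loss                    # the weighting here is arbitrary
loss.backward()
```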
[Edit] My bad, I thought I was commenting on Anthropic's article
Subliminal learning is a surprising result that sheds more light on the process of distillation. It's not Anthropic trying to take credit for distillation.
In particular subliminal learning is the finding that a student model distilled from a teacher model has a communication channel with the teacher model that is extremely difficult to observe or oversee.
If you later fine-tune the teacher model on a very specific thing (in Anthropic's case fine-tuning the teacher to prefer owls over other animals) and then simply prompt the teacher model to output "random" digits with no reference to owls whatsoever, simply training the student model on this stream of digits results in the student model also developing a preference for owls over other animals.
This is a novel result and has a lot of interesting implications both for how distillation works as a mechanism and also for novel problems in overseeing AI systems.
https://alignment.anthropic.com/2025/subliminal-learning/
Regarding your comment, yes, it's well known in the ML world that machines are way better than humans at picking up on correlations. In other words, the output of a model can carry traces of its internal state, so if another model is trained on those outputs, it can end up learning the patterns behind them.
What's contradictory is hearing companies say: "We wrote the software, but we don't fully understand what it's doing once it's trained on trillions of tokens. The complexity is so high that weird behaviours emerge."
And yet, at the same time, they're offering an API to developers, startups, and enterprise customers as if it's totally safe and reliable while openly admitting they don't fully know what's going on under the hood.
Question:
Why did Anthropic make its API publicly available? To share responsibility and distribute the ethical risk with developers, startups, and enterprise customers, hoping that widespread use would eventually normalise training models on copyrighted materials and influence legal systems over time?
Why are they saying "we don't know what's going on, but here's our API"? It's like Boeing saying: "Our autopilot's been acting up in unpredictable ways lately, but don't worry, your flight's on time. Please proceed to the gate.”
So many red flags.
Are we really going to need all those giant AI data centers?
What a fantastic non sequitur
Still - shouldn't be more than a few buckets of fat, if you only do the NREM "training" bit of sleep.
And don't try to run too long on a couple of bananas: the brain is not just there to infer, it also needs to manage its autonomous transport system, which requires much more energy itself.
(You can use an LLM to check this work at the cost of a tiny speck of a banana, eg: https://grok.com/share/c2hhcmQtMw%3D%3D_60f4890d-711b-4331-9... )
There exists an unfounded myth surrounding the extreme energy costs of silicon-based inference, which is far from reality.
"transistors vs. *synapses*"
or "an entire integrated computer with all necessary cooling, including a modifier to account for the amortised training effort required to achieve human-quality output vs. the amortised energy requirements and output of a human over their lifetime".
Has to be human-quality output to be a fair comparison; a million lines of gibberish is worthless. The human has to be educated up until 21 or so to be economically viable, retires in their late 60s, works 25% of the hours in a working week (but not at all in non-working weeks, e.g. holiday, sickness, periods of unemployment; and while parental leave is work, it isn't the specific work that people want to pay you for), and the brain itself is only ~20% of a human's calorific consumption.
In the (currently quite small number of) tasks where the AI we have is good enough to replace human labour, for some models it is already in the range where the marginal energy cost for inference is smaller than the energy cost (in food calories) to get a human to do the same thing.
But also, last I checked the peak performance of LLMs is not as high as a domain expert at anything, so even infinite cost put into the AI isn't going to equal them. On the other hand, human intelligence is not equal for all of us, so I find it very easy to believe that there's a significant fraction of the population who will always, over their lifetime, be behind today's SOTA AI, and therefore infinite time and energy for them isn't ever going to equal the AI we already have.
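For a sense of scale, the amortised accounting that comparison implies looks something like this (round, illustrative numbers only, not measurements):

```python
# Rough amortised energy cost of an hour of human knowledge work.
kcal_per_day   = 2000         # whole-body intake; the brain alone is ~20% of this
working_years  = 67 - 21      # educated until ~21, retires in the late 60s
lifetime_years = 80
work_fraction  = 0.25         # fraction of a week's hours actually worked; this sketch
                              # ignores holidays, sickness, unemployment, etc.

lifetime_kcal = kcal_per_day * 365 * lifetime_years
working_hours = working_years * 365 * 24 * work_fraction
kcal_per_hour = lifetime_kcal / working_hours
kwh_per_hour  = kcal_per_hour / 860                     # 1 kWh is roughly 860 kcal

print(f"~{kcal_per_hour:.0f} kcal (~{kwh_per_hour:.2f} kWh) of food per productive hour")
```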
> V3/R1 scale models as a baseline, one can produce 720,000 tokens
On what hardware? At how many tokens per second? But most importantly, at what quality? I can use a PRNG to generate 7 billion tokens at a fraction of the energy use of an LLM, but those tokens are not going to be particularly interesting. Simply counting how many tokens can be generated in a given time frame is still not a like-for-like comparison. To be complete, the cost required to match human-level quality, if that's possible at all, also needs accounting for.
> Deeply thinking humans expend up to a third of their total energy on the brain
Where did you get this from? A 70B LLM? It's wrong, or at best doesn't make sense. The brain barely spends any more energy above its baseline when thinking hard (often not much more than 5%). This is because most of its energy use goes on things like upkeep and maintaining resting membrane potential. Ongoing "background activity" like the DMN also means the brain is always actively computing something interesting.
I don't think that current models are at expert level, but they do seem to be reliably good enough to be useful, to pass standardised tests, and to sit solidly in the "good enough that you have to pay close attention for a while before you notice the stupid mistake" zone that makes them very irritating for anyone running job interviews or publishing books, etc.
And worse, I also think the numbers you're replying to are, at best, off by a few decimal places.
If I take the 0.36 bananas (which was already suspicious) and USD 0.1/kWh, I get 0.004 USD. If I scale that up by 1/0.72 to get a megatoken, that's still only 5/9ths of a cent.
If I make the plausible but not necessarily correct assumption that OpenAI's API prices reflect the cost of electricity, none of their models are even remotely that cheap. It's close enough to the cost of their text-embedding-3-small (per megatoken) to be within the fudge factor of my assumption about how much of their prices are electricity costs, but embedding models are much, much weaker than full generative models, to the point that they're not worth considering in the same discussion unless you're making an academic point.
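Spelling out that arithmetic, assuming roughly 100 kcal per banana (bananas vary):

```python
kcal_per_banana = 100                 # assumption
kwh_per_kcal    = 1 / 860
usd_per_kwh     = 0.10

bananas_per_720k_tokens = 0.36        # the grandparent's figure
usd_per_720k = bananas_per_720k_tokens * kcal_per_banana * kwh_per_kcal * usd_per_kwh
usd_per_megatoken = usd_per_720k / 0.72

print(f"${usd_per_720k:.4f} per 720k tokens, ${usd_per_megatoken:.4f} per million tokens")
# ~= $0.004 per 720k tokens, i.e. around half a cent per million tokens
```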
> It's wrong or at best, does not make sense. The brain barely spends any more energy above its baseline when thinking hard (often not much more than 5%).
Indeed.
Now I'm wondering: how much power does the human brain use during an epileptic fit? That seems like it could plausibly be 70% of calories for the few seconds of the seizure? But I've only got a GCSE grade C in biology, so even with what I picked up over the subsequent 25 years of general geeking, my idea of "plausible" is very weak.
This assumption is very wrong. The primary cost factor in inference is the GPU itself. NVidia's profit margins are very high; so are OpenAI's margins on API usage, even after taking the cost of the GPUs into account. You can understand their margins if you read about inference at scale, and the lmsys blog in my parallel answer is a decent eye-opener if you thought that companies sell tokens close to the price of electricity.
The hardware is the GB200 NVL72 by NVidia. This is for the class of 671B DeepSeek models, eg R1-0528 or V3, with their full accuracy setup (ie reproducing the quality of the reported DeepSeek benchmarks). Here is the writeup (by humans; the second figure shows the tokens per second per GPU as a function of the batch size, which emphasizes the advantages of centralized decoding, compared to current hacks at home): https://lmsys.org/blog/2025-06-16-gb200-part-1/
And here are the instructions to replicate the particular benchmark: https://github.com/sgl-project/sglang/issues/7227
The LLM text I linked in my original answer carries out the math using the energy consumption of the NVidia hardware setup (120kW) and rather simple arithmetic, which you can reproduce.
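Roughly, the arithmetic goes like this; the per-GPU throughput below is a placeholder you'd read off the lmsys writeup, not a number I'm vouching for:

```python
rack_power_kw       = 120      # GB200 NVL72, as cited above
gpus                = 72
tok_per_sec_per_gpu = 7_500    # placeholder; take the real decode figure from the lmsys blog

tok_per_hour = tok_per_sec_per_gpu * gpus * 3600
kwh_per_megatoken = rack_power_kw / (tok_per_hour / 1e6)

usd_per_kwh = 0.10
print(f"{kwh_per_megatoken:.3f} kWh (~${kwh_per_megatoken * usd_per_kwh:.4f}) per million tokens")
```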
We'll always find uses for more intelligence if it keeps getting more and more general (I don't like the term AGI because I think the "G" there is a quantity, not a quality, and humans are very low on generality too, compared to what could be mathematically and physically possible for intelligence in our universe).
...we won't stop until the planet is papered with compute hardware UNLESS we accelerate space development too (that's why SPACE is CRUCIAL!) and go grind the asteroid belt into thousands of datacenters too, then on and on.
There's a whole yummy lightcone that awaits to be eaten :P
But the models are RAM limited, not compute limited, and there's no reason consumer devices need to keep their current RAM limits. Put 256 GB of RAM in your phone and an LLM may drain the battery in 15 minutes, and I have no idea about the bus bandwidth, but the NPU (e.g. the Neural Engine in Apple SoCs for the last few years) is already enough for the compute part of the problem.
I sincerely doubt that o3/2.5 Pro haven't been distilled. It's unimaginable to me that they're that price-insensitive (or, expressed inversely, that they were so thrifty in training that the final product can be served for consumer usage without optimization).
the only conclusion I can come to is that they're indeed not letting you access the "root" models.
That's absolutely getting distilled down for releases.
My apologies for not being able to find the original tale. I’m sure the original website is around but this is a decent synopsis regardless.
Doesn't look like they cover it in the article, but if I remember correctly they pruned the model down to fit on a 56K EPROM so it could originally be sold for $10 (also dating myself; this article claims $15).
And of course the jargon has changed with time. I guess we're saying "distilled" now; originally we said "pruned", because that's what you did: once you had your weights, you would prune the rest of the network to get the core model. I guess "distilled" works also, just less literal imho. I guess if we want to get really pedantic, networks exist in liquids too, but I digress.
[1] (apologies for the ad crap, best I could find) https://www.mentalfloss.com/article/22269/how-electronic-20-...
To the credit of the naysayers at the time, Hotmail was still the primary free email service and Gmail had yet to come out. Google was buying up the dark fiber and had yet to open up its excess compute, starting the arms race for the cloud. Most still thought of GPUs only for graphics, even though their architecture and intent had been there since their inception at Thinking Machines…
pruning: discarding low weight connections after training, makes the network sparser but also less regular (complications for memory layout, and compute kernels to access the sparse network weights).
distilling: take a large pretrained model, and train a smaller one from it, for example consider a cloze task (fill the blanked token in a sentence), then compute the probabilities using the large model, and train the smaller model to reproduce the same probabilities
distilling is a form of fitting into a smaller regular network, of potentially totally different architecture, while pruning is a form of discarding low weight coefficients resulting in a sparser network.
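A toy illustration of the difference in code (arbitrary shapes and thresholds, just to show where the two techniques act):

```python
import torch
import torch.nn.functional as F

# --- Pruning: zero out low-magnitude weights in the *same* network ---
layer = torch.nn.Linear(512, 512)
with torch.no_grad():
    mask = layer.weight.abs() > 0.02           # keep only the "important" connections
    layer.weight *= mask                        # same architecture, now sparse/irregular
print(f"kept {mask.float().mean():.1%} of weights")

# --- Distillation: train a *different, smaller* network to mimic the big one ---
teacher = torch.nn.Sequential(torch.nn.Linear(512, 2048), torch.nn.ReLU(), torch.nn.Linear(2048, 100))
student = torch.nn.Sequential(torch.nn.Linear(512, 128), torch.nn.ReLU(), torch.nn.Linear(128, 100))

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for _ in range(100):
    x = torch.randn(32, 512)                    # e.g. cloze-style inputs
    with torch.no_grad():
        target = F.softmax(teacher(x), dim=-1)  # probabilities from the large model
    loss = F.kl_div(F.log_softmax(student(x), dim=-1), target, reduction="batchmean")
    opt.zero_grad(); loss.backward(); opt.step()
```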
https://malted.ai/deepseek-and-the-future-of-distillation/
Honest question:
Isn't this exactly what the DeepSeek team did, and now Anthropic is repackaging it a year later, calling it “subliminal learning” or using the teacher and student analogy to take credit for work done by Chinese researchers?
It's like if China claimed they invented the Transformer by renaming it the “Pattern Matching architecture.”
Why is Anthropic doing this? Isn't this the same company that recently scraped 7 million books? And now they’re “transforming” research papers too?
No, distillation and student/teacher training is a well-known technique (much older than even the original ChatGPT), and Anthropic are not claiming to have invented it (that would be laughable to anyone familiar with the field). "Subliminal learning" is an observation by Anthropic about something surprising that can happen during the process: for sufficiently similar models, behaviour can be transferred from teacher to student that is not obviously present in the information passed between them (i.e. the text output by the teacher and used to train the student). For example, the student's "favourite animal" changed despite the fact that the teacher was only producing "random" numbers for the student to try to predict.
By "behaviour" they mean data and pattern matching, right? Alan Turing figured that out in the 1940s.
LLMs aren't black boxes doing voodoo, like we like to tell politicians and regulators. They're just software processing massive amounts of data to find patterns and predict what comes next. It looks magical, but it's maths and stats, not magic.
This post is just selling second-hand ideas. And for those of us outside the US who spend all day reading scientific papers, sorry Anthropic, we're not buying it.
That's like saying Da Vinci figured out heavier-than-air flight. Useful foundation, obviously smart and on the right track, still didn't actually do enough to get all the credit for that.
> It looks magical, but it's maths and stats, not magic.
People keep saying "AI isn't magic, it's just maths" like this is some kind of gotcha.
Turning lead into gold isn't the magic of alchemy, it's just nucleosynthesis.
Taking a living human's heart out without killing them, and replacing it with one you got out a corpse, that isn't the magic of necromancy, neither is it a prayer or ritual to Sekhmet, it's just transplant surgery.
And so on: https://www.lesswrong.com/posts/hAwvJDRKWFibjxh4e/it-isn-t-m...
Even with access to the numbers and mechanisms, the inner workings of LLMs are as clear as mud and still full of surprises. Anthropic's work was, to many people, one such surprise.
And plenty of software involves real lives, real bodies, and no second chances, e.g. Therac-25.
Unfortunately for all of us, it does look rather like people are already using clear-as-mud AI models for life-critical processes.
That said, I get your point, LLMs can be unpredictable because of the huge amount of data they're trained on and the quality of that data. You never really know what patterns they'll pick up or how they'll behave in edge cases, especially when the outputs aren't deterministic.
You think one of them is magic?
If not, you're being needlessly pedantic as well as wrong.
> But unlike surgery, code is testable.
Surgeries are tested. Practice sessions are made. Animal tests for the general idea, cadavers to learn about humans, models for specific patients.
And code is, sadly, often pushed live without testing. Kills people, even.
The distillation paper added minor parameter tweaks and had a fancier name, but the essence of the method came from Caruana et al.'s model compression paper: https://dl.acm.org/doi/abs/10.1145/1150402.1150464