That said, fine-tuning small models because you have to power through vast amounts of data where a larger model would be cost-ineffective -- that's completely sensible, and not really mentioned in the article.
...which I thought was arguably the most popular use case for fine tuning these days.
Mostly referred to as model distillation, but I give the author the benefit of the doubt that they didn't mean that.
Is that true though? I don't think I've seen a vendor selling that as a benefit of fine-tuning.
If anything, I expect fine-tuning to destroy knowledge (and reasoning), which hopefully (if you did your fine-tuning right) is not relevant to the particular context you are fine-tuning for.
1) "excel at a particular task"
2) "train on proprietary or sensitive data"
3) "Complex domain-specific tasks that require advanced reasoning", "Medical diagnosis based on history and diagnostic guidelines", "Determining relevant passages from legal case law"
4) "The general idea of fine-tuning is much like training a human in a particular subject, where you come up with the curriculum, then teach and test until the student excels."
Don't all these effectively inject new knowledge? It may happen through simultaneous destruction of some existing knowledge but that isn't obvious to non-technical people.
OpenAI's analogy of training a human in a particular subject until they excel even arguably excludes the possibility of destruction because we don't generally destroy existing knowledge in our minds to learn new things (but some of us may forget the older knowledge over time).
I'm a dev with hand-waving level of proficiency. I have fine-tuned self-hosted small LLMs using PyTorch. My perception of fine-tuning is that it fundamentally adds new knowledge. To what extent that involves destruction of existing knowledge has remained a bit vague.
My hand-waving solution if anyone pointed out that problem would be to 1) say that my fine-tuning data will include some of the foundational knowledge of the target subject to compensate for its destruction and 2) use a gold standard set of responses to verify the model after fine-tuning.
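To make that gold-standard check concrete, here's a rough sketch of what I mean, with the model path, prompts, and expected keywords as pure placeholders:

    # Sketch: verify a fine-tuned model against a held-out gold-standard set.
    # The model path and GOLD_SET contents are placeholders.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained("./my-finetuned-model")
    tokenizer = AutoTokenizer.from_pretrained("./my-finetuned-model")

    GOLD_SET = [
        {"prompt": "Summarize clause 4.2 of the service agreement.",
         "expected_keywords": ["termination", "30 days"]},
        # ... more held-out examples the model was never fine-tuned on
    ]

    passed = 0
    for example in GOLD_SET:
        inputs = tokenizer(example["prompt"], return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=128)
        answer = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        # Crude check: every expected keyword must appear in the answer.
        if all(kw.lower() in answer.lower() for kw in example["expected_keywords"]):
            passed += 1

    print(f"gold-set pass rate: {passed / len(GOLD_SET):.0%}")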
I for one found the article quite valuable for pointing out the problem and suggesting better approaches.
Also... "LoRA" as a replacement for finetuning??? LoRA is a kind of finetuning! In the research community it's actually referred to as "parameter efficient finetuning." You're changing a smaller number of weights, but you're still changing them.
RAG is more oriented to temporary and variable situations.
In addition, LoRA is also a fine-tuning technique, and this is stated in their paper.
Also, the basic premise that knowledge injection is a bad use-case seems flawed? There are countless open models released by Google that completely fly in the face of this. MedGemma is just Gemma 3 4B fine-tuned on a ton of medical datasets, and it's measurably better than stock Gemma within the medical domain. Maybe it lost some ability to answer trivia about Minecraft in the process, but isn't that kinda implied by "fine-tuning" something? You're making it purpose-built for a specific domain.
He's proposing alternatives he thinks are superior. He might well be right, too; I don't have a horse in the race, but LoRA seems like a more satisfying approach to getting a result than retraining the model, and giving LLMs tools seems to be proving more effective too.
So more imperfect is better?
Of course the model's billions of parameters leave a vast space of directions for improvement. But what circuitous path is that, which the original training didn't already find?
You can’t find it by definition if you don’t include all the original data with the tuning data. You have radically changed the optimization surface with no contribution from the previous data at all.
The one use case that makes sense is sacrificing functionality to get better at a narrow problem.
You are correct about that.
Now as far as how fine-tuning affects model performance, it is pretty simple: improves fit on the fine-tuning data, decreases fit on original training corpus. Beyond that, yeah, it is hard to say if fine-tuning will help you solve your problem. My experience has been that it always hurts generalization, so if you aren't getting reasonable results with a base or chat-tuned model, then fine-tuning further will not help, but if you are getting results then fine-tuning will make it more consistent.
"SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT SELECT ..."
I've hit this with gemini-2.0-flash and changing the prompt ever so slightly seems to make things work, just to break it at other input.
Andrej's 2019 blog post laments some of the reasons why it is hard, and I can relate to a lot of it - https://karpathy.github.io/2019/04/25/recipe
The biggest mistake I see people making is this quote from the blog: "a 'fast and furious' approach to training neural networks does not work and only leads to suffering"
I'll probably write more about it in a few months...
Fitting a LoRA changes potentially useful information the same way that fine-tuning the whole model does. It's just that the LoRA restricts the expressiveness of the weight update so that it is compactly encoded.
We’ve fine-tuned open weight models for knowledge-injection, among other things, and get a model that’s better than OpenAI models at exactly one hyper specific task for our use case, which is hardware verification. Or, fine-tuned the OAI models and get significantly better OAI models at this task, and then only use them for this task.
The point is that a network of hyper-specific fine-tuned models is how a lot of stuff is implemented. So I disagree from direct experience with the premise that fine-tuning is a waste of time because it is destructive.
I don’t care if I “damage” Llama so that it can’t write poetry, give me advice on cooking, or translate to German. In this instance I’m only ever going to prompt it with: “Does this design implement the AXA protocol? <list of ports and parameters>”
The whole point of base models is to be general purpose, and fine tuned models to be tuned for specific tasks using a base model.
Really interested in the idea though! The dream is that you have your big, general base model, then a bunch of LoRA weights for each task you've tuned on, where you can load/unload just the changed weights and swap the models out super fast on the fly for different tasks.
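That is roughly what the peft library already supports, as I understand it; here's a rough sketch (the base model and adapter paths are hypothetical, and I'm assuming both adapters were trained against the same base):

    # Sketch: keep one base model in memory and swap LoRA adapters per task.
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

    # Load the first adapter under a name.
    model = PeftModel.from_pretrained(base, "./adapters/sql-task", adapter_name="sql")
    # Load a second adapter into the same wrapper.
    model.load_adapter("./adapters/report-task", adapter_name="report")

    model.set_adapter("sql")     # route requests through the SQL adapter
    # ... serve SQL-related requests ...
    model.set_adapter("report")  # switch tasks without reloading the base weights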
Finetuning is good for, like you said, doing things a particular way, but that's not the same thing as being good at knowledge injection and shouldn't be considered as such.
It’s also much easier to prevent a RAG pipeline from generating hallucinated responses. You cannot finetune that out of a model.
> write poetry, give me advice on cooking, or translate to German
Where finetuning makes less sense is doing it merely to get a model up to date with changes in some library, or to teach it a new library it did not know, or, even worse, your codebase. I think this is what OP talks about.
It looked to me like the author did know that. The title only says "Fine-tuning", but immediately in the article he talks about Fine-tuning for knowledge injection, in order to "ensure that their systems were always updated with new information".
Fine-tuning to help it not make the stupid mistake that it makes 10% of the time no matter what instructions you give it is a completely different use case.
Because I have been working on replacing multiple humans handling complex business processes mostly end-to-end (with human in the loop somehow in there).
I find that I need the very best models to be able to handle a lot of instructions and make the best decisions about tool selection. And overall I just need the most intelligence possible to make fewer weird errors or misinterpretations of the instructions or situations/data.
I can see how fine tuning would help for some issues like some report formatting. But that output comes at the end of the whole process. And I can address formatting issues almost instantly by either just using a smarter model that follows instructions better, or adding a reminder instruction, or creating a simpler subtask. Sometimes the subtask can run on a cheaper model.
So it's kind of like the difference between building a traditional manufacturing line with very specific robot arms, tooling, and conveyor belts, versus plugging in just a few different humanoid robots with assembly manuals and access to more general-purpose tools on their belt. You used to always have to build the full traditional line. In many cases that doesn't necessarily make sense anymore.
I expect it would greatly help characterize what was lost, at the expense of a great deal of extra computation. But with enough experiments might shed some more general light.
I suspect the smaller the tuning dataset, the faster and worse the overwriting will be, since the new optimization surface will be so much simpler to navigate than the much bigger dataset's optimization surface.
Then a question might be, what percentage of the original training data, randomly retained, might slow general degradation.
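One cheap way to probe that would be to sweep the retained fraction and measure degradation on a general benchmark at each point. A toy sketch of just the mixing step, with made-up stand-in data and the actual training/eval omitted:

    import random

    # Stand-ins for your fine-tuning set and a sample of the original corpus.
    tuning_data = [f"domain example {i}" for i in range(1000)]
    original_corpus_sample = [f"general example {i}" for i in range(5000)]

    def build_mixed_dataset(tuning_examples, original_sample, retain_fraction):
        """Mix retain_fraction x len(tuning_examples) general examples back into
        the fine-tuning data as a crude replay buffer against forgetting."""
        n_replay = int(len(tuning_examples) * retain_fraction)
        replay = random.sample(original_sample, min(n_replay, len(original_sample)))
        mixed = tuning_examples + replay
        random.shuffle(mixed)
        return mixed

    for fraction in (0.0, 0.1, 0.25, 0.5, 1.0):
        mixed = build_mixed_dataset(tuning_data, original_corpus_sample, fraction)
        # fine-tune on `mixed` and measure general-benchmark degradation here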
What people expect from finetuning is knowledge addition. You want to keep the styling[1] of the original model and just add new knowledge points that would help your task. In-context learning is one example of how this works well. But even here, if the context is out of distribution, a model does not "understand" it and will produce guesswork.
When it comes to LoRA or PEFT or adapters, it's about style transfer. And if you focus on a specific style of content, you will see the gains, just that the model won't learn new knowledge that wasn't already in the original training data. It will forget previously learnt styles depending on context. When you do full finetuning (or SFT with no frozen parameters), it will alter all the parameters, and that results in a gain of new knowledge at the cost of previous knowledge (and it would give you some gibberish if you ask about topics outside the domain). This is called catastrophic forgetting. Hence, yes, full finetuning works -- just that it is an imperfect solution like all the others. Recently, with reinforcement learning, there has been talk of continual learning, which is where Richard Sutton's latest paper also lands, but that's at the research level.
Having said all that, if you start with the wrong mental model for Finetuning, you would be disappointed with the results.
The problem to solve is about adding new knowledge while preserving the original pretrained intelligence. Still a work in progress, but we published a paper last year on one way it could be done. Here is the link: https://arxiv.org/abs/2409.17171 (it also has experimental results for all the different approaches).
[1]: Styling here means the style learned by the model in SFT. Eg: Bullets, lists, bolding out different headings etc. all of that makes the content readable. The understanding of how to present the answer to a specific question.
Changes that happened:
1. LLMs got a lot cheaper but fine tuning didn't. Fine tuning was a way to cut down on prompts and make them 0 shot (not require examples)
2. Context windows became bigger. Fine tuning was great when the model was expected to respond with a sentence.
3. The two things above made RAG viable.
4. Training got better on released models, to the point where 0-shot worked fine. Fine tuning ends up overriding these things that were scoring nearly full points on benchmarks.
This highlights to me that the author doesn't know what they're talking about. LoRA does exactly the same thing as normal fine-tuning, it's just a trick to make it faster and/or be able to do it on lower end hardware. LoRA doesn't add "isolated subnetworks" - LoRA parameters are added to the original weights!
Here's the equation for the forward pass from the original paper[1]:
h = W_{0} * x + ∆W * x = W_{0} * x + B * A * x
where "W_{0}" are the original weights and "B" and "A" (which give us "∆W_{x}" after they're multiplied) are the LoRA adapter. And if you've been paying attention it should also be obvious that, mathematically, you can merge your LoRA adapter into the original weights (by doing "W = W_{0} + ∆W") which most people do, or you could even create a LoRA adapter from a fully fine-tuned model by calculating "W - W_{0}" to get ∆W and then do SVD to recover B and A.If you know what you're doing anything you can do with LoRA you can also do with full-finetuning, but better. It might be true that it's somewhat harder to "damage" a model by doing LoRA (because the parameter updates are fundamentally low rank due to the LoRA adapters being low rank), but that's a skill issue and not a fundamental property.
This made me laugh.
You seem like you may know something I've been curious about.
I'm a shader author these days, haven't been a data scientist for a while, so it's going to distort my vocab.
Say you've got a trained neural network living in a 512x512 structured buffer. It's doing great, but you get a new video card with more memory so you can afford to migrate it to a 1024x1024. Is the state of the art way to retrain with the same data but bigger initial parameters, or are there other methods that smear the old weights over a larger space to get a leg up? Anything like this accelerate training time?
... can you upsample a language model like you can low-res anime profile pictures? I wonder what the made-up words would be like.
You have to be careful about the "same data" part though; ideally you want to train once on unique data[2] as excessive duplication can harm the performance of the model[3], although if you have limited data a couple of training epochs might be safe and actually improve the performance of the model[4].
[1] -- https://arxiv.org/abs/2312.15166
[2] -- https://arxiv.org/abs/1906.06669
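To make the "smear the old weights over a larger space" idea concrete, here's one naive warm-start for a single linear layer, in the spirit of function-preserving expansion (a sketch, not a claim about what's state of the art):

    import torch
    import torch.nn as nn

    def widen_linear(old: nn.Linear, new_in: int, new_out: int) -> nn.Linear:
        """Warm-start a bigger linear layer from a smaller trained one by copying
        the old weights into the top-left block and near-zeroing the rest, so the
        new connections contribute almost nothing until training moves them."""
        new = nn.Linear(new_in, new_out)
        with torch.no_grad():
            new.weight.mul_(1e-3)
            new.bias.zero_()
            new.weight[:old.out_features, :old.in_features] = old.weight
            new.bias[:old.out_features] = old.bias
        return new

    old_layer = nn.Linear(512, 512)   # stand-in for the trained 512x512 network
    big_layer = widen_linear(old_layer, 1024, 1024)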
You wrote "exactly", so I'm going to say "no". To clarify what I mean: LoRA seeks to accomplish a similar goal as "vanilla" fine-tuning but with a different method (freezing existing model weights while adding adapter matrices whose product gets added to the original). LoRA isn't exactly the same mathematically either; it is a low-rank approximation (as you know).
> LoRA doesn't add "isolated subnetworks"
If you think charitably, the author is right. LoRA weights are isolated in the sense that they are separate from the base model. See e.g. https://www.vellum.ai/blog/how-we-reduced-cost-of-a-fine-tun... "The end result is we now have a small adapter that can be added to the base model to achieve high performance on the target task. Swapping only the LoRA weights instead of all parameters allows cheaper switching between tasks. Multiple customized models can be created on one GPU and swapped in and out easily."
> you can merge your LoRA adapter into the original weights (by doing "W = W_{0} + ∆W") which most people do
Yes, one can do that. But on what basis do you say that "most people do"? Without having collected a sample of usage myself, I would just say this: there are many good reasons to not merge (e.g. see link above): less storage space if you have multiple adapters, easier to swap. On the other hand, if the extra adapter slows inference unacceptably, then don't.
> This highlights to me that the author doesn't know what they're talking about.
It seems to me you are being some combination of: uncharitable, overlooking another valid way of reading the text, being too quick to judge.
No, the author is objectively wrong. Let me quote the article and clarify myself:
> Fine-tuning advanced LLMs isn’t knowledge injection — it’s destructive overwriting. [...] When you fine-tune, you risk erasing valuable existing patterns, leading to unexpected and problematic downstream effects. [...] Instead, use modular methods like [...] adapters.
This is just incorrect. LoRA is exactly like normal fine-tuning here in this particular context. The author's argument is that you should do LoRA because it doesn't do any "destructive overwriting", but in that aspect it's no different than normal fine-tuning.
In fact, there's evidence that LoRA can actually make the problem worse[1]:
> we first show that the weight matrices trained with LoRA have new, high-ranking singular vectors, which we call intruder dimensions [...] LoRA fine-tuned models with intruder dimensions are inferior to fully fine-tuned models outside the adaptation task’s distribution, despite matching accuracy in distribution.
[1] -- https://arxiv.org/pdf/2410.21228
To be fair, "if you don't know what you're doing then doing LoRA over normal finetuning" is, in general, a good advice in my opinion. But that's not what the article is saying.
> But on what basis do you say that "most people do"?
On the basis of seeing what the common practice is, at least in the open (in the local LLM community and in the research space).
> I would just say this: there are many good reasons to not merge
I never said that there aren't good reasons to not merge.
> It seems to me you are being some combination of: uncharitable, overlooking another valid way of reading the text, being too quick to judge.
No, I'm just tired of constantly seeing a torrent of misinformation from people who don't know much about how these models actually work nor have done any significant work on their internals, yet try to write about them with authority.
Make sure the new training dataset is "large" by augmenting it with general data (think of it as a sample of the original dataset), use PEFT techniques (freezing weights => less risk), and use regularization (elastic weight consolidation).
Fine-tuning is fine, but it will be more expensive than you thought and should be led by more experienced ML engineers. You probably don't need to fine-tune models anyway.
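For the elastic weight consolidation part: the penalty is just a quadratic term anchoring parameters to their pre-fine-tuning values, weighted by a per-parameter importance estimate (the Fisher information). A minimal sketch of the extra loss term, assuming you've already computed the Fisher estimates on a sample of the original data:

    import torch

    def ewc_penalty(model, ref_params, fisher, lam=1.0):
        """Elastic weight consolidation: penalize movement away from the
        pre-fine-tuning parameters, weighted by estimated importance."""
        loss = torch.zeros((), device=next(model.parameters()).device)
        for name, p in model.named_parameters():
            if name in fisher:
                loss = loss + (fisher[name] * (p - ref_params[name]) ** 2).sum()
        return lam / 2.0 * loss

    # During fine-tuning: total_loss = task_loss + ewc_penalty(model, ref_params, fisher)
    # where ref_params is a frozen copy of the pretrained weights and fisher holds
    # per-parameter importance estimates.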
Mainly including this article to spark discussion—I agree with some of this and not with all of it. But it is an interesting take.