Just wondering aloud --
Is there, by any chance, a tutorial/explainer that a beginner could use to follow along and learn how this is done?
The training dataset is very small, only including fashion-related pictures: https://github.com/yousef-rafat/miniDiffusion/tree/main/data...
Could you clarify what you mean by this part -- if the weights are taken from HF then what's the implementation for?
Anyway, I may try to train it on a limited, specialized dataset...
Can you be a bit more specific here? It's not clear what such a token is, what it takes to get one, or where it would be placed in get_checkpoints.py.
An API token from Hugging Face
> what it takes to get one
You generate them in your Hugging Face account
> where it would be placed in get_checkpoints.py.
Line 59, in the empty quotes where it says token = ""
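For reference, that spot looks roughly like this (I'm paraphrasing the surrounding code, so details may differ from the repo):

    # get_checkpoints.py, around line 59: paste your Hugging Face access token.
    # Tokens are created under Settings -> Access Tokens on huggingface.co;
    # a read-scoped token is typically enough to download the checkpoints.
    token = ""  # e.g. a string starting with "hf_"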
That's the kind of thing that, stylistically speaking, it's good to define at the very top of the module.
The minRF project is a very easy way to get started training small diffusion models with rectified flow: https://github.com/cloneofsimo/minRF
Also, the reference implementation of SD 3.5 is actually minimalistic too: https://github.com/Stability-AI/sd3-ref
For example, OpenAI's tokenizer for CLIP is buggy (https://github.com/huggingface/transformers/issues/27961). It's a reference implementation, it isn't the one they used for training, and the problems with it go unsolved and get copied endlessly by other projects.
What about Flux? They don't say its reference code was used for training, and it wasn't; it has bugs that break cudagraphs or similar, which aren't that impactful. On the other hand, it uses the CLIP reference, and the CLIP reference is buggy, so this is buggy too...
However, the keyword here is training / inference divergence. Unfortunately, nobody is going to spend multiple millions to retrain a model, so our reimplementation needs to be bug-for-bug correct to use the trained weights properly. That's why reference implementations are essential: they come from the original model trainers, so they are your best bet at matching the training code.
To give you some concrete examples of bugs we need to maintain:
1. In SDXL, they use OpenClipG for text encoding but wrongly use 0 as the padding token (which corresponds to the symbol "!"), whereas OpenClipG's own training used the endoftext token for padding. However, if you switch SDXL to pad with the endoftext token, the training / inference divergence means you get subpar generated images.
2. In FLUX, we mainly use T5 as the text encoder. However, T5 is usually used as an encoder with an attention mask over the padded input, to limit the influence of the padding tokens. In FLUX, we don't apply a mask for T5 text encoding, so the padding tokens intuitively take more effect than they should. Again, "fix" this bug without retraining and you get subpar generated images (see the sketch below).
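To make point 2 concrete, here's a rough sketch of the masked-vs-unmasked difference. It uses the small public t5-small checkpoint as a stand-in, not the actual FLUX text encoder, and a made-up prompt; the point is only that the real tokens' embeddings change when padding is left unmasked:

    import torch
    from transformers import T5EncoderModel, T5Tokenizer

    tok = T5Tokenizer.from_pretrained("t5-small")
    enc = T5EncoderModel.from_pretrained("t5-small").eval()

    # Pad a short prompt out to a fixed length, as diffusion pipelines do.
    batch = tok(["a photo of a cat"], padding="max_length", max_length=77,
                return_tensors="pt")

    with torch.no_grad():
        # Masked: padding tokens are excluded from attention.
        masked = enc(input_ids=batch.input_ids,
                     attention_mask=batch.attention_mask).last_hidden_state
        # Unmasked (FLUX-style): padding tokens participate in attention.
        unmasked = enc(input_ids=batch.input_ids).last_hidden_state

    # The embeddings of the real tokens differ between the two calls, so a model
    # trained on one variant has to be run with the same variant at inference.
    print((masked - unmasked).abs().max().item())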
There are many examples like this; some are easier to fix than others (HiDream uses an ODE solver different from what we usually do for rectified flow, so you need to negate its prediction to be compatible with existing samplers, but that one is "easier to fix").
TL;DR: Yes, there are bugs in the software, but we are better off maintaining bug-for-bug compatibility than trying to "fix" them. Hence the importance of a "done" reference implementation, rather than the usual "active" implementation the software industry otherwise favors.
(I maintain the most complete reimplementation of SoTA media generation models in Swift: https://github.com/drawthingsai/draw-things-community/tree/m.... So I tend to think that I know a thing or two about "reimplementation from scratch".)
The code outlining the network vs. the resultant weights. (Also vs. any training, inference, fine tuning, misc support code, etc.)
The theoretical diagram of how the network and its modules are connected is math. But an implementation of that in code is copyrightable.
Afaik, the weights are still a grey area. Whereas code is code and is copyrightable.
Weights are not produced by humans. They are the result of an automated process and are not afforded copyright protection. But this hasn't been tested in court.
If OpenAI's GPT-4o weights leaked, I think the whole world could use them for free. You'd just have to write the code to run them yourself.
Then there are the hyperparameters, which also need to be known to use the weights with the model architecture.
Code is copyrightable and math is not. What about 'architecture'?
Has this actually been tested yet? Or are we still at the stage of AI companies trying to pretend this into reality?
And there's been a recent surge in abstractions over pytorch, and even standalone packages for models that you are just expected to import and use as an API (which are very useful, don't get me wrong!). So it's nice to see an implementation that doesn't have 10 different dependencies that each abstract over something pytorch does.
Andrej Karpathy did exactly that, and I think it’s quite interesting.
I have studiously avoided making models, though I've been adjacent to their output for years now... I think the root of my confusion is that I kinda assumed there were already PyTorch-based scripts for inference / training. (I assumed _at least_ inference scripts were released with models, and kinda figured fine-tuning / training ones were too.)
So then I'm not sure if I'm just looking at a clean room / dirty room rewrite of those. Or maybe everyone is using "PyTorch" but it's usually calling into CUDA/C/some proprietary thingy that is much harder to grok than a pure PyTorch impl?
Anyways, these aren't great guesses, so I'll stop myself here. :)
This package basically runs the model (inference) and maybe fine-tunes it using the existing weights. A great way to learn, but it could still run into the same licensing issues.
I thought the community license stuff was about keeping people from using it in prod and charging for it without Stability getting at least a small taste.
This sucks.
I haven't been keeping up with the gooner squad on Civit, but I did have some sense that SD was less popular. I thought that was just because 3.5 came far too long after Flux, with too little (if any) quality increase to be worth building new scaffolding for.
They don't want you finetuning it in specific ways that might make them look bad by association.
> with minimal dependencies
I haven't tried running SD 3.5 specifically, but it's built on Hugging Face libraries, which I personally always find to be a mess of dependencies that makes it really hard to set up without the exact configuration the original developers used (which is often not provided in enough detail to actually work). That makes it pretty hard to run certain models, especially a few months or years after the original release.
For example, this appears to be the requirements file for the Stability AI reference implementation of SD3.5: no versions are specified, and it includes "transformers", which is just an enormous library.
https://github.com/Stability-AI/sd3.5/blob/main/requirements...
IIRC it is written in an abstraction layer that supports a transformers-like API surface. This also makes it opaque to figure out what you're actually passing to the model, adding a Python dep mess on top of that...woo boy.
edit: I'll also add that pytorch still has one oddity on Apple silicon, which is that it considers each tensor to be 'owned' by a particular device, either a CPU or a GPU. Macs have unified memory, but pytorch will still do a full copy when you 'move' data between the CPU and GPU because it just wasn't built for unified memory.
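A tiny sketch of what I mean (assumes a Mac where the MPS backend is available):

    import torch

    x = torch.randn(1024, 1024)              # allocated by the CPU backend
    if torch.backends.mps.is_available():
        y = x.to("mps")                      # full copy into a separate MPS buffer,
                                             # even though memory is physically unified
        print(x.device, y.device)            # cpu mps:0
        y[0, 0] = 42.0
        print(x[0, 0].item())                # unchanged: x and y don't share storage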
self.q = nn.Linear(embed_size, embed_size, bias = False)
self.k = nn.Linear(embed_size, embed_size, bias = False)
self.v = nn.Linear(embed_size, embed_size, bias = False)
Try self.qkv = nn.Linear(embed_size, 3*embed_size, bias = False)
def forward(self, x):
    ...
    # one matmul instead of three, then split the result back into q, k, v
    q, k, v = self.qkv(x).chunk(3, dim=-1)
tl;dr: the three linear layers act on the same input, so they can be stacked into one bigger matmul
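A quick standalone check that the fused projection matches the three separate ones when its weight is the concatenation of the three (the sizes here are arbitrary):

    import torch
    import torch.nn as nn

    embed_size = 64
    x = torch.randn(2, 10, embed_size)

    q_proj = nn.Linear(embed_size, embed_size, bias=False)
    k_proj = nn.Linear(embed_size, embed_size, bias=False)
    v_proj = nn.Linear(embed_size, embed_size, bias=False)

    fused = nn.Linear(embed_size, 3 * embed_size, bias=False)
    with torch.no_grad():
        # nn.Linear stores weights as (out_features, in_features), so stacking
        # the three weight matrices along dim 0 reproduces all three projections.
        fused.weight.copy_(torch.cat([q_proj.weight, k_proj.weight, v_proj.weight], dim=0))

    q, k, v = fused(x).chunk(3, dim=-1)
    print(torch.allclose(q, q_proj(x), atol=1e-6),
          torch.allclose(k, k_proj(x), atol=1e-6),
          torch.allclose(v, v_proj(x), atol=1e-6))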