Edit: this might be because I’ve got flash attention enabled in Ollama.
The default should be open and portable APIs, not needlessly furthering a hegemony that is detrimental to us all.
Who in the world is expected to populate 11 select/text fields with their favorite model data points they just happen to have lying around, only to see an absolutely meaningless "295% Inference" outcome?
What a dumpster
- No one trains big models in FP32 anymore.
- Gradients can also often be in BF16, and they don't actually have to be stored if you're not using gradient accumulation or if you're accumulating them directly in the optimizer's state.
- 32-bit Adam is silly; unless you have infinite VRAM, there's no reason not to use 8-bit Adam (or you can go even lower with quantized Muon).
- Activations? They take up memory too, but are not mentioned.
It shows that to train a 3.77B-parameter model I need 62GB of VRAM. Just to give you some perspective on how overestimated this is: a few weeks back I was training (full fine-tuning, not LoRA) a 14B-parameter model on 24GB of VRAM, using every trick in the book to lower VRAM usage. To be fair, not all of those tricks are available in publicly available training harnesses, but the point still stands: even with an off-the-shelf harness you can do a lot better than what this calculator suggests.
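For reference, here's the kind of back-of-the-envelope budget I have in mind; a rough sketch under illustrative assumptions (BF16 weights and gradients, an 8-bit Adam along the lines of bitsandbytes' Adam8bit, activations left out because they depend on batch size, sequence length, and checkpointing):

```python
# Rough VRAM budget for full fine-tuning a 3.77B-parameter model.
# Assumptions (illustrative, not exact): BF16 weights, BF16 gradients,
# 8-bit Adam (two 1-byte moment buffers per parameter); activations and
# small overheads (quantization constants, temp buffers) are ignored.
n_params = 3.77e9

weights_gb = n_params * 2 / 1e9   # BF16: 2 bytes/param           ~7.5 GB
grads_gb   = n_params * 2 / 1e9   # BF16 gradients                ~7.5 GB
optim_gb   = n_params * 2 / 1e9   # 8-bit Adam: 2 x 1 byte/param  ~7.5 GB

total_gb = weights_gb + grads_gb + optim_gb
print(f"weights {weights_gb:.1f} GB + grads {grads_gb:.1f} GB "
      f"+ optimizer {optim_gb:.1f} GB = ~{total_gb:.1f} GB")
# ~23 GB before activations, and the gradient buffer can disappear entirely
# if you fuse the update into the backward pass -- a far cry from 62 GB.
```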
- You'll get something similar to GPT-2.
- To approach the scale of modern LLMs, you'll need about 10x more than all the GPUs in the world.
It's a neat abstraction to consider these the same, but do you think Meta is paying $100M for writing a 15-line script?
Meta is paying the big bucks because to train a big LLM in a reasonable time you need *scale*. But the process itself is the same as full fine-tuning, just scaled up across many GPUs. If I were patient enough to wait a few years or decades for my single GPU to chug through 15 trillion tokens, then I too could train a Llama from scratch (assuming I fed it the same training data).
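To spell out what I mean by "the same process", here's a toy sketch of that loop; the stand-in model and random token ids are my own placeholders, not anyone's real setup:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy next-token training loop. Pretraining starts from random weights;
# full fine-tuning loads pretrained weights first. The loop itself is identical.
vocab, dim, seq = 32_000, 256, 128                  # illustrative sizes
model = nn.Sequential(nn.Embedding(vocab, dim),     # stand-in for a real transformer
                      nn.Linear(dim, vocab))
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(100):
    tokens = torch.randint(0, vocab, (8, seq + 1))  # fake batch of token ids
    logits = model(tokens[:, :-1])                  # predict the next token
    loss = F.cross_entropy(logits.reshape(-1, vocab), tokens[:, 1:].reshape(-1))
    loss.backward()
    opt.step()
    opt.zero_grad()
# "Training Llama" is this same loop with a real transformer, ~15T real tokens,
# and the work sharded across tens of thousands of GPUs.
```

And the time estimate isn't hyperbole: by the usual ~6·N·D FLOPs rule of thumb, pushing 15 trillion tokens through even a few-billion-parameter model is on the order of 10^23 FLOPs, i.e. years of nonstop compute for a single GPU.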
chlobunnee•1d ago
It helps compare GPU options by taking in simple parameters (# of transformer layers, token size, etc.) and letting users know which GPUs are compatible + their efficiency for training vs. inference.
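To give a feel for the core check, here's a toy version; the GPU specs and the example workload below are illustrative placeholders, not the site's actual data or heuristics:

```python
# Toy sketch: compare a workload's estimated memory footprint against
# each GPU's VRAM. Specs and the example number are placeholders only.
GPUS_GB = {"RTX 4090": 24, "A100 80GB": 80, "H100 80GB": 80}

def compatible_gpus(workload_gb: float) -> list[str]:
    """Return the GPUs whose VRAM covers the estimated workload."""
    return [name for name, vram in GPUS_GB.items() if vram >= workload_gb]

print(compatible_gpus(30.0))  # e.g. a ~30 GB job -> ['A100 80GB', 'H100 80GB']
```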
The idea came from talking with ML researchers frustrated by slow cluster queues or wasting money on overkill GPUs.
I'd love feedback on what you feel is missing/confusing!
Some things I'm thinking about incorporating next:
- Allowing users to directly compare 2 GPUs and their specs
- Allowing users to see whether a fraction of a GPU can complete their workload
I would really appreciate your thoughts/feedback! Thanks!