But it seems the biggest model available is 10B? Somewhat unusual and does make me wonder just how challenging it will be to train any model in the 100B order of magnitude.
The key insight of the BitNet paper was that using their custom BitLinear layer instead of normal Linear layers (along with some other training and architecture changes) led to much, much better results than quantizing an existing model down to 1.58 bits. So you still end up doing a full training run in bf16 precision, just with the specially adapted model architecture.
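For the curious, the weight quantization inside a BitLinear-style layer can be sketched roughly like this — scale by the mean absolute weight, then round and clip to {-1, 0, 1}. This is a minimal sketch based on the absmean scheme described in the BitNet b1.58 paper; the function name and `eps` value are my own:

```python
import numpy as np

def absmean_ternary(W, eps=1e-5):
    # Sketch of BitNet b1.58-style weight quantization:
    # scale by the mean absolute value, round to nearest integer,
    # then clip into the ternary set {-1, 0, 1}.
    gamma = np.mean(np.abs(W)) + eps
    Wq = np.clip(np.rint(W / gamma), -1, 1)
    return Wq, gamma  # dequantize approximately as Wq * gamma
```

The point is that during training the full-precision weights `W` are kept around (in bf16), and this ternarization happens on the fly in the forward pass, which is why you can't get the same quality by quantizing an already-trained model after the fact.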
You'd still need full GPUs for training, but for inference the hardware would be orders of magnitude simpler than what Nvidia is making
If Microslop aren't gonna train the model themselves to prove their own thesis, why would others? They've had 2 years (I think?) to prove BitNet in at least some way, are you really saying they haven't tried so far?
Personally that makes it slightly worrisome to just take what they say at face value, why wouldn't they train and publish a model themselves if this actually led to worthwhile results?
But it doesn't mean the idea is worthless.
You could have said the same about Transformers: Google released the paper but didn't push it forward, and it turned out to be a great idea.
I don't think you can. Google looked at the research results and continued researching Transformers and related technologies, because they saw value in them, particularly for translation. The direction to take is part of the original paper — give it a read, it's relatively approachable for being a machine learning paper :)
Sure, it took OpenAI to make it into an "assistant" that answered questions, but it's not like Google was completely sleeping on the Transformer, they just had other research directions to go into first.
> But it doesn't mean, idea is worthless.
I agree, it isn't — hope that wasn't how my message read :) But ideas that don't actually pan out in reality are slightly less useful than ideas that do pan out once put into practice. The root commenter seems to be saying "This is a great idea, it's all ready, the only missing piece is for someone to do the training and it'll pan out!", which I'm a bit skeptical about, since it's been two years since they introduced the idea.
And I did not speak out
Because I was not using em dashes
Then they claimed that if you're crammar is to gud you r not hmuan
And I did not spek aut
Because mi gramar sukcs
Then they claimed that if you actually read the article that you are trying to discuss you are not human...
However, this user uses — in almost all of his posts, and he was posting at a rate of about 1 comment per minute across multiple different topics.
Edit: oh, just recalled dang restricted Show HNs the other day to only non-new users (possibly with some other thresholds). I wonder if word got out and some are filling accounts with activity.
I imagine you got 96gb because you thought you'd be running models locally? Did you not know the phrase Unified Memory is marketing speak?
In what way? On modern processors, a Fused Multiply-Add (FMA) instruction generally has the exact same execution throughput as a basic addition instruction
One bit or one trit? I am confused!
For something more practical, you can pack five three-state values within a byte because 3^5 = 243, which is smaller than 256. To unpack, you divide and modulo by 3 five separate times. This encodes data in bytes at 1.6 bits per symbol.
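That scheme can be sketched in a few lines — pack five trits into one byte in base 3, and unpack with five div/mod-by-3 steps (function names are my own):

```python
def pack5(trits):
    # trits: 5 values in {0, 1, 2}; pack base-3 into one byte
    assert len(trits) == 5 and all(t in (0, 1, 2) for t in trits)
    b = 0
    for t in reversed(trits):
        b = b * 3 + t
    return b  # at most 3**5 - 1 = 242, so it fits in a byte

def unpack5(b):
    # divide and modulo by 3 five times to recover the trits
    trits = []
    for _ in range(5):
        trits.append(b % 3)
        b //= 3
    return trits
```

Note the unpacking cost: five integer divisions per byte, which is exactly the overhead the denser encoding buys you.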
But the packing of 5 symbols into a byte was not done here. Instead, they packed 4 symbols into a byte to reduce computational complexity (no unpacking needed)
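The 4-per-byte layout gives each symbol a fixed 2-bit slot, so unpacking needs only shifts and masks, no division — a quick sketch (hypothetical helper names, one 2-bit code simply goes unused):

```python
def pack4(trits):
    # 4 ternary values in {0, 1, 2}, 2 bits each; code 0b11 is unused
    b = 0
    for i, t in enumerate(trits):
        b |= (t & 0b11) << (2 * i)
    return b

def unpack4(b):
    # shifts and masks only; cheaper than div/mod-by-3 unpacking
    return [(b >> (2 * i)) & 0b11 for i in range(4)]
```

So you trade density (2 bits/symbol instead of 1.6) for much cheaper decode on the hot path.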
>packed 4 symbols into a byte
microslop, typical bunch of two-bit frauds!
So it's not an inference framework for 1-bit models (two states per parameter) but for 1.58-bit models (three states per parameter). Annoying that they try to mix up the two.
demo shows a huge love for water, this AI knows its home
How? They can validate thousands if not millions of queries, but nothing prevents the million-and-first from being a hallucination. People who then pay extra for an "Encyclopaedia Britannica validated LLM" would, rightfully so IMHO, complain that "it" suggested cooking with a dangerous mushroom.
And I don't think that LLM could just Google or check Wikipedia.
But I do agree that this architecture makes a lot of sense. I assume it will become the norm to use such edge LLMs.
If you have an existing network, making an int4 quant is the better tradeoff. 1.58b quants only become interesting when you train the model specifically for it
On the other hand maybe it works much better than expected because llama3 is just a terrible baseline
My disappointment is immeasurable and my day is ruined.
A 100B ternary model packs to roughly 20-25GB (100B params at ~1.58 bits each). FP16 would be ~200GB, INT4 ~50GB. That difference is what moves the "doesn't fit" threshold. You go from needing HBM or multi-GPU NVLink to running on a workstation with 32GB DDR5.
DDR5 at ~100 GB/s is still much slower than HBM at ~3 TB/s, so memory bandwidth is still the inference bottleneck — but bandwidth is only a problem once the model actually fits. For 100B-class models, capacity was the harder constraint. That's what 1.58-bit actually solves.
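The arithmetic behind both paragraphs is quick to check — model footprint at each precision, plus the rough rule that bandwidth-bound decode streams the whole model once per token (the 100 GB/s DDR5 figure is the assumption from above):

```python
def model_gb(params, bits_per_param):
    # bytes = params * bits / 8; return in GB
    return params * bits_per_param / 8 / 1e9

P = 100e9                      # 100B parameters
fp16    = model_gb(P, 16)      # ~200 GB
int4    = model_gb(P, 4)       # ~50 GB
ternary = model_gb(P, 1.58)    # ~19.75 GB, fits in 32GB DDR5

# Bandwidth-bound decode: tokens/s ~= bandwidth / model size,
# since each generated token reads every weight once.
ddr5_bw = 100                  # GB/s, assumed workstation figure
tok_per_s = ddr5_bw / ternary  # ~5 tokens/s
```

So even once the model fits, ~100 GB/s of DDR5 only gets you single-digit tokens per second on a 100B ternary model, which is why bandwidth remains the next bottleneck.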
The engineering/optimization work is nice, but it's not what people have been waiting for, which is an answer to: can the BitNet idea, which seemed so promising, actually deliver in a competitive way?