https://www.thelocal.se/20221125/swedish-word-of-the-day-bam...
Also, they sell Bamba at Trader Joe’s now.
[1] https://www.jacionline.org/article/S0091-6749(08)01698-9/ful...
For example, you could never fill in the last chapter of any good book without knowledge of every previous chapter. Not highly detailed knowledge, but knowledge nonetheless.
OTOH if you had to remember a phone number to write it down, how does that differ?
As for SSMs - I think they compress the model's memory state way too much. Mixed global/local attention layers do just as well. And sparse/block attention seems like a much more promising way forward (https://arxiv.org/abs/2502.11089).
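A rough sketch of the mixed global/local pattern I mean (toy PyTorch, not from any of these papers; the window size and the one-global-layer-in-four ratio are made up):

    import torch
    import torch.nn.functional as F

    def attention(q, k, v, mask):
        # q, k, v: (batch, heads, seq, dim); mask: (seq, seq) bool, True = may attend
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
        scores = scores.masked_fill(~mask, float("-inf"))
        return F.softmax(scores, dim=-1) @ v

    def window_mask(seq_len, window):
        # causal attention limited to the previous `window` tokens;
        # window == seq_len gives ordinary global causal attention
        idx = torch.arange(seq_len)
        return (idx[:, None] >= idx[None, :]) & (idx[:, None] - idx[None, :] < window)

    seq_len, window = 16, 4
    x = torch.randn(1, 2, seq_len, 8)          # stand-in for q = k = v
    for layer in range(8):
        is_global = (layer % 4 == 3)           # every 4th layer attends globally
        mask = window_mask(seq_len, seq_len if is_global else window)
        x = attention(x, x, x, mask)

The point is that most layers only ever look at a short local window (cheap), while the occasional global layer keeps long-range information reachable without a compressed recurrent state.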
Yet all current models still suck above 32k. (Yes, some can do needle-in-a-haystack fine, but they still fail at anything even slightly more complex over a long context.)
32k is still much more than humans can manage, though, so I agree with you that it gives them some kind of superhuman ability over moderately long contexts; they are just still disappointingly bad over longer ones.
None of these modern recurrent architectures have a way to do this.
MLA is probably the closest thing, sitting somewhere in between the two.
More recently, hybrid architectures that utilize attention plus other operators are gaining traction.
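To make that concrete, here is a toy hybrid stack (assumptions: tiny dimensions, a gated elementwise recurrence standing in for a real SSM/Mamba block, and a made-up 3:1 ratio of recurrent to attention layers):

    import torch
    import torch.nn as nn

    class ToyRecurrentBlock(nn.Module):
        """Stand-in for an SSM layer: fixed-size state updated once per token."""
        def __init__(self, dim):
            super().__init__()
            self.decay = nn.Parameter(torch.rand(dim))   # per-channel state decay
            self.inp, self.out = nn.Linear(dim, dim), nn.Linear(dim, dim)

        def forward(self, x):                            # x: (batch, seq, dim)
            a = torch.sigmoid(self.decay)                # keep decay in (0, 1)
            state = torch.zeros(x.shape[0], x.shape[2])
            ys = []
            for t in range(x.shape[1]):                  # O(1) state, O(seq) time
                state = a * state + self.inp(x[:, t])
                ys.append(self.out(state))
            return torch.stack(ys, dim=1)

    class AttnBlock(nn.Module):
        def __init__(self, dim, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, x):                            # full causal attention
            L = x.shape[1]
            mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
            return self.attn(x, x, x, attn_mask=mask)[0]

    dim = 32
    layers = nn.ModuleList(
        [ToyRecurrentBlock(dim), ToyRecurrentBlock(dim),
         ToyRecurrentBlock(dim), AttnBlock(dim)] * 2
    )
    x = torch.randn(2, 16, dim)
    for layer in layers:
        x = x + layer(x)                                 # residual connections

Most of the depth carries only a constant-size state, and the few attention layers are where exact long-range lookups can still happen.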
Love those GPQA scores hovering around 5% when chance (on 4-way multi-choice) would have got them 25%!
If the clock runs faster than regular time, it will at some point have gained a full cycle on regular time and thus be correct for a split second. If the clock runs slower than regular time, regular time will catch up to the clock and the clock will again be right for a split second.
Somehow this thing manages to accumulate an error of ~15 minutes in a month.
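Back-of-the-envelope, assuming the drift is steady and it's an ordinary 12-hour dial: at ~15 minutes of error per month it only agrees with real time about once every four years.

    drift_min_per_month = 15
    dial_min = 12 * 60                        # must gain/lose a full 12 hours to agree again
    print(dial_min / drift_min_per_month)     # 48.0 months, i.e. right once every ~4 years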
It could be right to within 15 minutes in the appropriate timezone. And a drift like that can be corrected for in a postprocessing step.
Procedural error in testing perhaps? I'm not familiar with the methodology for GPQA.
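For what it's worth, ~5% on 4-way multiple choice is so far below chance that random guessing can't explain it, so something systematic (answer extraction or formatting, maybe) has to be going on. Quick check, assuming roughly 200 questions (the count here is an assumption, not from the article):

    from math import comb

    n, p = 200, 0.25                  # assumed question count; chance of a random hit
    k = int(0.05 * n)                 # a 5% score
    print(sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1)))
    # prints a vanishingly small probability (well below 1e-12)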
IBM is claiming at least a 2x inference speed-up with Bamba. Both groups say that future SSM optimizations to vLLM would lead to further inference speed improvements.
Btw, Bamba, if given to kids at a young age, can drastically reduce the chance of peanut allergies.
SSM (state space model) -> SSSM (structured state space model) -> (it's like a snake ssss...) Mamba -> Bamba
This sounds like what they call "Bamba-9B" is actually an 18B model quantised to 8 bits.
I thought we were generally naming models "nB" by their number of params and treating quantisation as a separate concern. Are there any other models that instead treat the name as an indication of memory requirements?
Is this an attempt to hide that it fares poorly vs other ~18B parameter models?
EDIT: no, I just misunderstood
No it doesn't? The fact that it is 18 GB at 16 bits per parameter before quantization means that it is a 9B-parameter model.
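The arithmetic, for anyone skimming (weights only, ignoring activations and KV cache, decimal GB):

    params = 9e9
    for bits in (16, 8, 4):
        print(f"{bits}-bit: {params * bits / 8 / 1e9:.1f} GB")
    # 16-bit: 18.0 GB, 8-bit: 9.0 GB, 4-bit: 4.5 GB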
https://en.wikipedia.org/wiki/State-space_representation