Local inference is insanely fast on my M4 Pro MBP though, so I can understand where you're coming from, but I don't need it much faster. I still need time to review, test, and give feedback to the model. Faster is fine, I guess, for true vibe coding.
My guess is that they're angling for an acquisition.
If the results persist from 1M to 12M, why not 24M or 48M? Sounds almost too good to be true.
Doing back-of-the-napkin math in my head, that'd be something like 0.5 to 1 million LOC, depending on language and code density. If the codebase is small enough, you could just fold the entire thing into one prompt, which would be neat :)
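Spelling the napkin math out as a rough sketch: I'm assuming the 12M-token context mentioned upthread and a guessed 12 to 24 tokens per line of code. Both numbers are my assumptions, not measured figures, and real token density varies a lot by language and tokenizer.

```python
def tokens_to_loc(context_tokens: int, tokens_per_line: float) -> int:
    """Estimate how many lines of code fit in a context window."""
    return int(context_tokens / tokens_per_line)


# Assumption: the 12M-token context discussed in the thread.
CONTEXT_TOKENS = 12_000_000

# Assumption: 12 tokens/line for terse code, 24 for verbose code.
for tokens_per_line in (12, 24):
    loc = tokens_to_loc(CONTEXT_TOKENS, tokens_per_line)
    print(f"{tokens_per_line} tokens/line -> ~{loc:,} LOC")
```

Swap in a tokens-per-line figure measured on your own repo if you want a less hand-wavy number.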
> At 1M tokens, SubQ 1.1 Small requires 64.5x less compute than dense attention and runs 56x faster than FlashAttention-2.
Awesome stuff. Solving context at the model architecture layer rather than trying to bolt on extra memory is the right direction IMO.
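For a sense of what dense attention is up against at these lengths, here is a minimal sketch of how its cost grows with sequence length. It uses the standard estimate of about 4·n²·d FLOPs per head (QKᵀ plus the weighted sum over V); the head dimension of 128 is a placeholder I picked, and none of this describes SubQ's own architecture, which hasn't been disclosed.

```python
def dense_attention_flops(n_tokens: int, head_dim: int = 128) -> int:
    """Approximate FLOPs for one dense attention head over n_tokens.

    QK^T costs about 2 * n^2 * d and the weighted sum over V costs
    about another 2 * n^2 * d. head_dim=128 is a placeholder value.
    """
    return 4 * n_tokens**2 * head_dim


def growth_factor(n_from: int, n_to: int) -> float:
    """How much more dense attention costs at n_to than at n_from."""
    return dense_attention_flops(n_to) / dense_attention_flops(n_from)


# Going from 1M to 12M tokens multiplies the dense cost by 12^2.
print(growth_factor(1_000_000, 12_000_000))
```

The head dimension cancels out of the ratio, so the quadratic blow-up holds regardless of the placeholder I chose.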
I get why they aren't disclosing all the details, but it feels more like a hype train to me at the moment. I don't disagree that this could be big.
EDM115•1h ago