prefill: 30.91 t/s, generation: 29.58 t/s
From https://gist.github.com/simonw/31127f9025845c4c9b10c3e0d8612...The long context reasoning is something I haven't even seen in frontier models - I was running at 124k tokens earlier and it was still just buzzing along with no issues or fatigue.
I am amazed at how well it works, I'm using it right now for some pretty complex frontend work, and it is much much faster than, for example running a dense 27b or 31b model (like qwen or gemma) for me (The benefits of MoE) - but the long context capabilities have been what have been absolutely flooring me.
Super excited about this project and hope Antirez can keep himself from burning out - i've been following the repo pretty closely and there are a ton of PR's flooding in and it seems like he's had to do a lot of filtering out of slop code.
bjconlan•51m ago
Also have enjoyed playing with https://huggingface.co/HuggingFaceTB/nanowhale-100m-base (but early days for me understanding this space)