Here is the awards page: https://cspaper.org/topic/116/record-breaking-acl-2025-crown...
Isn't it very notable that the latency improvement came without a performance loss? I'm not super familiar with all the technical aspects, but that seems like it should be one of the main focuses of the paper.
The awards page for ACL seems to disagree with this editorialized title: https://2025.aclweb.org/program/awards/
> Industry Track Awards
> Best Paper
> Speed Without Sacrifice: Fine-Tuning Language Models with Medusa and Knowledge Distillation in Travel Applications
> Daniel Zagyva, Emmanouil Stergiadis, Laurens van der Maas, Aleksandra Dokic, Eran Fainman, Ilya Gusev, Moran Beladev
Per TFA, the paper we’re looking for is this one:
> Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
> Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng
I’m not finding it by author on the page you linked but I think it’s this reference by title:
> DeepSeek × PKU × UW — Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
I did find it on this page: https://cspaper.org/topic/116/record-breaking-acl-2025-crown...
I have a suspicion, given how quiet all the major players got in the two weeks after DeepSeek R1 was released, that they were reading and implementing everything in the papers that came with it as fast as humanly possible.
I applaud their open efforts. But being "altruistic" and being the best are two different things.
Their innovations in training efficiency were almost guaranteed to have been studied closely by the big AI labs. For example, Dario Amodei talks about the efficiency improvements being the really important contribution of DeepSeek V3 here: https://www.darioamodei.com/post/on-deepseek-and-export-cont...
> DeepSeek's team did this via some genuine and impressive innovations, mostly focused on engineering efficiency. There were particularly innovative improvements in the management of an aspect called the "Key-Value cache", and in enabling a method called "mixture of experts" to be pushed further than it had before.
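For anyone who doesn't know the jargon in that quote: during autoregressive decoding a model reuses the attention keys and values it already computed for the prefix instead of recomputing them at every step, and that reuse buffer is the KV cache. Here is a minimal NumPy sketch of the idea (illustrative only, nothing to do with DeepSeek's actual implementation; the weights are random stand-ins):

```python
import numpy as np

def attend(q, K, V):
    """Single-head scaled dot-product attention for one query vector."""
    scores = K @ q / np.sqrt(q.shape[-1])   # one score per cached token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

d = 64
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

K_cache, V_cache = [], []                    # the "KV cache": one entry per generated token
for step in range(10):
    x = rng.standard_normal(d)               # hidden state of the newest token (stand-in)
    # Only the newest token's key/value get computed; earlier entries are reused.
    K_cache.append(Wk @ x)
    V_cache.append(Wv @ x)
    out = attend(Wq @ x, np.stack(K_cache), np.stack(V_cache))
```

The cache grows linearly with context length, which is why smarter management of it, as the quote credits DeepSeek with, translates directly into memory and latency savings.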
And the saltiness of US labs about DeepSeek is well-known. "O3, explain model distillation like I'm five."
No Sam, explain intellectual property rights to the judge in the NYT test case, asshole.
Watching the Chinese labs kick the shit out of better funded US enclaves of TESCREAL psychopathy in the public fucking domain is gravy.
I don't care that their internal calculus, or that of the PRC, is to Cloud Strife Limit Break a bunch of "shareholder value" in the form of a bloated NVIDIA market-cap feeding frenzy by bloated "public benefit corporations" with a bunch of creepy ties to Thiel et al: they're publishing papers, code, and weights. So their hoovering up of the commons has something of value going back into the commons.
So yeah, fuck Sam, and it's going to be fun watching OpenAI and Anthropic pivot even further towards trying to outlaw competition than they already have. Amodei already sounds like Donald Rumsfeld on Taiwan hawkishness; this is not the positioning of someone who loves their product roadmap.
It turns out that a zillion ScaleAI and SurgeAI turks don't have economics any better than paying NVIDIA to run 85% net margins for CapEx that's obsolete by the time it's racked and powered.
Native Sparse Attention matters. Your commentary is beneath this paper.
ChatGPT o1 was made generally available in December 2024. DeepSeek R1 open weights were released in January 2025.
o1 was mimicking the then-new "explain your solution step by step" prompts, which had been shown to be more effective at the time.
DeepSeek R1 came up with an actual chain of thought, imitating meandering and self-doubt, which sometimes went on for a while, creating entertaining loops and introducing a new class of halting problem, and this became the de facto standard. They also revolutionized the industry by programming the cards directly via PTX.
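To make that distinction concrete, here is a toy Python illustration (no real model calls; the strings are made up): prompted chain of thought bolts the reasoning onto an ordinary instruct model via the prompt, while an R1-style reasoning model emits its own <think>...</think> trace before the answer.

```python
# Approach 1: chain of thought via prompting an ordinary instruct model.
cot_prompt = (
    "Q: A train leaves at 3pm and arrives at 7pm. How long is the trip?\n"
    "Explain your solution step by step, then give the final answer."
)

# Approach 2: a reasoning-trained model (R1-style) produces the trace itself.
# No special prompt needed; the raw completion interleaves thinking and answer.
r1_style_completion = (
    "<think>The train departs at 3pm and arrives at 7pm. "
    "7pm minus 3pm is 4 hours. Let me double-check: 3 -> 4 -> 5 -> 6 -> 7, "
    "that's four steps, so 4 hours.</think>\n"
    "The trip takes 4 hours."
)

# Downstream code typically strips the trace before showing the answer:
answer = r1_style_completion.split("</think>")[-1].strip()
print(answer)  # -> "The trip takes 4 hours."
```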
Then all of Hugging Face implemented the paper and we got Q4 (4-bit quantized) versions that "thought".
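"Q4" here just means 4-bit quantized weights. As a rough sketch of the idea, here is symmetric round-to-nearest quantization in NumPy; the community releases actually used more sophisticated schemes (GPTQ, GGUF k-quants), so treat this only as an illustration of why the models shrink roughly 4x versus fp16:

```python
import numpy as np

def quantize_q4(w):
    """Symmetric 4-bit round-to-nearest: 15 usable levels in [-7, 7] plus one scale."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)  # packed into 4 bits in practice
    return q, scale

def dequantize_q4(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
q, s = quantize_q4(w)
err = np.abs(w - dequantize_q4(q, s)).mean()
print(f"mean abs quantization error: {err:.4f}")
```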
Hope that jolted your memory without killing Hitler.