Fast and Simplex: 2-Simplicial Attention in Triton

Aurko Roy, Timothy Chou, Sai Surya Duvvuri, Sijia Chen, Jiecao Yu, Xiaodong Wang, Manzil Zaheer, Rohan Anil

arXiv
July 3, 2025

Recent work has shown that training loss scales as a power law with both model size and the number of tokens, and that achieving compute-optimal models requires scaling model size and token count together. However, these scaling laws assume an infinite supply of data and apply primarily in compute-bound settings. As modern large language models increasingly rely on massive internet-scale datasets, the assumption that they are compute-bound is becoming less valid. This shift highlights the need for architectures that prioritize token efficiency. In this work, we explore the use of the 2-simplicial Transformer, an architecture that generalizes standard dot-product attention to trilinear functions via an efficient Triton kernel implementation. We show that the 2-simplicial Transformer achieves better token efficiency than standard Transformers: for a fixed token budget, similarly sized models outperform their dot-product counterparts on tasks involving mathematics, coding, reasoning, and logic. We quantify these gains by demonstrating that 2-simplicial attention changes the exponent in the scaling laws for knowledge and reasoning tasks compared to dot-product attention.
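
For context, the power-law relationship referenced above is commonly written in the Chinchilla-style form L(N, D) = E + A·N^(-α) + B·D^(-β), where N is the parameter count and D the number of training tokens; the claim here is that 2-simplicial attention shifts the exponent governing this law relative to dot-product attention, though the abstract does not spell out the exact parametrization used.

The sketch below illustrates what "generalizing dot-product attention to trilinear functions" can look like for a single attention head. It is a minimal, unoptimized PyTorch reference, not the paper's implementation: the function name, the 1/sqrt(d) scaling, and the value-aggregation step (an elementwise product of two value vectors) are illustrative assumptions, and none of the Triton-kernel efficiency techniques are shown.

```python
import torch

def two_simplicial_attention(q, k1, k2, v1, v2):
    # q, k1, k2, v1, v2: (seq_len, dim) tensors for a single head.
    n, d = q.shape
    # Trilinear logits over all (query i, key j, key' k) triples:
    # <q_i, k_j, k'_k> = sum_d q[i,d] * k1[j,d] * k2[k,d],
    # generalizing the bilinear dot product q_i . k_j.
    logits = torch.einsum("id,jd,kd->ijk", q, k1, k2) / d ** 0.5
    # Joint softmax over the pair of key positions (j, k) for each query i.
    attn = torch.softmax(logits.reshape(n, -1), dim=-1).reshape(n, n, n)
    # Pairwise value interaction; the elementwise product of two value
    # vectors is one simple choice (an assumption made for this sketch).
    pair_values = torch.einsum("jd,kd->jkd", v1, v2)       # (n, n, dim)
    return torch.einsum("ijk,jkd->id", attn, pair_values)  # (n, dim)

# Usage on random data.
n, d = 8, 16
q, k1, k2, v1, v2 = (torch.randn(n, d) for _ in range(5))
out = two_simplicial_attention(q, k1, k2, v1, v2)
print(out.shape)  # torch.Size([8, 16])
```

Note that the naive trilinear form materializes logits over all (query, key, key') triples and is therefore cubic in sequence length, which is why an efficient fused kernel implementation is central to making the approach practical.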