L^2M: Mutual Information Scaling Law for Long-Context Language Modeling
Zhuo Chen, Oriol Mayné i Comas, Zhuotao Jin, Di Luo, Marin Soljačić
We present a universal theoretical framework for understanding long-context language modeling based on a bipartite mutual information scaling law that we rigorously verify in natural language. We demonstrate that bipartite mutual information captures multi-token interactions distinct from and scaling independently of conventional two-point mutual information, and show that this provides a more complete characterization of the dependencies needed for accurately modeling long sequences. Leveraging this scaling law, we formulate the Long-context Language Modeling (L^2M) condition, which lower bounds the necessary scaling of a model's history state (the latent variables responsible for storing past information) for effective long-context modeling. We validate the framework and its predictions on transformers and state space models. Our work provides a principled foundation for understanding long-context modeling and for designing more efficient architectures with stronger long-context capabilities, with potential applications beyond natural language.
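As a rough illustration of the two quantities contrasted in the abstract (the notation here, including X_{1:L/2}, beta, and h_L, is introduced for this sketch and is not taken verbatim from the paper), two-point mutual information relates a pair of individual tokens at separation d, while bipartite mutual information relates two adjacent blocks of a length-L sequence:

\[
I_{\mathrm{two\text{-}point}}(d) = I(X_t ; X_{t+d}),
\qquad
I_{\mathrm{bipartite}}(L) = I\big(X_{1:L/2} ;\, X_{L/2+1:L}\big).
\]

Under the assumption that the bipartite quantity follows a power-law scaling, the L^2M condition can then be paraphrased as a lower bound on how fast the model's history state must grow with context length:

\[
I_{\mathrm{bipartite}}(L) \propto L^{\beta}
\quad\Longrightarrow\quad
\lvert h_L \rvert \;\gtrsim\; L^{\beta},
\]

where |h_L| denotes the size of the latent state available for storing information about the first half of the sequence. This is a hedged paraphrase of the condition stated in the abstract, not the paper's exact formulation.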