Diffusion Beats Autoregressive in Data-Constrained Settings
Mihir Prabhudesai, Mengning Wu, Amir Zadeh, Katerina Fragkiadaki, Deepak Pathak
Autoregressive (AR) models have long dominated the landscape of large language models, driving progress across a wide range of tasks. Recently, diffusion-based language models have emerged as a promising alternative, though their advantages over AR models remain underexplored. In this paper, we systematically study masked diffusion models in data-constrained settings—where training involves repeated passes over limited data—and find that they significantly outperform AR models when compute is abundant but data is scarce. Diffusion models make better use of repeated data, achieving lower validation loss and superior downstream performance. We interpret this advantage as implicit data augmentation: unlike the fixed left-to-right factorization of AR models, masked diffusion exposes the model to a diverse distribution of token orderings and prediction tasks. We find new scaling laws for diffusion models and derive a closed-form expression for the critical compute threshold at which diffusion begins to outperform AR. These results suggest that when data, not compute, is the bottleneck, diffusion models offer a compelling alternative to the standard AR paradigm. Our code is available at: https://diffusion-scaling.github.io.
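The "implicit data augmentation" intuition is easiest to see by comparing the two training objectives directly. The sketch below is not the paper's released code: `TinyLM` and `MASK_ID` are illustrative placeholders, and the masked-diffusion loss omits the usual per-mask-ratio ELBO weighting. It only shows that the AR loss always poses the same left-to-right prediction task, while the masked objective samples a fresh corruption pattern, and hence a fresh prediction task, on every pass over the same data.

```python
# Minimal sketch, assuming a generic decoder model(tokens) -> logits over a vocabulary.
import torch
import torch.nn.functional as F

VOCAB, SEQ_LEN, DIM = 1000, 32, 64
MASK_ID = VOCAB  # hypothetical [MASK] token id, appended after the vocabulary

class TinyLM(torch.nn.Module):
    """Toy stand-in for a transformer; present only to make the losses runnable."""
    def __init__(self):
        super().__init__()
        self.emb = torch.nn.Embedding(VOCAB + 1, DIM)  # +1 slot for MASK_ID
        self.head = torch.nn.Linear(DIM, VOCAB)
    def forward(self, tokens):
        return self.head(self.emb(tokens))  # (batch, seq, vocab)

def ar_loss(model, tokens):
    """AR objective: always predict token t+1 from the fixed left-to-right prefix."""
    logits = model(tokens[:, :-1])
    return F.cross_entropy(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))

def masked_diffusion_loss(model, tokens):
    """Masked-diffusion objective: sample a random mask ratio, hide a random
    subset of tokens, and predict only the hidden ones (unweighted, simplified)."""
    ratio = torch.rand(())                                        # corruption level
    mask = torch.rand_like(tokens, dtype=torch.float) < ratio     # random positions
    corrupted = tokens.masked_fill(mask, MASK_ID)
    logits = model(corrupted)
    per_token = F.cross_entropy(logits.reshape(-1, VOCAB), tokens.reshape(-1),
                                reduction="none").reshape(tokens.shape)
    return (per_token * mask).sum() / mask.sum().clamp(min=1)

model = TinyLM()
batch = torch.randint(0, VOCAB, (4, SEQ_LEN))
print("AR loss:", ar_loss(model, batch).item())
print("Masked diffusion loss:", masked_diffusion_loss(model, batch).item())
```

Repeating an epoch leaves `ar_loss` with an identical target for every position, whereas `masked_diffusion_loss` draws a new mask each time, which is the augmentation-like effect the abstract credits for better reuse of limited data.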