Accelerating Training Speed of Tiny Recursive Models via Curriculum Guided Adaptive Recursion
Kaleem Ullah Qasim and Jiashu Zhang
Recursive reasoning models achieve remarkable performance on complex reasoning tasks through iterative refinement, enabling tiny networks to match large language models thousands of times their size. However, training remains computationally expensive, with prior work reporting approximately 36 GPU-hours per dataset, limiting broader adoption and research. We propose CGAR, a novel training methodology that applies curriculum learning to architectural depth rather than traditional data ordering. CGAR introduces two synergistic components: a Progressive Depth Curriculum that dynamically adjusts recursion depth from shallow to deep configurations during training, preventing early overfitting while reducing computational cost, and Hierarchical Supervision Weighting that applies exponentially decaying importance to supervision steps, aligning loss weights with the observed decay in gradient magnitudes. On Sudoku-Extreme with 423,168 test puzzles, CGAR achieves a 1.71x training speedup (10.93 to 6.38 hours, a 42% cost reduction) with only a 0.63% accuracy drop (86.65% to 86.02%). Systematic ablations show that the Progressive Depth Curriculum alone achieves a 2.26x speedup with 85.47% accuracy, demonstrating a rare Pareto improvement in which an architectural curriculum simultaneously improves training efficiency and solution quality. CGAR-trained models also exhibit superior inference efficiency, with 100% halting accuracy and 11% fewer reasoning steps. Our work shows that a principled curriculum over architectural depth enables efficient training of recursive reasoning models on modest hardware. Code and models: https://github.com/Kaleemullahqasim/CGAR and https://huggingface.co/Kaleemullah/trm-cgar-sudoku
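For intuition, the following minimal Python sketch illustrates the two components as described above. The depth range, schedule shape, decay rate, and function names are illustrative assumptions, not the paper's exact hyperparameters or implementation.

```python
# Hypothetical sketch of CGAR's two components (Progressive Depth Curriculum
# and Hierarchical Supervision Weighting); all constants are assumptions.

def recursion_depth(epoch: int, total_epochs: int,
                    min_depth: int = 2, max_depth: int = 6) -> int:
    """Progressive Depth Curriculum: grow recursion depth from shallow to
    deep configurations as training progresses."""
    frac = epoch / max(total_epochs - 1, 1)
    return min_depth + round(frac * (max_depth - min_depth))

def supervision_weights(num_steps: int, decay: float = 0.5) -> list:
    """Hierarchical Supervision Weighting: exponentially decaying loss
    weights over supervision steps, normalized to sum to 1."""
    raw = [decay ** t for t in range(num_steps)]
    total = sum(raw)
    return [w / total for w in raw]

# Example: at epoch 30 of 100 the model unrolls 3 recursion steps, and the
# per-step losses are combined with weights approximately [0.57, 0.29, 0.14].
depth = recursion_depth(epoch=30, total_epochs=100)
weights = supervision_weights(num_steps=depth)
step_losses = [0.9, 0.7, 0.6]  # dummy per-step losses for illustration
weighted_loss = sum(w * l for w, l in zip(weights, step_losses))
```

In this sketch, early epochs train with shallow recursion (cheap forward passes) and later epochs with the full depth, while later supervision steps contribute progressively less to the total loss, mirroring the decay in their gradient magnitudes.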