活水快报 - 42Digest

ReplicationBench: Can AI Agents Replicate Astrophysics Research Papers?

Christine Ye, Sihan Yuan, Suchetha Cooray, Steven Dillmann, Ian L. V. Roque, Dalya Baron, Philipp Frank, Sergio Martin-Alvarez, Nolan Koblischke, Frank J Qu, Diyi Yang, Risa Wechsler, Ioana Ciuca

arXiv

October 28, 2025

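Frontier AI agents show increasing promise as scientific research assistants, and may eventually be useful for extended, open-ended research workflows. However, in order to use agents for novel research, we must first assess the underlying faithfulness and correctness of their work. To evaluate agents as research assistants, we introduce ReplicationBench, an evaluation framework that tests whether agents can replicate entire research papers drawn from the astrophysics literature. Astrophysics, where research relies heavily on archival data and computational study while requiring little real-world experimentation, is a particularly useful testbed for AI agents in scientific research. We split each paper into tasks that require agents to replicate the paper's core contributions, including the experimental setup, derivations, data analysis, and codebase. Each task is co-developed with the original paper authors and targets a key scientific result, enabling objective evaluation of both faithfulness (adherence to the original methods) and correctness (technical accuracy of the results). ReplicationBench is extremely challenging for current frontier language models: even the best-performing language models score under 20%. We analyze ReplicationBench trajectories in collaboration with domain experts and find a rich, diverse set of failure modes for agents in scientific research. ReplicationBench establishes the first benchmark of paper-scale, expert-validated astrophysics research tasks, reveals insights about agent performance that generalize to other domains of data-driven science, and provides a scalable framework for measuring the reliability of AI agents in scientific research.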

Computation and Language · Instrumentation and Methods for Astrophysics
