SimpleFold: Folding Proteins is Simpler than You Think
Yuyang Wang, Jiarui Lu, Navdeep Jaitly, Josh Susskind, Miguel Angel Bautista
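Protein folding models have typically achieved their breakthrough results by building domain knowledge into architectural blocks and training pipelines. However, given the success of generative models on different but related problems, it is natural to ask whether these architectural designs are necessary for building high-performing models. In this paper, we introduce SimpleFold, the first flow-matching based protein folding model that uses only general-purpose transformer blocks. Protein folding models usually rely on computationally expensive modules involving triangular updates, explicit pair representations, or multiple training objectives tailored to this specific domain. In contrast, SimpleFold uses standard transformer blocks with adaptive layers and is trained with a generative flow-matching objective plus an additional structural term. We scale SimpleFold to 3 billion parameters and train it on approximately 9 million distilled protein structures together with experimental PDB data. On standard folding benchmarks, SimpleFold-3B achieves competitive performance compared with state-of-the-art baselines. SimpleFold also shows strong performance in ensemble prediction, which is usually difficult for models trained with deterministic reconstruction objectives. Thanks to its general-purpose architecture, SimpleFold is efficient to deploy and run on consumer-grade hardware. SimpleFold challenges the reliance on complex domain-specific architectural designs in protein folding and opens up an alternative design space for future progress.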
Protein folding models have achieved groundbreaking results typically via a combination of integrating domain knowledge into the architectural blocks and training pipelines. Nonetheless, given the success of generative models across different but related problems, it is natural to question whether these architectural designs are a necessary condition to build performant models. In this paper, we introduce SimpleFold, the first flow-matching based protein folding model that solely uses general pu...
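To make the phrase "generative flow-matching objective" concrete, the following is a minimal sketch of a generic flow-matching regression loss and an Euler sampler over 3D coordinates. It is not SimpleFold's implementation: the abstract does not specify the interpolation path, the noise distribution, or the time sampling, so the sketch assumes a straight-line path, standard Gaussian noise, and uniformly sampled time. The name `predict_velocity` is a hypothetical stand-in for the transformer, and the additional structural term mentioned in the abstract is omitted.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [[(1 - t) * a + t * b for a, b in zip(p0, p1)]
            for p0, p1 in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight path: d/dt x_t = x1 - x0 (constant in t)."""
    return [[b - a for a, b in zip(p0, p1)] for p0, p1 in zip(x0, x1)]

def flow_matching_loss(predict_velocity, x1, rng):
    """One-sample estimate of the flow-matching loss for coordinates x1.

    predict_velocity(xt, t) is a hypothetical model callable that returns
    one velocity vector per point. Assumptions: Gaussian noise, uniform t.
    """
    x0 = [[rng.gauss(0.0, 1.0) for _ in p] for p in x1]  # noise sample
    t = rng.random()                                      # t in [0, 1)
    xt = interpolate(x0, x1, t)
    v_pred = predict_velocity(xt, t)
    v_true = target_velocity(x0, x1)
    sq = [(a - b) ** 2
          for pp, pt in zip(v_pred, v_true) for a, b in zip(pp, pt)]
    return sum(sq) / len(sq)  # mean squared error over all coordinates

def euler_sample(predict_velocity, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = [list(p) for p in x0]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = predict_velocity(x, k * dt)
        x = [[a + dt * b for a, b in zip(p, pv)] for p, pv in zip(x, v)]
    return x

if __name__ == "__main__":
    rng = random.Random(0)
    # Toy "structure": three points in 3D (illustrative values only).
    x1 = [[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [3.8, 3.8, 0.0]]

    # Oracle velocity for this single target, usable only in a toy setting:
    # on the straight path, (x1 - x_t) / (1 - t) equals x1 - x0.
    def oracle(xt, t):
        return [[(b - a) / (1.0 - t) for a, b in zip(p, q)]
                for p, q in zip(xt, x1)]

    print("loss with oracle velocity:", flow_matching_loss(oracle, x1, rng))
    x0 = [[rng.gauss(0.0, 1.0) for _ in p] for p in x1]
    print("sampled coordinates:", euler_sample(oracle, x0, num_steps=10))
```

In this toy setting the oracle velocity field drives the loss to roughly zero and the sampler carries noise onto the target points. A trained model would replace `oracle` and be fitted by averaging the loss over many structures, noise samples, and time values. Because sampling starts from random noise, repeated runs can yield different structures, which is one plausible reading of why a generative objective suits ensemble prediction.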