Benchmarking Simulacra AI's Quantum Accurate Synthetic Data Generation for Chemical Sciences
Fabio Falcioni, Elena Orlova, Timothy Heightman, Philip Mantrov, Aleksei Ustimenko
In this work, we benchmark Simulacra AI's synthetic data generation pipeline against a state-of-the-art Microsoft pipeline on a dataset of small to large systems. By analyzing energy quality, autocorrelation times, and effective sample size, we show that Simulacra's Large Wavefunction Models (LWM) pipeline, paired with state-of-the-art Variational Monte Carlo (VMC) sampling algorithms, reduces data generation costs by 15-50x while maintaining parity in energy accuracy, and by 2-3x compared to traditional CCSD methods at the scale of amino acids. This enables the creation of affordable, large-scale ab-initio datasets, accelerating AI-driven optimization and discovery in the pharmaceutical industry and beyond. Our improvements are based on a novel and proprietary sampling scheme called Replica Exchange with Langevin Adaptive eXploration (RELAX).
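The two sampling diagnostics named above, autocorrelation time and effective sample size, can be illustrated with a minimal sketch. This is not Simulacra's or Microsoft's pipeline: the function names are hypothetical, the estimator is the standard FFT-based integrated autocorrelation time with a self-consistent window (window constant c = 5 is an assumed default), and a synthetic AR(1) chain stands in for a series of VMC local energies.

```python
import numpy as np


def integrated_autocorr_time(chain, c=5.0):
    """Estimate the integrated autocorrelation time of a 1-D chain.

    tau(M) = 1 + 2 * sum_{t=1..M} rho(t), truncated at the smallest
    lag M satisfying M >= c * tau(M) (self-consistent window).
    """
    chain = np.asarray(chain, dtype=float)
    n = chain.size
    centered = chain - chain.mean()
    # Zero-pad to length 2n so the FFT autocorrelation does not wrap around.
    spectrum = np.fft.rfft(centered, n=2 * n)
    acov = np.fft.irfft(spectrum * np.conj(spectrum), n=2 * n)[:n]
    rho = acov / acov[0]  # normalized autocorrelation, rho[0] == 1
    tau = 1.0 + 2.0 * np.cumsum(rho[1:])  # tau[i] uses lags 1..i+1
    lags = np.arange(1, n)
    converged = lags >= c * tau
    # Fall back to the longest window if the criterion is never met.
    idx = int(np.argmax(converged)) if converged.any() else n - 2
    return float(tau[idx])


def effective_sample_size(chain, c=5.0):
    """Number of effectively independent samples: N / tau."""
    chain = np.asarray(chain, dtype=float)
    return chain.size / integrated_autocorr_time(chain, c=c)


def ar1_chain(n, phi, seed=0):
    """Synthetic AR(1) series x[t] = phi * x[t-1] + noise (illustrative only)."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(n)
    x = np.empty(n)
    x[0] = noise[0]
    for t in range(1, n):
        x[t] = phi * x[t - 1] + noise[t]
    return x
```

For an AR(1) chain with coefficient phi, the exact integrated autocorrelation time is (1 + phi) / (1 - phi), so phi = 0.9 gives 19 and a long chain should yield an estimate close to that. A sampler that lowers the autocorrelation time raises the effective sample size obtained per unit of compute, which is the sense in which these diagnostics bear on data generation cost.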