Same model, better performance: the impact of shuffling on DNA Language Models benchmarking
Davide Greco, Konrad Rawlik
Large Language Models are increasingly popular in genomics due to their potential to decode complex biological sequences. Hence, researchers require a standardized benchmark to evaluate the capabilities of DNA Language Models (DNA LMs). However, evaluating DNA LMs is a complex task that intersects genomics' domain-specific challenges with machine learning methodologies, where seemingly minor implementation details can significantly compromise benchmark validity. We demonstrate this through BEND (Benchmarking DNA Language Models), where hardware-dependent hyperparameters, namely the number of data loading workers and the buffer size, create spurious performance variations of up to 4%.