HessFormer: Hessians at Foundation Scale
Diego Granziol
Whilst there have been major advancements in the field of first-order optimisation of deep learning models, where state-of-the-art open source mixture-of-experts models run into the hundreds of billions of parameters, methods that rely on Hessian vector products are still limited to running on a single GPU and thus cannot even be applied to models in the billion parameter range. We release a software package HessFormer, which integrates nicely with the well known Transformers package and allows for distributed Hessian vector computation on a single node with multiple GPUs. Under the hood of our implementation is the distributed stochastic Lanczos quadrature algorithm, which we release for public use. Using this package, we investigate the Hessian spectral density of the recent Deepseek 70 billion parameter model.
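To make the core computation concrete, below is a minimal single-GPU sketch of stochastic Lanczos quadrature in PyTorch, using double backpropagation (Pearlmutter's trick) for the Hessian vector products. The function names `hvp` and `lanczos_spectrum` are illustrative only, not the HessFormer API, and the sketch omits the multi-GPU parameter sharding that the package is built around.

```python
import torch

def hvp(loss, params, vec):
    # Hessian-vector product via double backprop (Pearlmutter's trick):
    # Hv = d/dtheta (grad(loss) . v). create_graph=True keeps the first
    # backward graph so it can be differentiated a second time.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    dot = sum((g * v).sum() for g, v in zip(grads, vec))
    return torch.autograd.grad(dot, params, retain_graph=True)

def lanczos_spectrum(loss, params, steps=30):
    # One Lanczos run with the HVP as matrix-vector oracle. Returns Ritz
    # values and quadrature weights; the spectral density estimate for
    # this probe is sum_k weights[k] * delta(t - ritz[k]).
    v = [(torch.randint_like(p, 2) * 2 - 1).float() for p in params]  # Rademacher probe
    norm = torch.sqrt(sum((x ** 2).sum() for x in v))
    v = [x / norm for x in v]
    alphas, betas, v_prev = [], [], None
    for j in range(steps):
        w = hvp(loss, params, v)
        if v_prev is not None:  # three-term recurrence: subtract beta_{j-1} * v_{j-1}
            w = [wi - betas[-1] * pi for wi, pi in zip(w, v_prev)]
        alpha = sum((wi * vi).sum() for wi, vi in zip(w, v))
        w = [wi - alpha * vi for wi, vi in zip(w, v)]
        beta = torch.sqrt(sum((wi ** 2).sum() for wi in w))
        alphas.append(alpha.item())
        if j < steps - 1:
            betas.append(beta.item())
            v_prev, v = v, [wi / beta for wi in w]
    # Eigendecompose the tridiagonal Lanczos matrix T; the squared first
    # components of its eigenvectors give the Gauss quadrature weights.
    T = torch.diag(torch.tensor(alphas))
    for j, b in enumerate(betas):
        T[j, j + 1] = T[j + 1, j] = b
    ritz, U = torch.linalg.eigh(T)
    return ritz, U[0, :] ** 2
```

In practice the estimate is averaged over several random probe vectors and the Lanczos basis is re-orthogonalised for numerical stability; the distributed version described in the abstract additionally shards the parameters and the HVP across GPUs so that a 70 billion parameter model fits in aggregate device memory.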