From Attention to Disaggregation: Tracing the Evolution of LLM Inference
Madabattula Rajesh Kumar, Srinivasa Rao Aravilli, Mustafa Saify, Shashank Srivastava
The evolution of Large Language Models from the Transformer architecture to models with trillions of parameters has shifted the primary bottleneck from model training to real-time inference. Deploying these massive models is a complex distributed-systems challenge constrained by memory bandwidth, computational throughput, and latency requirements. LLM inference fundamentally requires solving a multi-objective optimization problem to minimize latency, maximize throughput, and reduce cost. This paper examines the necessary architectural shift toward disaggregated inference, which applies distributed-systems principles such as service decomposition, resource disaggregation, and workload partitioning to overcome the limitations of traditional monolithic GPU clusters. By decoupling the compute-intensive prefill phase from the memory-intensive decode phase into independently scalable components, this paradigm mitigates resource contention and enables independent optimization of key metrics such as Time to First Token and Inter Token Latency.
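To make the prefill/decode split concrete, the sketch below models the two phases as separate workers joined by a queue that stands in for the KV-cache transfer link between tiers. It is a minimal, single-process illustration of the disaggregation idea under simplifying assumptions, not the system evaluated in this paper; the names (Request, KVCacheHandoff, prefill_worker, decode_worker) and the sleep-based timing stand-ins are hypothetical.

```python
# Toy sketch of prefill/decode disaggregation (illustrative only).
import queue
import threading
import time
from dataclasses import dataclass, field


@dataclass
class Request:
    request_id: int
    prompt_tokens: list                       # token ids of the prompt
    arrival_time: float = field(default_factory=time.monotonic)


@dataclass
class KVCacheHandoff:
    """What the prefill tier ships to the decode tier."""
    request: Request
    kv_cache: list                            # placeholder for per-layer key/value tensors
    ttft: float                               # Time to First Token, measured at prefill exit


def prefill_worker(in_q: queue.Queue, out_q: queue.Queue) -> None:
    # Compute-bound phase: process the whole prompt in one pass,
    # materialize the KV cache, then hand it off to the decode tier.
    while True:
        req = in_q.get()
        if req is None:
            out_q.put(None)                   # propagate shutdown signal
            return
        time.sleep(0.001 * len(req.prompt_tokens))   # stand-in for prompt compute
        kv_cache = [("k", "v")] * len(req.prompt_tokens)
        ttft = time.monotonic() - req.arrival_time
        out_q.put(KVCacheHandoff(req, kv_cache, ttft))


def decode_worker(in_q: queue.Queue, max_new_tokens: int = 8) -> None:
    # Memory-bound phase: generate tokens one at a time,
    # appending to the KV cache received from the prefill tier.
    while True:
        handoff = in_q.get()
        if handoff is None:
            return
        step_times = []
        for _ in range(max_new_tokens):
            t0 = time.monotonic()
            time.sleep(0.002)                 # stand-in for one decode step
            handoff.kv_cache.append(("k", "v"))
            step_times.append(time.monotonic() - t0)
        itl = sum(step_times) / len(step_times)      # mean Inter Token Latency
        print(f"req {handoff.request.request_id}: "
              f"TTFT={handoff.ttft * 1e3:.1f} ms, ITL={itl * 1e3:.1f} ms")


if __name__ == "__main__":
    prefill_q, decode_q = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=prefill_worker, args=(prefill_q, decode_q)),
        threading.Thread(target=decode_worker, args=(decode_q,)),
    ]
    for t in threads:
        t.start()
    for i in range(3):
        prefill_q.put(Request(request_id=i, prompt_tokens=list(range(128))))
    prefill_q.put(None)                       # shut both tiers down
    for t in threads:
        t.join()
```

In a real deployment the in-process queue would be replaced by a KV-cache transfer over a high-bandwidth interconnect, and the prefill and decode tiers would be provisioned and scaled independently, which is what allows Time to First Token and Inter Token Latency to be optimized separately.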