A Summary of LLM Inference Acceleration Techniques

This note summarizes techniques for high-performance LLM inference acceleration, covering the following aspects:

  • Computation graph and op optimization
  • Inference frameworks
  • Runtime and serving system architecture for LLMs

Computation Graph and Op Optimization

  • KV Cache (see the sketch after this list)
  • GQA, MQA
  • FlashAttention v1, v2
  • FlashDecoding
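As a reference point for the KV Cache item above, here is a minimal single-head decode-step sketch (illustrative code, not tied to any particular framework): keys and values of past tokens are cached once, so each new token only attends over the cache instead of recomputing the whole prefix.

```python
# Minimal KV-cache decode step (illustrative sketch, single head, no batching).
import torch

def decode_step(q, new_k, new_v, k_cache, v_cache):
    """q, new_k, new_v: [1, d]; k_cache, v_cache: [t, d] tensors of past keys/values."""
    k_cache = torch.cat([k_cache, new_k], dim=0)          # append current key to the cache
    v_cache = torch.cat([v_cache, new_v], dim=0)          # append current value to the cache
    scores = (q @ k_cache.T) / k_cache.shape[-1] ** 0.5   # attend over all cached positions
    probs = torch.softmax(scores, dim=-1)
    out = probs @ v_cache                                  # [1, d] attention output
    return out, k_cache, v_cache

d = 64
k_cache = torch.zeros(0, d)   # empty cache before the first generated token
v_cache = torch.zeros(0, d)
for _ in range(4):            # autoregressive decoding: one token per step
    q, k, v = torch.randn(1, d), torch.randn(1, d), torch.randn(1, d)
    out, k_cache, v_cache = decode_step(q, k, v, k_cache, v_cache)
```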

PagedAttention

Inference Libraries

Speculative Sampling
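Since the heading only names the technique, here is a simplified greedy-verification sketch of the idea; the published algorithms use a probabilistic accept/reject rule, and `draft_next_token` and `target_argmax_each_position` below are hypothetical helpers. A small draft model proposes several tokens, the large target model checks all of them in a single forward pass, and the longest agreeing prefix is kept.

```python
# Greedy speculative-decoding sketch (illustrative only).
def speculative_step(prefix, k, draft_next_token, target_argmax_each_position):
    # 1) the cheap draft model proposes k tokens autoregressively
    proposal = []
    for _ in range(k):
        proposal.append(draft_next_token(prefix + proposal))
    # 2) the target model scores all k positions in one forward pass
    target_tokens = target_argmax_each_position(prefix, proposal)
    # 3) accept the longest prefix where the target agrees with the draft
    accepted = []
    for drafted, verified in zip(proposal, target_tokens):
        if drafted == verified:
            accepted.append(drafted)
        else:
            accepted.append(verified)   # first disagreement: keep the target's token and stop
            break
    return prefix + accepted
```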

LLM Inference: FasterTransformer + TRITON

vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention

LLMs promise to fundamentally change how we use AI across all industries. However, actually serving these models is challenging and can be surprisingly slow even on expensive hardware. Today we are excited to introduce vLLM, an open-source library for fast LLM inference and serving. vLLM uses PagedAttention, our new attention algorithm that effectively manages attention keys and values. vLLM equipped with PagedAttention redefines the state of the art in LLM serving: it delivers up to 24x higher throughput than HuggingFace Transformers, without requiring any model architecture changes.

vLLM has been developed at UC Berkeley and deployed in Chatbot Arena and Vicuna Demo for the past two months. It is the core technology that makes LLM serving affordable even for small research teams with limited compute resources, like LMSYS. You can now try out vLLM with a single command from our GitHub repository.
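As a concrete starting point, here is a minimal offline-inference sketch with vLLM's Python API; the model name and sampling settings are placeholders, and exact API details may differ between vLLM versions.

```python
# Minimal vLLM offline-inference sketch (install first, e.g.: pip install vllm).
from vllm import LLM, SamplingParams

prompts = ["The capital of France is", "Explain KV caching in one sentence:"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="facebook/opt-125m")           # PagedAttention manages the KV cache internally
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```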

lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams. It has the following core features (a short usage sketch follows the list):

  • Efficient Inference Engine (TurboMind): Based on FasterTransformer, we have implemented an efficient inference engine - TurboMind, which supports the inference of LLaMA and its variant models on NVIDIA GPUs.
  • Interactive Inference Mode: By caching the k/v of attention during multi-round dialogue processes, it remembers dialogue history, thus avoiding repetitive processing of historical sessions.
  • Multi-GPU Model Deployment and Quantization: We provide comprehensive model deployment and quantization support, validated at different model scales.
  • Persistent Batch Inference: Further optimization of model execution efficiency.
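A minimal usage sketch with lmdeploy's high-level `pipeline` API; it assumes a recent lmdeploy release, and the model name is a placeholder.

```python
# Minimal lmdeploy chat sketch (install first, e.g.: pip install lmdeploy).
from lmdeploy import pipeline

pipe = pipeline("internlm/internlm2-chat-7b")   # TurboMind engine is used by default on NVIDIA GPUs
responses = pipe(["Summarize what a KV cache is.",
                  "What does persistent batch inference do?"])
for r in responses:
    print(r.text)
```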

4-bit Inference

This release adds efficient inference routines for batch size 1. Expected speedups vs 16-bit precision (fp16/bf16) for matrix multiplications with an inner product dimension of at least 4096 (LLaMA 7B) are:

  • 2.2x for Turing (T4, RTX 2080, etc.)
  • 3.4x for Ampere (A100, A40, RTX 3090, etc.)
  • 4.0x for Ada/Hopper (H100, L40, RTX 4090, etc.)

The inference kernels for batch size 1 are about 8x faster than the 4-bit training kernels used for QLoRA. This means you can take advantage of the new kernels by splitting a multi-batch 4-bit query into multiple requests with batch size 1.

No code changes are needed to take advantage of the new kernels as long as a batch size of 1 is used.
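For context, a typical way to hit this path from Python is the bitsandbytes 4-bit integration in Hugging Face transformers; the sketch below is illustrative (the model name is a placeholder), and it uses a single prompt so the batch size stays at 1, the case the fast kernels target.

```python
# 4-bit inference sketch via the transformers/bitsandbytes integration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "huggyllama/llama-7b"                 # placeholder model name
bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             quantization_config=bnb_config,
                                             device_map="auto")

inputs = tokenizer("4-bit inference lets us", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)   # single prompt -> batch size 1 fast path
print(tokenizer.decode(out[0], skip_special_tokens=True))
```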

Big thanks to @crowsonkb, @Birch-san, and @sekstini for some beta testing and helping to debug some early errors.

LLM Runtime and Serving Architecture

  • continuous batching (sketched below)
  • in-flight batching
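Since the two items above are only named, here is a toy scheduler loop sketching the idea behind continuous (in-flight) batching: finished sequences leave the running batch after every decode step and waiting requests join immediately, instead of waiting for an entire static batch to drain. `step_decode_one_token` and the request dictionaries are hypothetical.

```python
# Toy continuous-batching scheduler loop (illustrative only).
from collections import deque

def serve(waiting: deque, step_decode_one_token, max_batch: int = 8):
    """Each request is a dict with 'tokens' (list) and 'max_new_tokens' (int)."""
    running = []                                    # requests currently being decoded
    while waiting or running:
        # admit new requests into any free batch slots (the "continuous" part)
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # one decode step for every running request
        for req in running:
            req["tokens"].append(step_decode_one_token(req))
        # retire finished requests right away; their slots free up next step
        running = [r for r in running
                   if len(r["tokens"]) < r["max_new_tokens"]]
```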

Further References on Acceleration Techniques