This page summarizes techniques for high-performance LLM inference acceleration, covering several areas:
- Computation graph and op optimization
- Inference frameworks
- LLM architecture
Computation Graph and Op Optimization
PagedAttention
Inference Libraries
vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention
LLMs promise to fundamentally change how we use AI across all industries. However, actually serving these models is challenging and can be surprisingly slow even on expensive hardware. Today we are excited to introduce vLLM, an open-source library for fast LLM inference and serving. vLLM utilizes PagedAttention, our new attention algorithm that effectively manages attention keys and values. vLLM equipped with PagedAttention redefines the state of the art in LLM serving: it delivers up to 24x higher throughput than HuggingFace Transformers, without requiring any model architecture changes.
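To make the idea concrete, here is a minimal Python sketch of block-based KV cache management in the spirit of PagedAttention: each sequence's KV cache lives in fixed-size blocks that need not be contiguous in memory, and a per-sequence block table maps logical positions to physical blocks. All names here (`BlockAllocator`, `Sequence`, `BLOCK_SIZE`) are hypothetical; this is an illustration of the concept, not vLLM's actual implementation.

```python
# Minimal sketch of block-based KV cache management in the spirit of
# PagedAttention. All names here are hypothetical, not vLLM's code.
from typing import List

BLOCK_SIZE = 16  # tokens stored per physical KV block

class BlockAllocator:
    """Hands out fixed-size physical KV blocks from a shared pool."""
    def __init__(self, num_blocks: int):
        self.free_blocks: List[int] = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise RuntimeError("KV cache exhausted")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)

class Sequence:
    """One request; maps logical block indices to physical block ids."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: List[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is allocated only when the last one fills
        # up, so waste is bounded by one partially filled block per seq.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def release(self) -> None:
        while self.block_table:
            self.allocator.free(self.block_table.pop())

allocator = BlockAllocator(num_blocks=1024)
seq = Sequence(allocator)
for _ in range(40):        # 40 tokens -> ceil(40/16) = 3 blocks
    seq.append_token()
print(seq.block_table)     # 3 possibly non-contiguous physical blocks
seq.release()
```

Because blocks are freed back to a shared pool as soon as a sequence finishes, memory fragmentation stays low and many more concurrent sequences fit in the same GPU memory, which is where the throughput gain comes from.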
vLLM has been developed at UC Berkeley and deployed on Chatbot Arena and Vicuna Demo for the past two months. It is the core technology that makes LLM serving affordable even for a small research team like LMSYS with limited compute resources. You can now try out vLLM with a single command from our GitHub repository.
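For reference, a minimal offline-inference example using vLLM's Python API looks like the following; the model name is just an illustration, and any supported HuggingFace causal LM can be substituted.

```python
# pip install vllm
from vllm import LLM, SamplingParams

# The model name below is only an example.
llm = LLM(model="lmsys/vicuna-7b-v1.3")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["The future of AI is"], sampling_params)
for output in outputs:
    print(output.prompt, output.outputs[0].text)
```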
lmdeploy
LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams. It has the following core features:
- Efficient Inference Engine (TurboMind): Based on FasterTransformer, we have implemented an efficient inference engine - TurboMind, which supports the inference of LLaMA and its variant models on NVIDIA GPUs.
- Interactive Inference Mode: By caching the attention k/v during multi-round dialogues, the engine remembers dialogue history and avoids re-processing historical sessions (see the sketch after this list).
- Multi-GPU Model Deployment and Quantization: We provide comprehensive model deployment and quantization support, validated at different model scales.
- Persistent Batch Inference: Incoming requests are batched continuously during execution, further improving model execution efficiency.
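The interactive-inference idea from the list above can be illustrated with HuggingFace Transformers' `past_key_values` mechanism: only each new turn's tokens are fed to the model, while the cached k/v stands in for the full dialogue history. This is a conceptual sketch (using `gpt2` as a small stand-in model), not TurboMind's actual implementation.

```python
# Conceptual sketch of multi-round k/v caching using HuggingFace's
# past_key_values mechanism; NOT TurboMind's implementation, only an
# illustration of the idea behind interactive inference.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # small stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

past_key_values = None  # the persistent per-session KV cache
for turn in ["Hello.", " How are you today?"]:
    ids = tokenizer(turn, return_tensors="pt").input_ids
    with torch.no_grad():
        # Only the new turn's tokens are processed; the cached k/v of
        # earlier turns stands in for the full dialogue history.
        out = model(ids, past_key_values=past_key_values, use_cache=True)
    past_key_values = out.past_key_values   # carry the cache forward
    next_id = out.logits[:, -1].argmax(-1)  # greedy next token (demo only)
    print(tokenizer.decode(next_id))
```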
4-bit Inference
This release adds efficient inference routines for batch size 1. Expected speedups vs. 16-bit precision (fp16/bf16) for matrix multiplications with an inner-product dimension of at least 4096 (LLaMA 7B) are:
- 2.2x for Turing (T4, RTX 2080, etc.)
- 3.4x for Ampere (A100, A40, RTX 3090, etc.)
- 4.0x for Ada/Hopper (H100, L40, RTX 4090, etc.)
The inference kernels for batch size 1 are about 8x faster than the 4-bit training kernels used for QLoRA. This means you can take advantage of the new kernels by splitting a multi-batch 4-bit query into multiple batch-size-1 requests, as sketched below.
No code changes are needed to take advantage of the new kernels as long as a batch size of 1 is used.
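A hypothetical sketch of that batch-splitting trick, assuming a HuggingFace-style `model`/`tokenizer` pair where the model was loaded in 4-bit via bitsandbytes; the helper `generate_split` is made up for illustration and is not part of the library.

```python
# Hypothetical helper (not part of bitsandbytes): run each prompt as
# its own batch-size-1 call so the fast 4-bit inference kernels, which
# only cover batch size 1, are used for every sample.
from typing import List

def generate_split(model, tokenizer, prompts: List[str], **gen_kwargs) -> List[str]:
    outputs = []
    for prompt in prompts:
        ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
        out = model.generate(ids, **gen_kwargs)  # batch size 1 per call
        outputs.append(tokenizer.decode(out[0], skip_special_tokens=True))
    return outputs
```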
Big thanks to @crowsonkb, @Birch-san, and @sekstini for some beta testing and helping to debug some early errors.