A Look Back at FasterTransformer
TensorRT-LLM is due for release in October, so here is a quick review of FasterTransformer beforehand.
Tensor parallelism means that each tensor is split into multiple chunks, and each chunk can be placed on a separate GPU. During computation, each chunk is processed in parallel on a different GPU, and the final tensor is obtained by combining the results from the multiple GPUs.
Pipeline parallelism means that the model is split along its depth, with different complete layers placed on different GPUs/nodes.
Under the hood, inter-node/intra-node communication relies on MPI and NVIDIA NCCL. With this software stack, you can run LLMs in tensor-parallel mode on multiple GPUs to reduce computational latency.
At the same time, TP and PP can be combined to run large Transformer models with billions or trillions of parameters (amounting to terabytes of weights) in multi-GPU, multi-node environments.
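To make the tensor-parallel mode concrete, below is a minimal sketch (not FasterTransformer's own code) of the communication step: each GPU computes a partial matmul result for its shard of the weights, and the partial results are summed with ncclAllReduce. The two-GPU setup, shapes, and buffer names are illustrative assumptions.

```cpp
// Minimal sketch: sum per-GPU partial matmul outputs with NCCL so that every
// GPU ends up with the full result. Requires two visible GPUs and NCCL.
#include <nccl.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    const int nGpus = 2;
    const int rows = 1024, cols = 1024;          // shape of the partial output Y (assumed)
    std::vector<float*> partialY(nGpus);
    std::vector<cudaStream_t> streams(nGpus);

    ncclComm_t comms[2];
    int devs[2] = {0, 1};
    ncclCommInitAll(comms, nGpus, devs);          // one communicator per GPU

    for (int i = 0; i < nGpus; ++i) {
        cudaSetDevice(i);
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&partialY[i], rows * cols * sizeof(float));
        // ... each GPU would run its partial GEMM into partialY[i] here ...
    }

    // Sum the partial results across GPUs (the reduction step of tensor parallelism).
    ncclGroupStart();
    for (int i = 0; i < nGpus; ++i) {
        ncclAllReduce(partialY[i], partialY[i], rows * cols,
                      ncclFloat, ncclSum, comms[i], streams[i]);
    }
    ncclGroupEnd();

    for (int i = 0; i < nGpus; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
        cudaFree(partialY[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```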
In addition to the source code written in C++, FasterTransformer also provides TensorFlow integration (via a TensorFlow op), PyTorch integration (via a PyTorch op), and Triton integration as a backend.
Optimizations in FasterTransformer
Layer fusion – a set of techniques in the pre-processing stage that combine multiple NN layers into a single one that is computed with a single kernel. This technique reduces data transfer and increases math density, thus accelerating computation at the inference stage. For example, all the operations in the multi-head attention block can be combined into one kernel.
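As an illustration of what computing several operations with one kernel looks like, here is a minimal sketch of a fused bias-add + GELU kernel of the kind used in transformer FFN blocks; the kernel body and launch configuration are illustrative, not FT's actual implementation.

```cpp
// Minimal fusion sketch: instead of one kernel for the bias add and another
// for GELU, a single fused kernel does both, so the intermediate tensor never
// makes a round trip to global memory between two launches.
#include <cuda_runtime.h>
#include <math.h>

__global__ void fusedBiasGelu(float* x, const float* bias, int n, int hidden) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float v = x[idx] + bias[idx % hidden];   // bias add
        // tanh-approximation GELU, as commonly used in transformer FFNs
        float cdf = 0.5f * (1.0f + tanhf(0.7978845608f * (v + 0.044715f * v * v * v)));
        x[idx] = v * cdf;                        // GELU, written in place
    }
}

// Launch example: x has n = tokens * hidden elements.
// fusedBiasGelu<<<(n + 255) / 256, 256>>>(d_x, d_bias, n, hidden);
```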
Inference optimization for autoregressive models / activations caching
To prevent recomputing the previous keys and values for each new token generated by the transformer, FT allocates a buffer to store them at each step.
Although this costs some additional memory, FT saves the cost of recomputation, of allocating a buffer at each step, and of concatenation. The scheme of the process is presented in Figure 2. The same caching mechanism is used in multiple parts of the NN.
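A minimal sketch of this caching idea (the layout and names are illustrative assumptions, not FT's actual buffers): the key/value cache is allocated once for the maximum sequence length, and each decoding step only writes the new token's entries.

```cpp
// Minimal K/V cache sketch: allocate once for the whole generation, then each
// decoding step appends the new token's key/value instead of recomputing or
// reallocating/concatenating the past ones.
#include <cuda_runtime.h>
#include <cuda_fp16.h>

struct KVCache {
    half* k;  // [num_layers, batch, num_heads, max_seq_len, head_dim] (assumed layout)
    half* v;  // same layout as k
};

KVCache allocateKVCache(int numLayers, int batch, int numHeads,
                        int maxSeqLen, int headDim) {
    KVCache cache{};
    size_t elems = (size_t)numLayers * batch * numHeads * maxSeqLen * headDim;
    cudaMalloc(&cache.k, elems * sizeof(half));   // allocated once, reused every step
    cudaMalloc(&cache.v, elems * sizeof(half));
    return cache;
}

// At decoding step `step`, each layer writes only the new token's K/V into
// offset [..., step, :] of the cache and attends over positions [0, step].
```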
Memory optimization
Unlike traditional models such as BERT, large transformer models have up to trillions of parameters, taking hundreds of GB of storage. GPT-3 175B takes 350 GB even when the model is stored in half precision. It is therefore necessary to reduce memory usage for the other parts.
For example, in FasterTransformer, we reuse the memory buffer of activations/outputs in different decoder layers. Since the number of layers in GPT-3 is 96, we only need 1/96 of the amount of memory for activations.
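A minimal sketch of this reuse pattern, under the assumption that each decoder layer only needs its predecessor's output: two ping-pong buffers sized for one layer serve all layers, instead of one buffer per layer. The function and buffer names are illustrative.

```cpp
// Minimal activation-buffer reuse sketch: allocate two buffers once and swap
// them between layers, rather than allocating activations for all 96 layers.
#include <cuda_runtime.h>

void runDecoder(int numLayers, size_t activationBytes) {
    void* bufA = nullptr;
    void* bufB = nullptr;
    cudaMalloc(&bufA, activationBytes);   // allocated once, not per layer
    cudaMalloc(&bufB, activationBytes);

    void* in = bufA;
    void* out = bufB;
    for (int layer = 0; layer < numLayers; ++layer) {
        // decoderLayerForward(layer, in, out);   // hypothetical per-layer call
        void* tmp = in; in = out; out = tmp;      // swap: the next layer reuses the buffers
    }

    cudaFree(bufA);
    cudaFree(bufB);
}
```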
Usage of MPI and NCCL to enable inter/intra-node communication and support model parallelism
In the GPT model, FasterTransformer provides both tensor parallelism and pipeline parallelism. For tensor parallelism, FasterTransformer follows the idea of Megatron. For both the self-attention block and the feed-forward network block, FT splits the weights of the first matrix by row and the weights of the second matrix by column. With this optimization, FT can reduce the number of reduction operations to two per transformer block.
For pipeline parallelism, FasterTransformer splits the whole batch of requests into multiple micro-batches to hide the communication bubble. FasterTransformer adjusts the micro-batch size automatically for different cases.
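The scheduling idea behind micro-batching can be sketched as follows (illustrative, not FT's actual scheduler): at every pipeline tick, stage s works on micro-batch tick − s, so different stages are busy with different micro-batches at the same time and the idle "bubble" shrinks. The sizes below are assumptions.

```cpp
// Minimal pipeline micro-batching schedule: a batch is split into micro-batches
// so that while stage s works on micro-batch m, stage s-1 can already start m+1.
#include <cstdio>

int main() {
    const int numStages = 4;          // pipeline stages (GPUs/nodes), assumed
    const int batchSize = 32;
    const int microBatchSize = 8;     // FT chooses this automatically per case
    const int numMicroBatches = batchSize / microBatchSize;

    // Each "tick": stage s handles micro-batch (tick - s), so stages overlap.
    for (int tick = 0; tick < numMicroBatches + numStages - 1; ++tick) {
        for (int s = 0; s < numStages; ++s) {
            int m = tick - s;
            if (m >= 0 && m < numMicroBatches) {
                printf("tick %d: stage %d processes micro-batch %d\n", tick, s, m);
            }
        }
    }
    return 0;
}
```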
MatMul kernel autotuning (GEMM autotuning)
Matrix multiplication is the main and heaviest operation in transformer-based neural networks. FT uses functionality from the cuBLAS and CUTLASS libraries to execute these operations. It is important to know that the MatMul operation can be executed in tens of different ways, using different low-level algorithms at the "hardware" level.
The cublasGemmBatchedEx function implements the MatMul operation and takes cublasGemmAlgo_t as an input parameter. Using this parameter, you can choose different low-level algorithms for the operation.
The FasterTransformer library uses this parameter to do a real-time benchmark of all low-level algorithms and to choose the best one for the parameters of the model (size of the attention layers, number of attention heads, size of the hidden layer) and for your input data. Additionally, FT uses hardware-accelerated low-level functions for some parts of the network, such as __expf and __shfl_xor_sync.
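In the spirit of FT's GEMM test, a minimal autotuning sketch looks like this (illustrative, not FT's actual code): time cublasGemmEx for the model's fixed GEMM shape with every cublasGemmAlgo_t candidate and keep the fastest. The matrix sizes and iteration count are assumptions.

```cpp
// Minimal GEMM autotuning sketch: benchmark each cublasGemmAlgo_t for a fixed
// (m, n, k) and remember the fastest supported algorithm.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int m = 1024, n = 1024, k = 4096;   // a shape fixed by the model config (assumed)
    float *A, *B, *C;
    cudaMalloc(&A, sizeof(float) * m * k);
    cudaMalloc(&B, sizeof(float) * k * n);
    cudaMalloc(&C, sizeof(float) * m * n);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int bestAlgo = CUBLAS_GEMM_DEFAULT;
    float bestMs = 1e30f;

    for (int algo = (int)CUBLAS_GEMM_DEFAULT; algo <= (int)CUBLAS_GEMM_ALGO23; ++algo) {
        // Warm-up call; skip algorithms this GPU/shape does not support.
        cublasStatus_t st = cublasGemmEx(
            handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &alpha,
            A, CUDA_R_32F, m, B, CUDA_R_32F, k, &beta, C, CUDA_R_32F, m,
            CUBLAS_COMPUTE_32F, (cublasGemmAlgo_t)algo);
        if (st != CUBLAS_STATUS_SUCCESS) continue;

        cudaEventRecord(start);
        for (int it = 0; it < 10; ++it) {
            cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &alpha,
                         A, CUDA_R_32F, m, B, CUDA_R_32F, k, &beta, C, CUDA_R_32F, m,
                         CUBLAS_COMPUTE_32F, (cublasGemmAlgo_t)algo);
        }
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < bestMs) { bestMs = ms; bestAlgo = algo; }
    }

    printf("best algo: %d (%.3f ms / 10 iters)\n", bestAlgo, bestMs);
    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```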
Inference with lower precisions
FT has kernels that support inference using low-precision input data in fp16 and int8. Both modes allow acceleration due to the lower amount of data transfer and required memory. At the same time, int8 and fp16 computations can be executed on special hardware, such as the tensor cores (on all GPU architectures starting from Volta) and the Transformer Engine in the upcoming Hopper GPUs.
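A minimal sketch of an fp16 GEMM that can be dispatched to tensor cores (illustrative shapes, not FT's kernels): activations and weights are stored in half precision, halving memory traffic, while accumulation stays in fp32 for numerical stability.

```cpp
// Minimal fp16 GEMM sketch: half-precision inputs/outputs, fp32 accumulation,
// with tensor-core algorithms allowed via the DEFAULT_TENSOR_OP algo.
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

int main() {
    const int m = 1024, n = 1024, k = 4096;   // illustrative shape
    half *A, *B, *C;
    cudaMalloc(&A, sizeof(half) * m * k);
    cudaMalloc(&B, sizeof(half) * k * n);
    cudaMalloc(&C, sizeof(half) * m * n);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;    // scaling factors match the fp32 compute type

    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &alpha,
                 A, CUDA_R_16F, m,
                 B, CUDA_R_16F, k, &beta,
                 C, CUDA_R_16F, m,
                 CUBLAS_COMPUTE_32F,                 // fp32 accumulation
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);     // allow tensor-core algorithms

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```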
TensorRT-LLM replaces FT and will probably also implement something similar to continuous batching. The ft-backend is no longer maintained, but there will be a TensorRT-LLM backend.