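Under the hood, inter-node and intra-node communication relies on MPI and NVIDIA NCCL. With this software stack, you can run LLMs in tensor-parallel mode across multiple GPUs to reduce computational latency.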
Besides the source code written in C++, FasterTransformer also provides TensorFlow integration (using a TensorFlow op), PyTorch integration (using a PyTorch op), and Triton integration as a backend.
Optimizations in FasterTransformer
Layer fusion – A set of techniques applied in the pre-processing stage that combine multiple neural network layers into a single layer computed by a single kernel. This reduces data transfer and increases math density, which accelerates computation at inference time. For example, all the operations in the multi-head attention block can be combined into one kernel.
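As a rough illustration of the idea (not FT's actual kernels), the sketch below contrasts an unfused bias-add followed by GELU, which makes two passes over the data and writes an intermediate buffer, with a fused version that does both in a single pass. The function names are hypothetical, and plain Python loops stand in for GPU kernels.

```python
import math

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def bias_then_gelu_unfused(xs, bias):
    # "Kernel" 1: add the bias and write an intermediate buffer.
    tmp = [x + b for x, b in zip(xs, bias)]
    # "Kernel" 2: read the intermediate buffer back and apply GELU.
    return [gelu(t) for t in tmp]

def bias_gelu_fused(xs, bias):
    # One "kernel": each element is read once and no intermediate buffer is written.
    return [gelu(x + b) for x, b in zip(xs, bias)]

xs = [-1.0, 0.0, 0.5, 2.0]
bias = [0.1, 0.1, 0.1, 0.1]
unfused = bias_then_gelu_unfused(xs, bias)
fused = bias_gelu_fused(xs, bias)
```

Both versions produce the same values; the fused one simply avoids the round trip through memory, which is where the saving comes from on a GPU.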
Inference optimization for autoregressive models / activations caching
To avoid recomputing the keys and values of previous tokens for each new token generated by the transformer, FT allocates a buffer and stores them in it at each step.
Although this takes some additional memory, FT saves the cost of recomputation, of allocating a buffer at each step, and of concatenation. The scheme of the process is presented in Figure 2. The same caching mechanism is used in multiple parts of the NN.
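The caching idea can be sketched as follows. This is a minimal single-head example under stated assumptions, not FT's implementation: `KVCache`, `attend`, and `project_kv` are hypothetical names, and `project_kv` is a toy stand-in for the learned key/value projections.

```python
import math

def attend(q, keys, values, length):
    # Scaled dot-product attention of one query over the first `length` cache slots.
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, keys[t])) for t in range(length)]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(weights[t] * values[t][d] for t in range(length)) / total
            for d in range(len(values[0]))]

class KVCache:
    def __init__(self, max_len, dim):
        # Allocated once, up front: no per-step allocation or concatenation.
        self.keys = [[0.0] * dim for _ in range(max_len)]
        self.values = [[0.0] * dim for _ in range(max_len)]
        self.length = 0

    def append(self, k, v):
        # Write the new token's key/value into the next free slot.
        self.keys[self.length][:] = k
        self.values[self.length][:] = v
        self.length += 1

def project_kv(x):
    # Hypothetical stand-in for the learned key/value projections.
    return [2.0 * xi for xi in x], [xi + 1.0 for xi in x]

hidden = [[0.1, 0.2], [0.3, -0.1], [0.5, 0.4]]
cache = KVCache(max_len=8, dim=2)
cached_out = []
for x in hidden:
    k, v = project_kv(x)  # only the new token is projected
    cache.append(k, v)
    cached_out.append(attend(x, cache.keys, cache.values, cache.length))

# Reference: recompute every key and value from scratch at the last step.
all_kv = [project_kv(x) for x in hidden]
ref = attend(hidden[-1], [k for k, _ in all_kv], [v for _, v in all_kv], len(hidden))
```

The cached path projects one token per step, while the reference path re-projects the whole prefix; both give the same attention output for the last token.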
Memory optimization
Unlike traditional models such as BERT, large transformer models have up to trillions of parameters and take hundreds of GB of storage. GPT-3 175B takes 350 GB even when the model is stored in half precision. It’s therefore necessary to reduce the memory used by other parts.
For example, FasterTransformer reuses the memory buffer of activations/outputs across decoder layers. Since GPT-3 has 96 layers, only 1/96 of the memory for activations is needed.
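The arithmetic behind these numbers is short enough to spell out. In the sketch below, the parameter count and layer count come from the text; the batch size and sequence length are assumptions chosen only for illustration, and the hidden size of 12288 is the published GPT-3 175B value.

```python
# Weight storage: 175B parameters at 2 bytes each (half precision).
num_params = 175 * 10**9
bytes_per_param_fp16 = 2
weight_gb = num_params * bytes_per_param_fp16 / 10**9

# Activation buffers: one output buffer per decoder layer vs. one shared buffer.
num_layers = 96
batch_size, seq_len, hidden_size = 8, 2048, 12288  # batch_size and seq_len are assumed
buffer_bytes = batch_size * seq_len * hidden_size * bytes_per_param_fp16

without_reuse = num_layers * buffer_bytes  # a separate buffer for every layer
with_reuse = buffer_bytes                  # the same buffer reused by every layer

print(f"weights: {weight_gb:.0f} GB")
print(f"activation buffers: {without_reuse / 2**30:.1f} GiB -> {with_reuse / 2**30:.3f} GiB")
```

Whatever shapes are plugged in, the ratio between the two activation figures is the layer count, which is where the 1/96 comes from.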
Usage of MPI and NCCL to enable inter/intra-node communication and support model parallelism
In the GPT model, FasterTransformer provides both tensor parallelism and pipeline parallelism. For tensor parallelism, FasterTransformer follows the idea of Megatron. For both the self-attention block and the feed-forward network block, FT splits the weights of the first matrix by row and the weights of the second matrix by column. With this optimization, FT needs only two reduction operations per transformer block.
For pipeline parallelism, FasterTransformer splits the whole batch of requests into multiple micro-batches, which hides the communication bubble. FasterTransformer adjusts the micro-batch size automatically for different cases.
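To make the split concrete, here is a minimal sketch of a Megatron-style feed-forward block simulated on one process. It is an illustration under assumptions, not FT's code: weights are stored as (input, output) matrices, ReLU stands in for the activation, and the loop over `rank` stands in for separate GPUs. Whether a given split is called "by row" or "by column" depends on how the weights are stored; what matters is that the first matrix is split along its output dimension and the second along the matching input dimension.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def relu(m):
    return [[max(0.0, v) for v in row] for row in m]

def mlp(x, w1, w2):
    return matmul(relu(matmul(x, w1)), w2)

def tensor_parallel_mlp(x, w1, w2, world_size):
    inner = len(w2)  # inner dimension shared by w1's outputs and w2's inputs
    shard = inner // world_size
    partials = []
    for rank in range(world_size):
        lo, hi = rank * shard, (rank + 1) * shard
        w1_shard = [row[lo:hi] for row in w1]  # slice of the first matrix's output dimension
        w2_shard = w2[lo:hi]                   # matching slice of the second matrix's input dimension
        partials.append(mlp(x, w1_shard, w2_shard))  # no communication needed so far
    # The single reduction (an all-reduce sum in a real multi-GPU setup) for this block.
    return [[sum(p[i][j] for p in partials) for j in range(len(partials[0][0]))]
            for i in range(len(partials[0]))]

x = [[1.0, -2.0, 0.5], [0.0, 1.5, -1.0]]
w1 = [[0.2, -0.1, 0.4, 0.3], [0.5, 0.1, -0.3, 0.2], [-0.2, 0.6, 0.1, -0.4]]
w2 = [[0.1, 0.2], [-0.3, 0.4], [0.5, -0.1], [0.2, 0.3]]
full = mlp(x, w1, w2)
sharded = tensor_parallel_mlp(x, w1, w2, world_size=2)
```

Because the activation is elementwise, each shard can run both matrix multiplications locally and only the final partial sums need to be combined. Doing the same for the attention block gives the second reduction per transformer block.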
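Why micro-batches help can be seen with a simple idealized model. The sketch below assumes every pipeline stage takes the same time per micro-batch and looks at the forward pass only; the helper names are hypothetical, and the micro-batch size is passed in by hand rather than chosen automatically as FT does.

```python
def split_into_micro_batches(batch, micro_batch_size):
    # Consecutive slices of the request batch; the last one may be smaller.
    return [batch[i:i + micro_batch_size] for i in range(0, len(batch), micro_batch_size)]

def pipeline_idle_fraction(num_stages, num_micro_batches):
    # With equal-cost stages, the pipeline needs (micro-batches + stages - 1) time steps,
    # and each stage is idle for (stages - 1) of them.
    total_steps = num_micro_batches + num_stages - 1
    return (num_stages - 1) / total_steps

requests = list(range(8))
micro_batches = split_into_micro_batches(requests, micro_batch_size=2)

idle_one_batch = pipeline_idle_fraction(num_stages=4, num_micro_batches=1)
idle_micro = pipeline_idle_fraction(num_stages=4, num_micro_batches=len(micro_batches))
```

In this model a 4-stage pipeline fed one large batch leaves each stage idle 75% of the time, and splitting the same work into 4 micro-batches brings that down to 3/7.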
MatMul kernel autotuning (GEMM autotuning)
Matrix multiplication is the main and heaviest operation in transformer-based neural networks. FT uses functionality from the cuBLAS and CUTLASS libraries to execute these operations. It is important to know that a MatMul operation can be executed in tens of different ways using different low-level algorithms at the “hardware” level.
The cublasGemmBatchedEx function implements the MatMul operation and takes a “cublasGemmAlgo_t” value as an input parameter. With this parameter, you can choose among different low-level algorithms for the operation.
The FasterTransformer library uses this parameter to run a real-time benchmark of all low-level algorithms and to choose the best one for the parameters of the model (size of the attention layers, number of attention heads, size of the hidden layer) and for your input data. Additionally, FT uses hardware-accelerated low-level functions such as __expf and __shfl_xor_sync for some parts of the network.
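The autotuning loop itself is simple: time every candidate on the actual problem shape and keep the fastest. The sketch below shows that pattern in plain Python. It does not call cuBLAS; the two loop orderings are hypothetical stand-ins for the algorithms selectable through cublasGemmAlgo_t, and `autotune` is an illustrative name.

```python
import time

def matmul_ijk(a, b):
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)] for i in range(n)]

def matmul_ikj(a, b):
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for p in range(k):
            aip, row_b, row_o = a[i][p], b[p], out[i]
            for j in range(m):
                row_o[j] += aip * row_b[j]
    return out

def autotune(candidates, a, b, repeats=3):
    # Benchmark each candidate on the real shapes and return the name of the fastest.
    best_name, best_time = None, float("inf")
    for name, fn in candidates.items():
        start = time.perf_counter()
        for _ in range(repeats):
            fn(a, b)
        elapsed = (time.perf_counter() - start) / repeats
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name

candidates = {"ijk": matmul_ijk, "ikj": matmul_ikj}
a = [[float(i + j) for j in range(16)] for i in range(16)]
b = [[float(i - j) for j in range(16)] for i in range(16)]
best = autotune(candidates, a, b)
result = candidates[best](a, b)
```

The winner depends on the matrix shapes and the machine, which is why the benchmark is run for the specific model configuration rather than decided once in advance.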
Inference with lower precisions
FT has kernels that support inference with low-precision input data in fp16 and int8. Both modes speed up inference because less data is transferred and less memory is required. In addition, int8 and fp16 computations can run on specialized hardware such as Tensor Cores (on all GPU architectures starting from Volta) and the Transformer Engine in Hopper GPUs.
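To give a feel for what int8 storage involves, here is a minimal sketch of symmetric per-tensor quantization. This is a common textbook scheme used purely for illustration; the text does not describe which quantization scheme FT's int8 kernels use, and the function names are hypothetical.

```python
def quantize_int8(values):
    # One scale for the whole tensor, mapping the largest magnitude to 127.
    scale = (max(abs(v) for v in values) / 127.0) or 1.0  # avoid a zero scale for all-zero input
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

weights = [0.02, -1.27, 0.635, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

# Storage per tensor at each precision, in bytes.
bytes_fp32 = 4 * len(weights)
bytes_fp16 = 2 * len(weights)
bytes_int8 = 1 * len(weights)
```

Each value is stored in one byte instead of two (fp16) or four (fp32), at the cost of a rounding error of at most half the scale per element.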
TensorRT-LLM replaces FT and may also implement something similar to continuous batching. The FT backend is no longer maintained, but there will be a TensorRT-LLM backend.