TensorRT 9.0 and TensorRT-LLM are coming soon

TensorRT-LLM has been announced. It is the successor to FasterTransformer, essentially a large-language-model flavor of TensorRT, and it depends on TensorRT 9.x to run.

TensorRT 9.0

Here are the main new features:

  • A special TensorRT release: some GPU architectures are not supported, e.g. Turing (source: the NVIDIA AI Inference Day online seminar on LLM inference). Also, Torch-TensorRT stays on TensorRT 8.6 for now and will migrate to TensorRT 10.0 later
  • Writing TensorRT plugins is more convenient: plugins can now be written in Python
  • Updated support and optimizations for SD- and NLP-related models, keeping up with the times

For more details, see:

Introduction to TensorRT-LLM

TensorRT-LLM takes FasterTransformer a step further and turns it into a productized solution. With TensorRT-LLM, AI developers can implement deep learning inference applications more easily and boost performance through optimized LLMs. TensorRT-LLM retains the core functionality of FasterTransformer and adds support for new architectures and features through an open-source, modular Python API, improving ease of use and extensibility. With this newly released open-source code, AI inference developers can now deploy production-grade applications, cut costs, reduce complexity, and improve the overall user experience.

TensorRT-LLM currently has no ONNX parser, so you cannot go through an ONNX workflow; models must be built by hand.

Building the network manually is the mainstream approach for large models today; exporting to ONNX would require splitting the model into many parts before it could be converted.

Core components of TensorRT-LLM

  • KV cache (borrowing the paged approach from vLLM? see the sketch right after this list)
  • Highly optimized self-attention (extreme performance tuning)
  • Server-side optimizations (supports in-flight batching, similar to continuous batching)
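
To make the "paged" idea concrete, here is a small conceptual sketch (plain NumPy, not TensorRT-LLM's actual implementation): the cache is split into fixed-size blocks and a per-sequence block table maps logical token positions to physical blocks, so memory is allocated on demand instead of being reserved up front for the maximum sequence length.

# Conceptual sketch of a paged KV cache; block size and shapes are made up for illustration.
import numpy as np

BLOCK_SIZE, NUM_BLOCKS, NUM_HEADS, HEAD_DIM = 16, 64, 8, 64
k_cache = np.zeros((NUM_BLOCKS, NUM_HEADS, BLOCK_SIZE, HEAD_DIM), dtype=np.float16)

free_blocks = list(range(NUM_BLOCKS))   # pool of physical blocks
block_tables = {}                       # sequence id -> list of physical block ids

def append_k(seq_id: int, pos: int, k_vec: np.ndarray) -> None:
    """Store the key vector for token `pos` of sequence `seq_id`, allocating blocks lazily."""
    table = block_tables.setdefault(seq_id, [])
    if pos // BLOCK_SIZE >= len(table):          # this logical position needs a new block
        table.append(free_blocks.pop())
    block = table[pos // BLOCK_SIZE]
    k_cache[block, :, pos % BLOCK_SIZE, :] = k_vec

append_k(seq_id=0, pos=0, k_vec=np.ones((NUM_HEADS, HEAD_DIM), dtype=np.float16))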

Although the name mentions LLM (Large Language Model), TensorRT-LLM can actually be used to build arbitrary AI models, in both single-GPU and multi-GPU setups.

TensorRT-LLM wraps TensorRT, the optimized kernels from FasterTransformer, pre- and post-processing, and multi-GPU/multi-node communication in a single Python API for defining, optimizing, and executing LLMs in production inference.

TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines, as well as a backend for integration with the NVIDIA Triton Inference Server.

The Python API of TensorRT-LLM is architected to look similar to the PyTorch API. It provides users with a functional module containing functions such as einsum, softmax, matmul, and view.

The layer module bundles useful building blocks for assembling LLMs, such as an Attention block, an MLP, or an entire Transformer layer. Model-specific components, like GPTAttention or BertAttention, can be found in the model module.
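
To make the analogy with PyTorch concrete, here is a minimal, hypothetical sketch of how these pieces compose. The import paths and the Linear constructor arguments are assumptions inferred from the snippets quoted later in this post (tensorrt_llm/layers/linear.py, tensorrt_llm.functional), not verified against a specific release.

# Hypothetical sketch of a tiny feed-forward block assembled from TensorRT-LLM building blocks.
from tensorrt_llm.module import Module        # assumed import path
from tensorrt_llm.layers import Linear        # see tensorrt_llm/layers/linear.py
from tensorrt_llm.functional import relu

class TinyMLP(Module):
    def __init__(self, hidden_size, ffn_size, dtype=None):
        super().__init__()
        self.fc = Linear(hidden_size, ffn_size, dtype=dtype)    # assumed (in, out, dtype) signature
        self.proj = Linear(ffn_size, hidden_size, dtype=dtype)

    def forward(self, x):
        # Each call adds TensorRT layers to the network currently being defined.
        return self.proj(relu(self.fc(x)))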

TensorRT-LLM provides users with predefined models that can easily be modified and extended. The current version of TensorRT-LLM supports BERT, GPT, NVIDIA GPT-2B, GPT-J, LLaMA, OPT, SantaCoder, and StarCoder.

To maximize performance and reduce the memory footprint, TensorRT-LLM allows models to be executed using different quantization modes (see examples/gpt for concrete examples). TensorRT-LLM supports INT4 or INT8 weights (with FP16 activations, i.e. INT4/INT8 weight-only) as well as a complete implementation of the SmoothQuant technique.
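
As a rough numerical illustration of what "INT8 weight-only" means (plain NumPy, not TensorRT-LLM's kernels): the weights are stored as INT8 with a per-output-channel scale and dequantized on the fly inside the GEMM, while the activations stay in FP16.

# Toy illustration of INT8 weight-only quantization; shapes and data are arbitrary.
import numpy as np

w = np.random.randn(4, 8).astype(np.float32)            # [out_features, in_features]
scale = np.abs(w).max(axis=1, keepdims=True) / 127.0    # per-output-channel scale
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

x = np.random.randn(2, 8).astype(np.float16)            # FP16 activations
# The optimized kernel dequantizes w_int8 inside the GEMM; here it is done explicitly.
y = x.astype(np.float32) @ (w_int8.astype(np.float32) * scale).T
print(y.shape)                                          # (2, 4)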

For more detail, let's go through the architecture introduction. Some parts were not translated in time, so the original text may be better read directly:

TensorRT-LLM Architecture

TensorRT-LLM is a toolkit to assemble optimized solutions to perform Large Language Model (LLM) inference. It offers a Python API to define models and compile efficient TensorRT engines for NVIDIA GPUs. It also contains Python and C++ components to build the runtimes that execute those engines, as well as a backend for the Triton Inference Server to easily create web-based services for LLMs. TensorRT-LLM supports multi-GPU and multi-node configurations (through MPI).

Once you have the model definition and the weights, you must re-create the model with TensorRT-LLM's Python API so that TensorRT can compile it into an efficient engine. For ease of use, TensorRT-LLM already supports a number of standard models.

In addition to the Python API for describing models, TensorRT-LLM provides components to build a runtime that executes the efficient TensorRT engines. The runtime components offer beam search, along with extensive sampling functionalities such as top-K and top-P sampling.

TensorRT-LLM also includes Python and C++ backends for the NVIDIA Triton Inference Server to assemble solutions for serving LLMs online.

Model Definition

As mentioned above, TensorRT-LLM provides a Python API for defining Large Language Models. This API is built on top of the powerful TensorRT Python API, which creates graph representations of deep neural networks in TensorRT. In other words, you build the model by hand.

In TensorRT-LLM, the tensorrt_llm.Builder class contains a tensorrt.Builder object. That instance is used in the tensorrt_llm.Builder.create_network method to create an instance of the tensorrt.INetworkDefinition class. The INetworkDefinition object can then be populated using the free functions defined in tensorrt_llm.functional.

A simple example of such a free function is tensorrt_llm.activation, which inserts a tensorrt.IActivationLayer node into the model's graph:

# In tensorrt_llm.functional:

def activation(input: Tensor, act_type: trt.ActivationType) -> Tensor:
    layer = default_trtnet().add_activation(input.trt_tensor, act_type)   # default_trtnet() -> INetworkDefinition
    return _create_tensor(layer.get_output(0), layer)

To make it even easier for users, a few of the most standard activation functions found in LLMs are derived from that function and are already defined, ready to use directly:

# In tensorrt_llm.functional:
relu = partial(activation, act_type=trt.ActivationType.RELU)
sigmoid = partial(activation, act_type=trt.ActivationType.SIGMOID)

Specialized activation functions can be used to assemble more advanced functions such as the silu activation, which is composed directly from the API:


# In tensorrt_llm.functional:
def silu(input: Tensor) -> Tensor:
    return input * sigmoid(input)

When TensorRT-LLM's Python API is utilized, a graph of the network is assembled. The graph can later be traversed or transformed using the graph traversal API exposed by the tensorrt.ILayer class. That graph will also be optimized by TensorRT during the compilation of the engine, as explained in the next section.

Once the network has been assembled with the Python API, it can be built, following the same flow as building a regular TensorRT network.

Compilation

Next comes the build step.

Once populated, the instance of the tensorrt.INetworkDefinition can be compiled into an efficient engine by the tensorrt.Builder.

In TensorRT-LLM, it is done through the build_engine member function of the tensorrt_llm.Builder class that calls the build_serialized_network method of the tensorrt.Builder object. That call, if everything works as expected, produces an instance of the tensorrt.IHostMemory class. That object is an optimized TensorRT engine that can be stored as a binary file.
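
Putting the pieces together, a hypothetical end-to-end build might look roughly like the sketch below. Only Builder, create_network, build_engine and the IHostMemory result are taken from the text above; the create_builder_config options and the way the engine is written to disk are assumptions, so refer to examples/gpt for the authoritative flow.

# Hypothetical sketch of the compile path; option names are assumptions.
import tensorrt_llm

builder = tensorrt_llm.Builder()
builder_config = builder.create_builder_config(name='gpt', precision='float16')  # assumed options
network = builder.create_network()

# ... populate `network` with tensorrt_llm.functional calls or the predefined layers,
# and bind the weights (see the next section) ...

engine = builder.build_engine(network, builder_config)   # tensorrt.IHostMemory on success
with open('gpt_float16.engine', 'wb') as f:
    f.write(engine)                                       # the engine is stored as a binary file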

Weight Bindings

The weights must be embedded in the network at build time: you extract the weights from the PyTorch side and bind them to the network that TensorRT-LLM is about to compile:

TensorRT engines embed the network weights, which must be known at compile time.
For that reason, the weights must be bound to parameters in the model definition before calling tensorrt_llm.Builder.build_engine. It leads to code like:


# The Linear operator exposes two parameters (see tensorrt_llm/layers/linear.py):
class Linear(Module):
    def __init__(self, ...):
        self.weight = Parameter(shape=(self.out_features, self.in_features), dtype=dtype)
        self.bias = Parameter(shape=(self.out_features, ), dtype=dtype)

# The parameters are bound to the weights before compiling the model. See examples/gpt/weight.py:
tensorrt_llm_gpt.layers[i].mlp.fc.weight.value = fromfile(...)
tensorrt_llm_gpt.layers[i].mlp.fc.bias.value = fromfile(...)

Note that TensorRT can also refit engines to update the weights after compilation. This feature is available to TensorRT-LLM users through the refit_engine method in the tensorrt_llm.Builder class.

Pattern-Matching and Fusion

One of the key steps performed by TensorRT when compiling the network graph is operation fusion. Fusion is a well-known technique for improving efficiency when executing LLMs. It helps reduce the amount of data transferred between memory (DRAM) and the compute cores (CUDA cores and the Tensor Cores located on the GPU's streaming multiprocessors). It also removes kernel launch overhead (every time a kernel is launched on the GPU, there is a small CPU cost known as the launch overhead). A classical example is the fusion of an activation function with the matrix multiplication (matmul) that usually precedes it in the network.

In TensorRT-LLM, when defining the model, such a sequence can be written as:

c = tensorrt_llm.functional.matmul(a, b)
c = tensorrt_llm.functional.relu(c)

During inference, if that sequence were executed without fusion, the c tensor would have to be written to global memory at the end of the matmul, read back from that same memory in the ReLU, and written again after the ReLU. If no other operation uses the intermediate value between the matmul and the ReLU, this is suboptimal. That is why, during compilation, TensorRT identifies that pattern and automatically generates a GPU kernel that applies the ReLU at the end of the matmul, without an intermediate round-trip through global memory. With that optimization, the c tensor is written only once (after the ReLU) instead of twice, and it is not read between the two operations.

The process of identifying sequences of operations that can be fused is called pattern matching. TensorRT has a powerful pattern-matching algorithm that can identify many possible fusions. All the identified patterns are converted into more efficient kernels by an advanced kernel compiler.

Plugins

The number of possible fusions is almost infinite and some useful fusions involve very advanced modifications of the graph. A well-known example is the Flash-Attention technique to optimize the Multihead-Attention block found in many LLMs. Flash-Attention requires modifications to the arithmetic performed in the sequence BMM-Softmax-BMM (where BMM stands for Batched Matrix-Matrix product) and the interleaving of the for-loops of the two batched matrix products. That’s non-trivial and not necessarily something you can expect a compiler to “discover” on its own (or it might require the support for a polyhedral model).

As a result, even if TensorRT has a powerful pattern-matching algorithm and supports a lot of possible fusions, there is always the risk that it cannot identify uncommon and/or very advanced patterns. To overcome that inevitable limitation, TensorRT offers a powerful mechanism known as plugins.

The plugins are nodes inserted in the network graph definition that map to user-defined GPU kernels. TensorRT-LLM uses a number of such plugins. They can be found in the cpp/tensorrt_llm/plugins directory.

Plugins are written in C++ and follow a well-defined interface described in the Extending TensorRT with Custom Layers section of the TensorRT Developer Guide.
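
For context, the sketch below shows roughly how a registered plugin becomes a node of a TensorRT network when using the plain TensorRT Python API; TensorRT-LLM wraps this kind of call inside its functional helpers, and the plugin name passed in here is purely illustrative.

# Sketch: look up a plugin creator in the registry and insert the plugin into the graph.
import tensorrt as trt

def add_plugin_node(network: trt.INetworkDefinition, inputs, name: str, version: str = "1"):
    registry = trt.get_plugin_registry()
    creator = registry.get_plugin_creator(name, version)   # name/version as registered by the library
    fields = trt.PluginFieldCollection([])                  # plugin-specific attributes go here
    plugin = creator.create_plugin(name, fields)
    layer = network.add_plugin_v2(inputs, plugin)           # the node that maps to the GPU kernel
    return layer.get_output(0)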

When executed within a TensorRT engine, plugins trigger the execution of their encapsulated GPU kernels. A fairly simple example of plugins is the QuantizeTensorPlugin that triggers a CUDA kernel in the QuantizeTensorPlugin::enqueue member function:


// In cpp/tensorrt_llm/plugins/quantizeTensorPlugin/quantizeTensorPlugin.cpp:
int QuantizeTensorPlugin::enqueue(...) {
    if (inputDesc[0].type == DataType::kFLOAT) {
        invokeQuantization<float>(...);
    } else {
        invokeQuantization<half>(...);
    }
    return 0;
}

// In cpp/tensorrt_llm/kernels/quantization.cu:
template <typename T>
void invokeQuantization(...) {
    // The standard <<< >>> construct to launch CUDA kernels
    quantizedKernel<<<grid, block, 0, stream>>>(...);
}

For more details on how TensorRT-LLM implements the GPT Attention operator, see the Multihead and Multiquery Attention document.

Runtime

TensorRT-LLM includes an API to implement Python and C++ runtimes. The role of the runtime components is to load the TensorRT engines and drive their execution. Typically, for an auto-regressive model like GPT, the runtime is in charge of loading the engine that implements both the processing of the input sequence as well as the body of the generation loop. See the GPT C++ Runtime document for details on the C++ Runtime.

As with FasterTransformer before it, the C++ runtime contains the entire generation loop, not just the model.
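
For orientation, here is a hypothetical sketch of driving an engine from the Python runtime. The names (Mapping, GenerationSession, ModelConfig, SamplingConfig, setup, decode) follow the GPT example in the repository, but the exact ModelConfig fields and method signatures depend on the release, so treat this as pseudocode rather than a copy-paste recipe.

# Hypothetical sketch based on the GPT example; ModelConfig fields are engine-specific and omitted.
import torch
from tensorrt_llm import Mapping
from tensorrt_llm.runtime import GenerationSession, ModelConfig, SamplingConfig

with open('gpt_float16.engine', 'rb') as f:
    engine_buffer = f.read()

model_config = ModelConfig(...)                         # heads, layers, plugins, ... of the engine
mapping = Mapping(world_size=1, rank=0, tp_size=1)      # single-GPU run
session = GenerationSession(model_config, engine_buffer, mapping)

input_ids = torch.tensor([[1, 2, 3, 4]], dtype=torch.int32, device='cuda')
input_lengths = torch.tensor([4], dtype=torch.int32, device='cuda')
sampling = SamplingConfig(end_id=50256, pad_id=50256, top_k=1)

session.setup(batch_size=1, max_context_length=4, max_new_tokens=32)
output_ids = session.decode(input_ids, input_lengths, sampling)  # runs the whole generation loop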

Multi-GPU and Multi-Node Support

Even if TensorRT is designed for single-GPU systems, TensorRT-LLM adds the support for systems with multiple GPUs and nodes. It is enabled using TensorRT plugins that wrap communication primitives from the NCCL library.

The communication plugins can be found in cpp/tensorrt_llm/plugins/ncclPlugin and the multi-GPU functions are exposed in the TensorRT-LLM Python API as:

# In tensorrt_llm/functional.py:
# Collectives.
def allreduce(tensor: Tensor, group: List[int]) -> Tensor
def allgather(tensor: Tensor, group: List[int]) -> Tensor
# Point-to-point communication primitives.
def send(tensor: Tensor, tgt: int) -> Tensor
def recv(tensor: Tensor, src: int) -> Tensor
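
As a rough illustration of how these collectives are used for tensor parallelism (a sketch, not the actual ColumnLinear/RowLinear implementation): each rank multiplies by its shard of the weight matrix, and the partial results are summed with allreduce over the tensor-parallel group.

# Hypothetical row-parallel matmul; `weight_shard` is this rank's slice of the weights.
from tensorrt_llm.functional import allreduce, matmul

def row_parallel_matmul(x, weight_shard, tp_group):
    partial = matmul(x, weight_shard)            # local partial product on this rank
    return allreduce(partial, group=tp_group)    # sum the partial results across ranks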

In-flight Batching

TensorRT-LLM supports in-flight batching of requests (also known as continuous batching or iteration-level batching) for higher serving throughput.

References

TensorRT-LLM: A TensorRT toolbox for Large Language Models

The TensorRT-LLM Overview

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.

TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines. It also includes a backend for integration with the NVIDIA Triton Inference Server.

Models built with TensorRT-LLM can be executed on a wide range of configurations going from a single GPU to multiple nodes with multiple GPUs (using Tensor Parallelism).

The Python API of TensorRT-LLM is architectured to look similar to the PyTorch API. It provides users with a functional module containing functions like einsum, softmax, matmul or view.

The layer module bundles useful building blocks to assemble LLMs; like an Attention block, a MLP or the entire Transformer layer. Model-specific components, like GPTAttention or BertAttention, can be found in the model module.

TensorRT-LLM provides users with predefined models that can easily be modified and extended. The current version of TensorRT-LLM supports BERT, GPT, NVIDIA GPT-2B, GPT-J, LLaMA, OPT, SantaCoder and StarCoder.

To maximize performance and reduce memory footprint, TensorRT-LLM allows the models to be executed using different quantization modes (see examples/gpt for concrete examples). TensorRT-LLM supports INT4 or INT8 weights (and FP16 activations; a.k.a. INT4/INT8 weight-only) as well as a complete implementation of the SmoothQuant technique.

For a more detailed presentation of the software architecture and the key concepts used in TensorRT-LLM, we recommend reading the following document.

Release notes

Changelog

July 2023

  • TensorRT-LLM requires TensorRT 9.0,

  • Support for BLOOM, ChatGLM 6B, GPT-NeoX, LLaMA v2,

  • Support for BF16 and FP8 models,

  • Support for in-flight batching,

  • Support for a new C++ Triton Backend,

  • Refactoring of the KV cache to support paging,

  • The KV cache is now decomposed into blocks,

  • The layout of the K cache has changed to [batch_size, num_heads, seq_length, dim_per_head],

  • Support for multi-GPU embeddings,

  • Support for embedding sharing (input embedding and LM head),

  • New example that shows how to integrate an OpenAI Triton kernel into TensorRT-LLM,

  • Improved documentation (Docstrings in functional.py and documentation in docs)

June 2023

  • Support Nemo-GPT Next, SantaCoder, StarCoder in FP16,

  • Support for a new C++ Runtime (with streaming support),

  • Support for beam-search,

  • Support for Multiquery Attention (MQA),

  • Support for RoPE,

  • Support for INT8 KV Cache,

  • Support INT4 weight-only (with GPT example), but the weight-only kernels will not be optimal on Hopper

May 2023

  • The initial release of TensorRT-LLM

  • Support GPT, BERT, OPT, LLaMA in FP16,

  • Support single-node multi-GPU GPT, OPT, BERT, LLaMA FP16 using Tensor parallelism,

  • Support Triton Inference Server with a Python backend,

  • Support sampling features, including top-k, top-p, temperature, and sampling penalty,

  • Attention support

  • Optimized Flash-Attention-based Multihead Attention for Ampere, Ada and Hopper architectures,

  • Multi-Query Attention (MQA),

  • ALiBi in Multihead-Attention,

  • Support SmoothQuant INT8 (with GPT example),

  • Support INT8 weight-only (with GPT example), but the weight-only kernels will not be optimal on Hopper