Ragged batching in triton-server

Triton provides a dynamic batching feature that combines multiple requests for the same model execution to achieve higher throughput. By default, requests can only be dynamically batched when the inputs in each request have the same shape. To benefit from dynamic batching when input shapes vary frequently, clients would otherwise have to pad the input tensors in their requests to a common shape.
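
As background, dynamic batching is enabled through the dynamic_batching block in the model configuration (config.pbtxt); a minimal sketch is shown below, where the preferred batch sizes and queue delay are purely illustrative values:

dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}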

Ragged batching is a feature for avoiding this explicit padding: it lets users specify which inputs should not be shape-checked. Users mark such inputs (ragged inputs) by setting the allow_ragged_batch field in the model configuration:

input [
  {
    name: "input0"
    data_type: TYPE_FP32
    dims: [ 16 ]
    allow_ragged_batch: true
  }
]

How ragged inputs are processed in a batch of requests depends on the backend implementation. Backends such as the ONNX Runtime backend, TensorFlow backend, PyTorch backend, and TensorRT backend require models to accept ragged inputs as 1-dimensional tensors; these backends concatenate the request inputs into a single 1-dimensional tensor.
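
On the client side, each request simply sends its own (unpadded) tensor. The following is a minimal sketch using the tritonclient Python package; the model name "ragged_model", the server URL, and the assumption that input0 is declared with a variable dimension (dims: [ -1 ]) and allow_ragged_batch: true are illustrative, not taken from the original configuration above:

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Each request carries a tensor of a different length; no client-side padding.
# The leading dimension of 1 is the per-request batch dimension.
for length in (3, 5, 7):
    data = np.random.rand(1, length).astype(np.float32)
    inp = httpclient.InferInput("input0", list(data.shape), "FP32")
    inp.set_data_from_numpy(data)
    result = client.infer(model_name="ragged_model", inputs=[inp])

If the dynamic batcher groups these requests, the backend sees their elements concatenated into one 1-dimensional tensor (here 3 + 5 + 7 = 15 elements).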

Because the concatenated input does not track the start and end index of each request, the backends often require the model to have additional inputs, called batch inputs, that describe various properties of the batch that was formed.
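
For example, the model configuration can declare a batch input of kind BATCH_ACCUMULATED_ELEMENT_COUNT, which reports the accumulated element count of the ragged input so the model can recover each request's boundaries. A sketch is shown below; the target_name "INDEX" is an arbitrary illustrative name:

batch_input [
  {
    kind: BATCH_ACCUMULATED_ELEMENT_COUNT
    target_name: "INDEX"
    data_type: TYPE_FP32
    source_input: "input0"
  }
]

With this configuration, for a batch formed from two requests carrying 3 and 5 elements, the model would receive "INDEX" as [ 3, 8 ] alongside the concatenated ragged input.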

References