Overview
A collection of useful questions and answers from deployment-related open-source repositories, with the relevant issues and replies attached.
Inference results differ for different batch size
The link points to an issue discussion in the Triton Inference Server project on GitHub: issue #5640, titled "Inference results differ for different batch size", opened by user casperroo on April 14, 2023.
The gist of the question: when running image classification (an efficientnet model) on Triton Inference Server, the same input data produces different results for different batch sizes. The differences are tiny, but the user wants to know whether this is expected behavior or whether some batched data overlaps somewhere in the pipeline.
Project contributors kthui and oandreeva-nv replied that Triton does not change outputs based on batch size; the differences most likely come from the model itself behaving differently for different batch sizes. Another contributor, rmccorm4, confirmed that when executing on a GPU, different batch sizes commonly lead to slightly different results, usually because different CUDA kernels are selected depending on the batch size.
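Different kernels generally mean a different order of floating-point accumulation, and floating-point addition is not associative, so summing the same values in a different order can round differently. The NumPy sketch below is only an illustration of that effect (it has nothing to do with the efficientnet model itself): two plain left-to-right float32 sums over the same values, in different orders, usually disagree in the last few digits.
```
import numpy as np

# Illustration only: the same float32 values summed in two different orders.
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000).astype(np.float32)

def seq_sum(values):
    """Plain left-to-right float32 accumulation."""
    acc = np.float32(0.0)
    for v in values:
        acc = np.float32(acc + v)
    return acc

a = seq_sum(x)        # original order
b = seq_sum(x[::-1])  # same values, reversed order
print(a, b, a == b)   # the two sums usually differ in the last few digits
```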
After learning this, casperroo thanked the contributors and said they had been pointed in the right direction. "Determinism" was the keyword they had been overlooking; TensorFlow offers some remedies, and overall the explanation below sums it up well:
"These differences are often caused by the use of asynchronous threads within the op nondeterministically changing the order in which floating-point numbers are added. Most of these cases of nondeterminism occur on GPUs, which have thousands of hardware threads that are used to run ops."
Once you read that, it is actually quite obvious: the order in which floating-point operations are executed makes a difference.
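The TensorFlow remedy casperroo alludes to is presumably op determinism. A minimal sketch, assuming TensorFlow 2.9 or newer, is shown below. Note that this makes results reproducible from run to run for the same inputs and shapes; it does not promise identical outputs across different batch sizes, since different shapes can still dispatch to different kernels, and deterministic kernels may be slower.
```
import tensorflow as tf

# Fix the seeds and opt in to deterministic op implementations (TF 2.9+).
# Ops without a deterministic GPU implementation raise an error instead of
# silently falling back to a nondeterministic kernel.
tf.keras.utils.set_random_seed(42)
tf.config.experimental.enable_op_determinism()
```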
Issue #5640 · opened 14 Apr 2023 · closed 18 Apr 2023 · label: question
Software version:
- Triton Inference Server 23.03 (build 56086596) - from docker
- image_client from c++ client examples (built on host from 23.03 tag)
In short:
Is it expected behavior of Triton Inference Server (or the underlying backend) to yield different results for different batch sizes on the very same input data?
I am testing image classification with efficientnet.
The differences are minimal, but I wonder whether this is expected or whether some batched data gets overlapped somewhere in the pipeline.
My test:
I start the triton inference server and serve a model (efficientnet) with the following config:
```
name: "mynet"
platform: "tensorflow_savedmodel"
max_batch_size: 20
input [
  {
    name: "image_input"
    data_type: TYPE_FP32
    format: FORMAT_NHWC
    dims: [ 300, 300, 3 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 13 ]
    label_filename: "labels.txt"
  }
]
instance_group [
  {
    count: 4
    kind: KIND_GPU
  }
]
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [
      {
        name : "auto_mixed_precision"
      }
    ]
  }
}
```
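As a side note, the configuration Triton actually loads for the model (including any auto-completed fields) can be inspected from the Python client; a minimal sketch, assuming the tritonclient package is installed and the gRPC endpoint used below:
```
import tritonclient.grpc as grpcclient

# Assumes the server started above is reachable on the default gRPC port.
client = grpcclient.InferenceServerClient(url="127.0.0.1:8001")

# Print the configuration Triton actually loaded for "mynet",
# including any fields it filled in automatically.
print(client.get_model_config("mynet", as_json=True))
```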
I run the client with batch size of 1:
```
./examples/image_client -m mynet -u 127.0.0.1:8001 -i gRPC -c 3 -b 1 /tmp/random
Request 0, batch size 1
Image '/tmp/random/1.png':
0.978062 (11) = l12
0.012779 (8) = l9
0.007350 (12) = l13
Request 1, batch size 1
Image '/tmp/random/2.png':
0.990088 (11) = l12
0.010908 (8) = l9
0.008616 (12) = l13
```
With batch size of 2:
```
./examples/image_client -m mynet -u 127.0.0.1:8001 -i gRPC -c 3 -b 2 /tmp/random
Request 0, batch size 2
Image '/tmp/random/1.png':
0.978104 (11) = l12
0.012779 (8) = l9
0.007350 (12) = l13
Image '/tmp/random/2.png':
0.990011 (11) = l12
0.010951 (8) = l9
0.008583 (12) = l13
```
With batch size of 10:
```
./examples/image_client -m mynet -u 127.0.0.1:8001 -i gRPC -c 3 -b 10 /tmp/random
Request 0, batch size 10
Image '/tmp/random/1.png':
0.977936 (11) = l12
0.012730 (8) = l9
0.007379 (12) = l13
Image '/tmp/random/2.png':
0.990011 (11) = l12
0.010866 (8) = l9
0.008616 (12) = l13
(...)
```
So, depending on the batch size, 1.png yields the following scores for l12:
```
- Batch size 1: 0.978062
- Batch size 2: 0.978104
- Batch size 10: 0.977936
```
I get the same results when I run it from my own client with CUDA shared memory.
Disabling the gpu_execution_accelerators doesn't change the behavior.
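The same comparison can also be scripted with the Triton Python client instead of image_client. A rough sketch, assuming tritonclient is installed, the server is reachable on the gRPC port used above, and the images have already been preprocessed into a float32 NHWC array (the preprocessing must match whatever image_client does for this model); the random array here is only a stand-in for real input data:
```
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="127.0.0.1:8001")

def classify(batch):
    """Run one inference request for a float32 NHWC batch and return the scores."""
    inp = grpcclient.InferInput("image_input", list(batch.shape), "FP32")
    inp.set_data_from_numpy(batch)
    out = grpcclient.InferRequestedOutput("output")
    result = client.infer(model_name="mynet", inputs=[inp], outputs=[out])
    return result.as_numpy("output")

# Stand-in for preprocessed images, shape (N, 300, 300, 3), float32.
images = np.random.rand(10, 300, 300, 3).astype(np.float32)

single = classify(images[:1])   # batch size 1
batched = classify(images)      # batch size 10
# A small nonzero difference is expected on GPU for the reasons discussed above.
print(np.abs(single[0] - batched[0]).max())
```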