TensorRT-LLM Inference Details

Increasing the batch size does not, by itself, remove the memory-bound nature of decode.

When the batch size is increased during decode, the picture is as follows: decode consists of three classes of operators, gemm, norm, and attention.

The gemm operators gradually become compute-bound as the batch grows, because the weight matrices are loaded once and reused across the whole batch, so their arithmetic intensity rises with batch size.

The other two do not: norm and decode attention read per-token activations and a per-request KV cache, so their memory traffic grows in proportion to the batch and they stay memory-bound.
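A rough back-of-the-envelope sketch of this effect (the hidden size, KV length, and FP16 byte counts below are assumed numbers for illustration, not from the original notes):

```python
# Rough arithmetic-intensity estimate for a single decode step.
# All sizes are assumptions chosen for illustration.

d = 4096          # hidden size (assumed)
bytes_per = 2     # FP16
seq_len = 1024    # cached KV length per request (assumed)

def gemm_intensity(batch):
    """FLOPs per byte for a (batch x d) @ (d x d) projection GEMM."""
    flops = 2 * batch * d * d
    # The weight matrix is read once and shared by the whole batch;
    # only the activations scale with batch.
    bytes_moved = (d * d + 2 * batch * d) * bytes_per
    return flops / bytes_moved

def attention_intensity(batch):
    """FLOPs per byte for decode attention: every request reads its own KV cache."""
    flops = 2 * batch * 2 * seq_len * d                 # QK^T and PV per request
    bytes_moved = batch * 2 * seq_len * d * bytes_per   # KV cache is per-request
    return flops / bytes_moved

for b in (1, 8, 64, 256):
    print(f"batch={b:4d}  gemm={gemm_intensity(b):8.1f} FLOP/B  "
          f"attention={attention_intensity(b):.1f} FLOP/B")
```

Running this shows gemm intensity climbing from roughly 1 FLOP/B at batch 1 to a couple hundred at batch 256, while decode attention stays flat at about 1 FLOP/B regardless of batch.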

Hey, that value is not implemented today in the code, and is hard coded to 0. This does not mean IFB is not active. We’ll try to
I would suggest removing dynamic batching and preferred_batch_size from the triton config. If you’d like, you can inspect per-iteration statistics (https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#triton-metrics is probably easiest), which will tell you how many prompt and generation requests are in each iteration. Having > 0 of both in any iteration is conclusive evidence of IFB working.
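To check per-iteration statistics as the quoted reply suggests, one option is to scrape Triton's Prometheus metrics endpoint. A minimal sketch, assuming the default metrics port 8002; the exact metric names vary by tensorrtllm_backend version, so the substring filters below are only illustrative:

```python
# Sketch: scrape Triton's metrics endpoint and keep lines that look like
# in-flight-batcher / per-iteration statistics. The URL and the substrings
# are assumptions; check your own server's metrics output for exact names.
import urllib.request

METRICS_URL = "http://localhost:8002/metrics"  # default Triton metrics port

with urllib.request.urlopen(METRICS_URL) as resp:
    text = resp.read().decode()

for line in text.splitlines():
    if "inflight" in line.lower() or "request_type" in line.lower():
        print(line)
```

If any single iteration reports both prompt (context) and generation requests at the same time, that is the conclusive evidence of IFB (in-flight batching) working that the reply describes.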

vLLM

vLLM's decode path uses a set of pre-defined buckets (padded batch sizes) so that CUDA graphs can be captured once and replayed, as sketched below.
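A minimal sketch of the bucketing idea, not vLLM's actual implementation: decode batches are padded up to the nearest batch size for which a CUDA graph was captured, so a fixed-shape graph can be replayed. The list of captured sizes below is hypothetical and version-dependent.

```python
import bisect
from typing import Optional

# Hypothetical ladder of batch sizes for which CUDA graphs were captured at
# startup (vLLM captures a similar ladder; exact values depend on the version).
CAPTURED_BATCH_SIZES = [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64]

def pick_graph_bucket(batch_size: int) -> Optional[int]:
    """Return the smallest captured batch size >= batch_size, or None to fall
    back to eager execution when the batch exceeds every captured graph."""
    i = bisect.bisect_left(CAPTURED_BATCH_SIZES, batch_size)
    return CAPTURED_BATCH_SIZES[i] if i < len(CAPTURED_BATCH_SIZES) else None

# Example: a decode step with 13 live sequences is padded up to the 16-sequence
# graph; the unused slots run on padding whose outputs are discarded.
print(pick_graph_bucket(13))   # -> 16
print(pick_graph_bucket(100))  # -> None (run eagerly)
```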