Quantization in Large Language Models

Quantization Granularities

  • per-tensor: one scale for the whole matrix
  • per-token: one scale per row of the activations
  • per-channel: one scale per column (channel)
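The three granularities differ only in which slice of the matrix shares a scale. A minimal sketch in pure Python (toy values; the helper names `scale_for` and `quantize` are assumptions, not a library API) using symmetric INT8 quantization, where scale = max|x| / 127:

```python
def scale_for(values):
    """Symmetric INT8 scale for a flat list of floats: max|x| / 127."""
    m = max(abs(v) for v in values)
    return m / 127 if m else 1.0

def quantize(x, scale):
    """Map floats to INT8 codes by rounding x / scale."""
    return [round(v / scale) for v in x]

# Activations X: 2 tokens x 3 channels (toy values).
X = [[0.5, -1.0, 2.0],
     [4.0,  0.1, -0.2]]

# per-tensor: a single scale for every element
s_tensor = scale_for([v for row in X for v in row])

# per-token: one scale per row (token dimension)
s_token = [scale_for(row) for row in X]

# per-channel: one scale per column (channel dimension)
s_channel = [scale_for(col) for col in zip(*X)]
```

Finer granularity (per-token, per-channel) tracks outliers better and lowers quantization error, at the cost of storing and applying more scales.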

In large language models, activation outliers concentrate in a small number of channels, which makes per-channel quantization of activations the most accurate choice on paper.

However, per-channel activation quantization does not map well to hardware-accelerated GEMM kernels, which rely on a sequence of operations executed at high throughput (e.g., Tensor Core MMAs) and do not tolerate the insertion of lower-throughput instructions (e.g., conversions or CUDA Core FMAs) into that sequence. In those kernels, scaling can only be performed along the outer dimensions of the matrix multiplication (i.e., the token dimension of activations T and the output-channel dimension of weights Co), and such scaling can be applied after the matrix multiplication finishes:
$$Y = \operatorname{diag}\left(\Delta_X^{\mathrm{FP16}}\right) \cdot \left(\bar{X}^{\mathrm{INT8}} \cdot \bar{W}^{\mathrm{INT8}}\right) \cdot \operatorname{diag}\left(\Delta_W^{\mathrm{FP16}}\right)$$
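The equation works because per-token and per-output-channel scales are constant along the reduction (inner) dimension, so they factor out of each dot product and can be folded in after the INT8 GEMM as diagonal rescalings. A toy pure-Python sketch verifying this equivalence (variable names `Xq`, `Wq`, `dX`, `dW` are assumptions chosen to mirror the symbols above):

```python
def matmul(A, B):
    """Plain list-of-lists matrix multiply."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Quantized INT8 operands (toy values) and their FP16 scales.
Xq = [[10, -20], [5, 7]]   # activations, T x Ci
Wq = [[3, 1], [-2, 4]]     # weights, Ci x Co
dX = [0.05, 0.1]           # one scale per token (row of Xq)
dW = [0.02, 0.01]          # one scale per output channel (column of Wq)

# Kernel-friendly order: INT8 GEMM first, then scale row t by dX[t]
# and column c by dW[c] -- i.e. Y = diag(dX) . (Xq Wq) . diag(dW).
P = matmul(Xq, Wq)
Y = [[dX[t] * P[t][c] * dW[c] for c in range(len(dW))]
     for t in range(len(dX))]

# Reference: dequantize both operands first, then FP matmul.
X = [[dX[t] * v for v in row] for t, row in enumerate(Xq)]
W = [[v * dW[c] for c, v in enumerate(row)] for row in Wq]
Y_ref = matmul(X, W)
```

Per-channel *activation* scales (one per column of Xq, i.e., along the inner dimension Ci) cannot be pulled out of the dot product this way, which is exactly why they would force extra instructions inside the GEMM main loop.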

References