Deployment details related to triton-inference-server

CUDA techniques relevant to models

CUDA lazy loading

I ran a quick test with lazy loading enabled:

export CUDA_MODULE_LOADING=LAZY
root@a484578ca5f8:/workspace# nvidia-smi
Thu Dec 22 12:29:51 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.56.06    Driver Version: 520.56.06    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A4000    Off  | 00000000:01:00.0 Off |                  Off |
| 41%   34C    P8    16W / 140W |    469MiB / 16376MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|

GPU memory usage dropped from 670MiB to 469MiB, and after a few requests it stabilized at 471MiB.
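To make the comparison concrete, here is a minimal Python sketch that pulls the "used" figure out of an nvidia-smi row and computes the saving against the earlier 670MiB reading. The helper names `parse_used_mib` and `reduction` are made up for illustration, and the row is copied from the output above.

```python
import re
from typing import Tuple


def parse_used_mib(row: str) -> int:
    """Extract the used-memory figure from a row like '... 469MiB / 16376MiB ...'."""
    match = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", row)
    if match is None:
        raise ValueError("no 'used / total' memory field found in row")
    return int(match.group(1))


def reduction(before_mib: int, after_mib: int) -> Tuple[int, float]:
    """Return (MiB saved, percentage saved relative to the 'before' value)."""
    saved = before_mib - after_mib
    return saved, 100.0 * saved / before_mib


# Row copied from the nvidia-smi output above (lazy loading enabled).
row = "| 41%   34C    P8    16W / 140W |    469MiB / 16376MiB |      0%      Default |"
after = parse_used_mib(row)

# 670 is the reading without lazy loading, as reported in these notes.
saved, pct = reduction(670, after)
print(f"saved {saved} MiB ({pct:.1f}%)")
```

With the numbers from this test, the saving works out to 201MiB, roughly 30% of the original footprint.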

Notes

  • Lazy Loading is a CUDA Runtime and CUDA Driver feature.
  • Lazy Loading was introduced in CUDA 11.7, and received a significant upgrade in CUDA 11.8.
  • As CUDA Runtime is usually linked statically into programs and libraries, this means that you have to recompile your program with CUDA 11.7+ toolkit and use CUDA 11.7+ libraries.
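The version requirement above can be checked before enabling the feature. The following is a small Python sketch under stated assumptions: `supports_lazy_loading` and `lazy_loading_env` are hypothetical helper names, and the version string is expected to be the toolkit version the program was built with. The "CUDA Version" shown in the nvidia-smi header is the highest version the driver supports, which is not necessarily the toolkit version.

```python
import os
from typing import Dict, Optional, Tuple

# Lazy loading was introduced in CUDA 11.7 (see the notes above).
MIN_LAZY_LOADING_VERSION = (11, 7)


def parse_cuda_version(text: str) -> Tuple[int, int]:
    """Turn a version string such as '11.8' or '11.7.1' into (major, minor)."""
    major, minor = text.strip().split(".")[:2]
    return int(major), int(minor)


def supports_lazy_loading(cuda_version: str) -> bool:
    """True if the given CUDA version is at least 11.7."""
    return parse_cuda_version(cuda_version) >= MIN_LAZY_LOADING_VERSION


def lazy_loading_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with CUDA_MODULE_LOADING=LAZY set."""
    env = dict(os.environ if base is None else base)
    env["CUDA_MODULE_LOADING"] = "LAZY"
    return env


if supports_lazy_loading("11.8"):
    env = lazy_loading_env()
    print(env["CUDA_MODULE_LOADING"])
```

The resulting dictionary could then be passed as the `env` argument of `subprocess.run` when starting the server process, which has the same effect as the `export CUDA_MODULE_LOADING=LAZY` line used in the test above.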