New features in TensorRT include multi-GPU, multi-node inference, performance and hardware optimizations, and more.
TensorRT can be used to run multi-GPU, multi-node inference for large language models (LLMs). It supports GPT-3 175B, 530B, and 6.7B models. These models do not require ONNX conversion; instead, a simple Python API is available to optimize them for multi-GPU inference. This capability is now available in private early access; contact your NVIDIA account team for more details.
TensorRT 8.6 is now available in early access and includes the following key features:
- Performance optimizations for generative AI diffusion and transformer models
- Hardware compatibility, so an engine built on one GPU architecture can run on others (NVIDIA Ampere architecture and later)
- Version compatibility, so an engine built with one TensorRT release can run on later releases (TensorRT 8.6 and later)
- Optimization levels to trade off build time against inference performance
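The three builder features above are exposed through the TensorRT Python builder configuration. The sketch below shows how they might be set when building an engine; it assumes TensorRT 8.6 or later is installed and a network has already been populated (the ONNX file path `model.onnx` is a placeholder). It is a minimal illustration, not a complete build pipeline.

```python
# Sketch: configuring hardware compatibility, version compatibility, and
# optimization level with the TensorRT 8.6+ Python API.
# Assumes a GPU system with TensorRT installed; "model.onnx" is a placeholder path.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)

# Populate the network, e.g. by parsing an ONNX model.
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()

# Hardware compatibility: the engine can run on Ampere and later architectures,
# not only the GPU it was built on.
config.hardware_compatibility_level = trt.HardwareCompatibilityLevel.AMPERE_PLUS

# Version compatibility: the engine can run on later TensorRT versions.
config.set_flag(trt.BuilderFlag.VERSION_COMPATIBLE)

# Optimization level 0-5: lower builds faster, higher searches more tactics
# for better inference performance (default is 3).
config.builder_optimization_level = 3

serialized_engine = builder.build_serialized_network(network, config)
```

Compatible engines may give up some performance relative to engines specialized for a single GPU and TensorRT version, which is why these options are opt-in rather than defaults.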