New features in TensorRT include multi-GPU multi-node inference, performance and hardware optimizations, and more.

Multi-GPU multi-node inference

TensorRT can be used to run multi-GPU multi-node inference for large language models (LLMs). It supports GPT-3 variants with 6.7B, 175B, and 530B parameters. These models do not require ONNX conversion; instead, a simple Python API is available to optimize them for multi-GPU inference. This capability is now available in private early access; contact your NVIDIA account team for more details.

TensorRT 8.6

TensorRT 8.6 is now available in early access and includes the following key features:

  • Performance optimizations for generative AI diffusion and transformer models
  • Hardware compatibility to build and run on different GPU architectures (NVIDIA Ampere architecture and later)
  • Version compatibility to build and run on different TensorRT versions (TensorRT 8.6 and later)
  • Optimization levels to trade between build time and inference performance
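The last three items above are all controlled through the builder configuration. The following is a minimal sketch of how they might be set together with the TensorRT 8.6 Python API; it assumes an NVIDIA GPU and a populated network (e.g. parsed from ONNX), which are elided here, so it is a configuration outline rather than a complete build pipeline.

```python
import tensorrt as trt

# Requires TensorRT 8.6+ and an NVIDIA GPU. The network body is
# omitted; in practice it would be populated from an ONNX model.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
# ... populate the network here ...

config = builder.create_builder_config()

# Hardware compatibility: build one engine that runs on NVIDIA Ampere
# and all later GPU architectures.
config.hardware_compatibility_level = (
    trt.HardwareCompatibilityLevel.AMPERE_PLUS
)

# Version compatibility: the serialized engine stays loadable by
# later TensorRT releases (8.6 onward).
config.set_flag(trt.BuilderFlag.VERSION_COMPATIBLE)

# Optimization level 0-5: lower values build faster, higher values
# spend more build time searching for faster kernels.
config.builder_optimization_level = 5

serialized_engine = builder.build_serialized_network(network, config)
```

Note that both compatibility modes can cost some inference performance relative to an engine built for a single GPU architecture and TensorRT version, which is why they are opt-in flags rather than defaults.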