Notes on triton-inference-server

This page collects technical articles, recent updates, and other information related to triton-inference-server.

Newer Triton releases require a matching GPU driver; check the Frameworks Support Matrix - NVIDIA Docs before upgrading.

Release updates

Official release information:

Key entries from the triton-inference-server changelog (the features added in each release can help you pick the version you need):

  • 23.01 Custom batching strategies.
  • 23.02 Support for ensemble models in Model Analyzer.
  • 23.04 Triton's ragged batching support has been extended to the PyTorch backend.
  • 23.05 Python backend supports Custom Metrics, allowing users to define and report counters and gauges similar to the C API (see the sketch after this list).
  • 23.06 The statistics extension now includes the memory usage of the loaded models. This statistic is currently implemented only for the TensorRT and ONNXRuntime backends (a query example follows this list).
    Added support for batch inputs in ragged batching for the PyTorch backend.
  • 23.08 Python backend supports directly loading and serving PyTorch models with torch.compile().
  • 23.09 TensorRT backend now supports TensorRT version compatibility across models generated with the same major version of TensorRT. Use the --backend-config=tensorrt,version-compatible=true flag to enable this feature.
  • 23.11 The backend API has been enhanced to support rescheduling a request. Currently, only Python backend and Custom C++ backends support request rescheduling.
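
The custom-metrics support listed under 23.05 is exposed through the triton_python_backend_utils module. A minimal sketch of a Python backend model.py that counts processed requests is shown below; the metric name, label, and empty responses are placeholder assumptions rather than anything from the release notes:

    import triton_python_backend_utils as pb_utils

    class TritonPythonModel:
        def initialize(self, args):
            # Create a counter family and one labeled metric instance
            # (kind can also be pb_utils.MetricFamily.GAUGE).
            self.request_family = pb_utils.MetricFamily(
                name="my_model_requests_total",        # placeholder metric name
                description="Requests processed by this model",
                kind=pb_utils.MetricFamily.COUNTER)
            self.request_metric = self.request_family.Metric(
                labels={"model": "my_model"})          # placeholder label

        def execute(self, requests):
            responses = []
            for request in requests:
                # Count every request; a real model would also build its outputs here.
                self.request_metric.increment(1)
                responses.append(pb_utils.InferenceResponse(output_tensors=[]))
            return responses

The reported values show up on Triton's Prometheus metrics endpoint (port 8002 by default) next to the built-in metrics.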

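For the 23.06 per-model memory-usage statistics, the numbers can be read back through the statistics extension. A hedged sketch using the tritonclient HTTP client (the model name is a placeholder):

    import tritonclient.http as httpclient

    # Connect to a locally running Triton server (HTTP port 8000 by default).
    client = httpclient.InferenceServerClient(url="localhost:8000")

    # Returns the statistics-extension payload for the model; on 23.06+ the
    # entry also carries memory usage for TensorRT and ONNXRuntime models.
    stats = client.get_inference_statistics(model_name="my_model")  # placeholder name
    print(stats)
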
Latest technical articles:

pytriton

triton tutorials

https://github.com/triton-inference-server/tutorials

Newer Triton releases also offer an early-access application for the Triton Management Service (TMS-EA):

This release of the management service is a pre-release version under the early-access program. It is considered alpha-quality software and is not recommended for production deployment. For instance, security features such as TLS are not supported at the moment.

New releases happen every month. Currently supported functionalities for the alpha release include:

  • Automates deploying and managing Triton on Kubernetes (k8s) with requested models
  • Avoids unnecessary Triton Inference Server instances by loading models onto already running Triton instances when possible.
  • Enables more efficient GPU utilization by allowing multiple models to share the same Triton instance in a single pod.
  • Unloads models when not in use
  • Groups models from different frameworks together to ensure they coexist efficiently without out-of-memory issues
  • Allows for loading models from multiple sources such as a secure registry, HTTPS, etc.
  • Allows custom resource allocation per model or per set of models
  • REST and JSON gRPC service

Application link: https://developer.nvidia.com/tms-early-access