NVIDIA-ON-DEMAND

This page collects deployment-related GTC sessions, organized by year.

Recordings and slides for GTC 2023 are available after April 10.

TensorRT 8.6: Hardware & Version Compatibility [S51656]

We’ll introduce the latest version of TensorRT, 8.6, and take a deep dive into its two key features: hardware compatibility and version compatibility. These new features allow engines generated on older hardware to keep working on newer hardware, and let users serialize TensorRT engines with one version of TensorRT and deserialize them with a newer version.
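
As a hedged illustration, the two features map to two settings on the builder config in the TensorRT Python API. The sketch below is guarded so it still runs on machines without TensorRT installed; the `AMPERE_PLUS` level is just one example value, not a recommendation.

```python
# Sketch (assumes TensorRT >= 8.6): enable version and hardware
# compatibility when building an engine. The import is guarded so the
# file also runs where TensorRT is not installed.
import importlib.util

def compatibility_flags_available() -> bool:
    """True when a local tensorrt module exposes the 8.6 compatibility flags."""
    if importlib.util.find_spec("tensorrt") is None:
        return False
    import tensorrt as trt
    return hasattr(trt.BuilderFlag, "VERSION_COMPATIBLE")

def enable_compatibility(config) -> None:
    """Set both 8.6 compatibility knobs on an IBuilderConfig (sketch)."""
    import tensorrt as trt
    # Allow a newer TensorRT runtime to deserialize this engine.
    config.set_flag(trt.BuilderFlag.VERSION_COMPATIBLE)
    # Allow an engine built here to run on newer GPU architectures.
    config.hardware_compatibility_level = trt.HardwareCompatibilityLevel.AMPERE_PLUS

if __name__ == "__main__":
    print("TensorRT 8.6 compatibility flags available:", compatibility_flags_available())
```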

An End-to-End Subgraph Optimization Framework Based on TensorRT [S51416]

NVIDIA TensorRT is a high-performance SDK for optimizing the inference performance of general deep learning-based AI models. However, industrial models contain some non-general subgraphs that are hard for TensorRT to optimize directly. To optimize these non-general subgraphs, we’ve designed an end-to-end framework built around an AI compiler. Specifically, the framework automatically analyzes and crops subgraphs with performance bottlenecks out of the ONNX graph, uses an AI compiler to optimize them and generate code that is wrapped in a TensorRT plugin, and finally builds the TensorRT engine from the optimized ONNX graph. Our goal is to help engineers with a background in model inference optimization extend TensorRT and achieve higher inference performance.

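
The crop-and-replace idea can be sketched with a stdlib-only toy (names such as `crop_and_replace` and the `TRT_Plugin` op are made up for illustration; the real framework operates on ONNX graphs and generates actual plugin code):

```python
# Toy sketch: replace each maximal run of "bottleneck" nodes in a node
# list with a single placeholder plugin node, mirroring the
# ONNX-graph -> TensorRT-plugin flow described above.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    op: str

def crop_and_replace(nodes, is_bottleneck, plugin_name="MyPlugin"):
    """Collapse every maximal run of bottleneck nodes into one plugin node."""
    out, run = [], []
    for n in nodes:
        if is_bottleneck(n):
            run.append(n)           # accumulate the subgraph to crop
        else:
            if run:                 # flush the cropped subgraph as a plugin
                out.append(Node(f"{plugin_name}_{len(out)}", "TRT_Plugin"))
                run = []
            out.append(n)
    if run:
        out.append(Node(f"{plugin_name}_{len(out)}", "TRT_Plugin"))
    return out

graph = [Node("a", "Conv"), Node("b", "CustomOp"),
         Node("c", "CustomOp"), Node("d", "Relu")]
new_graph = crop_and_replace(graph, lambda n: n.op == "CustomOp")
print([n.op for n in new_graph])  # -> ['Conv', 'TRT_Plugin', 'Relu']
```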

Debugging CUDA: An Overview of CUDA Correctness Tools [S51772]

Debugging CUDA programs presents unique challenges for the software developer. To address these challenges, NVIDIA offers a wide range of tools to assist in the debugging process. First, we’ll explore the features and capabilities of three CUDA debugging tools (CUDA GDB, Nsight Visual Studio Edition, and Nsight Visual Studio Code Edition) and demonstrate how each can be used to debug CUDA programs. We’ll also demonstrate how Nsight Visual Studio Code Edition can be extended to support CUDA development, using Visual Studio Code extensions to provide AI-assisted authoring of CUDA programs. Then we’ll explore the capabilities of Compute Sanitizer, a functional correctness tool that can be used to identify multiple types of logic errors in CUDA programs.

Robust and Efficient CUDA C++ Concurrency with Stream-Ordered Allocation [S51897]

This session covers methods and techniques for CUDA C++ concurrency with stream-ordered memory allocation, used to improve GPU application performance. It touches on RAPIDS cuDF and the RAPIDS Memory Manager, introduces modern CUDA C++ concurrency and stream-aware CUDA C++ data containers, and explains the key concepts and semantics of stream-ordered execution and memory allocation. It also shares guidelines for designing efficient and safe stream-ordered APIs. Prerequisites: CUDA programming experience and basic familiarity with CUDA streams.

Measure Right! Best Practices when Benchmarking CUDA Applications [S51334]

Measuring performance in a deterministic and reproducible way is difficult. It’s particularly challenging on GPU-accelerated heterogeneous systems in which complex interactions among CPUs, GPUs, the memory subsystem, the OS, and many other factors need to be properly addressed. We’ll explain how to configure a system and “gotchas” to avoid when benchmarking CUDA applications. We’ll cover topics such as power management, system topology, NUMA-awareness, thread affinity, OS thread scheduling, CUDA JIT caches, and more.
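
The host-side basics of "measuring right" (warm up first, take many samples, report a robust statistic rather than a single run) can be sketched with the standard library alone; GPU-specific steps such as locking clocks or synchronizing streams are outside the scope of this toy:

```python
# Minimal benchmarking harness: warmup iterations followed by timed
# samples, reporting the median to resist outliers.
import statistics
import time

def benchmark(fn, warmup=3, iters=20):
    """Return the median wall-clock time of fn() in seconds."""
    for _ in range(warmup):          # warm caches / JIT before measuring
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)

work = lambda: sum(i * i for i in range(10_000))
print(f"median: {benchmark(work):.6f} s")
```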

Exploring Next-Generation Methods for Optimizing PyTorch Models for Inference with Torch-TensorRT [S51714]

The PyTorch optimization and deployment ecosystem on NVIDIA GPUs is constantly evolving. Most recently, a new deployment workflow has matured in PyTorch centered around torch.fx and TorchDynamo. TorchDynamo is the next-generation machine learning compiler in PyTorch. This new method of deploying PyTorch allows for easy and accurate tracing (relative to TorchScript) and modification of the source model completely in Python. Torch-TensorRT is making FX + Dynamo a first-class workflow for users seeking to optimize their PyTorch models with TensorRT. We’ll dive into work we’re doing today toward this goal, showing what we’ve done to support this new stack. Finally, we’ll demonstrate how you can start experimenting with FX, Dynamo, and TensorRT today to get a preview of the direction Torch-TensorRT is headed.
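
As a hedged sketch of that workflow, Torch-TensorRT can be driven through `torch.compile` with a TensorRT backend; the backend name and fallback behavior below are assumptions based on the talk’s description, and the code is guarded so it runs where `torch` or `torch_tensorrt` are absent:

```python
# Guarded sketch: compile a model with the Torch-TensorRT Dynamo backend
# when available, otherwise fall back to the unmodified callable.
import importlib.util

def compile_with_tensorrt(model):
    """Try torch.compile with a TensorRT backend; eager fallback otherwise."""
    if (importlib.util.find_spec("torch") is None
            or importlib.util.find_spec("torch_tensorrt") is None):
        return model                      # no TensorRT stack: run eagerly
    import torch
    import torch_tensorrt  # noqa: F401  (importing registers the backend)
    return torch.compile(model, backend="tensorrt")

double = compile_with_tensorrt(lambda x: x * 2)
print(double(21))
```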

Become Faster in Writing Performant CUDA Kernels using the Source Page in Nsight Compute [S51882]

Optimizing the performance of CUDA kernel code is typically by itself a time-constrained effort. Learn how to make the most of the Source Page in Nsight Compute to quickly pinpoint and resolve bottlenecks in your CUDA kernels. We’ll discuss best practices to efficiently navigate the source views, how to utilize code correlation to understand the behavior of the compiler-generated code, and take a detailed look at the metrics that are available per individual source line.

Magnus Strengert, Software Engineering Manager, NVIDIA

Industry: All Industries

Topic: Accelerated Computing & Dev Tools - Profilers / Debuggers / Code Analysis

Accelerated COVID-19 CT Image Enhancement via Sparse Tensor Cores [PS51124]

This session presents DD-Net, a deep learning model that uses sparsity techniques to enhance CT images from COVID-19 chest scans. The model uses an encoder-decoder architecture and requires many compute hours to train. The authors propose a set of techniques targeting model size and training time, including neuron pruning, structured pruning, and mixed-precision training, to achieve better hardware utilization and lower training cost. With an accuracy loss of no more than 5%, these techniques speed up model training by 1.9x.
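
The structured-pruning step can be illustrated with the 2:4 sparsity pattern that Sparse Tensor Cores accelerate. This is a stdlib toy for intuition only, not the paper’s code:

```python
# 2:4 structured pruning sketch: in every group of 4 weights, zero the
# two with the smallest magnitude so sparse hardware can skip them.
def prune_2_to_4(weights):
    """Return weights with a 2:4 sparsity pattern applied per group of 4."""
    out = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # Indices of the two largest-|w| entries in this group.
        keep = sorted(range(len(group)), key=lambda j: -abs(group[j]))[:2]
        out.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return out

w = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02, 0.3, -0.8]
print(prune_2_to_4(w))  # -> [0.9, 0.0, 0.4, 0.0, -0.7, 0.0, 0.0, -0.8]
```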

CUDA Graphs 101 [S51211]

CUDA Graphs is an asynchronous execution model that shortens the time needed to dispatch work to the GPU and improves GPU runtime efficiency. CUDA Graphs lets users define an entire workflow up front, before submitting the work to the GPU, so the workflow can be optimized even when it cannot be submitted immediately via CUDA streams. This session covers how CUDA Graphs work, how to add graphs to an existing application, and what is new in CUDA 12.0. Prerequisite: familiarity with basic CUDA.
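
The capture-once / replay-many idea behind CUDA Graphs can be sketched in plain Python. This is an analogy only, not the CUDA API: scheduling (here, a topological sort) is paid once at "instantiate" time, and each subsequent "launch" is a cheap replay:

```python
# Toy "graph" of dependent work items: build and schedule once, then
# replay many times without re-planning, analogous to cudaGraphLaunch.
from collections import deque

class Graph:
    """Captured graph: nodes are callables, edges are dependencies."""
    def __init__(self):
        self.nodes, self.deps = {}, {}

    def add(self, name, fn, after=()):
        self.nodes[name] = fn
        self.deps[name] = list(after)

    def instantiate(self):
        # Pay the scheduling cost once: topological sort at build time.
        indeg = {n: len(d) for n, d in self.deps.items()}
        children = {n: [] for n in self.nodes}
        for n, ds in self.deps.items():
            for d in ds:
                children[d].append(n)
        order, ready = [], deque(n for n, d in indeg.items() if d == 0)
        while ready:
            n = ready.popleft()
            order.append(n)
            for c in children[n]:
                indeg[c] -= 1
                if indeg[c] == 0:
                    ready.append(c)
        fns = [self.nodes[n] for n in order]
        def launch():            # replay is a tight loop, no re-planning
            for fn in fns:
                fn()
        return launch

g, log = Graph(), []
g.add("a", lambda: log.append("a"))
g.add("b", lambda: log.append("b"), after=("a",))
launch = g.instantiate()
for _ in range(3):
    launch()
print(log)  # -> ['a', 'b', 'a', 'b', 'a', 'b']
```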