Overview of Profiling Methods and Libraries for NVIDIA GPU Programs

imoldpan · April 6, 2023, 16:03

Overview

NVIDIA profiling libraries:

DLProf Release Notes - NVIDIA Docs

Related libraries

The Deep Learning Profiler (DLProf)

NVTX

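NVTX is a C-based API for annotating events, code ranges, and resources in an application. Once NVTX is integrated, tools such as NVIDIA Nsight, Tegra System Profiler, and Visual Profiler can capture and visualize those events and ranges. The NVTX SDK is quick to integrate into an application and adds value to NVIDIA's tools with almost no overhead. NVTX provides two core services: tracing CPU events and time ranges, and naming operating-system and API resources. NVTX v3 is a header-only library; its headers ship with the CUDA Toolkit and the Nsight tools.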

By default, NVTX API calls do nothing. When you launch a program from a developer tool, NVTX calls in that program are redirected to functions in the tool. Developer tools are free to implement NVTX API calls however they wish.

Here are some examples of what a tool might do with NVTX calls:

Print a message to the console
Record a trace of when NVTX calls occur, and display them on a timeline
Build a statistical profile of NVTX calls, or time spent in ranges between calls
Enable/disable tool features in ranges bounded by NVTX calls matching some criteria
Forward the data to other logging APIs or event systems
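
As a rough illustration of the range annotations described above, here is a minimal Python sketch. It assumes PyTorch is the host framework and uses its `torch.cuda.nvtx.range_push` / `range_pop` wrappers; the `nvtx_range` helper and the `preprocess` workload are hypothetical names made up for this example. When PyTorch or CUDA is unavailable, the helper simply does nothing, which mirrors the "NVTX calls do nothing by default" behavior.

```python
import contextlib

try:
    import torch
    _HAS_NVTX = torch.cuda.is_available()
except ImportError:  # PyTorch not installed: ranges become no-ops
    torch = None
    _HAS_NVTX = False


@contextlib.contextmanager
def nvtx_range(name):
    """Hypothetical helper: wrap a code region in an NVTX range if possible."""
    if _HAS_NVTX:
        torch.cuda.nvtx.range_push(name)  # open a named range
    try:
        yield
    finally:
        if _HAS_NVTX:
            torch.cuda.nvtx.range_pop()   # close the most recent range


def preprocess(values):
    # Hypothetical workload standing in for real application code.
    return [v * 2 for v in values]


with nvtx_range("preprocess"):
    result = preprocess([1, 2, 3])
```

When the script is launched from a tool such as Nsight Systems, the "preprocess" range should show up on the timeline; when run directly, the annotations have no visible effect.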

CUPTI

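The NVIDIA® CUDA Profiling Tools Interface (CUPTI) is a dynamic library for building diagnostic tools that target CUDA applications. It exposes several APIs that developers can use to build performance-tuning tools on the target system. With these APIs, a tool can provide low-latency, deterministic profiling and give insight into the CPU and GPU behavior of a CUDA application. CUPTI is part of the CUDA Toolkit; updates and new releases are also published on the NVIDIA website. CUPTI 12.1 supports Linux x86_64 and Windows x86_64, along with a number of NVIDIA GPU architectures.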

Nsight Systems

Some commonly used flags

Profiling methods

for _ in range(NITER):
    trt_model(random_data)
    torch.cuda.synchronize()

def synchronize(device: _device_t = None) -> None:
    r"""Waits for all kernels in all streams on a CUDA device to complete.

    Args:
        device (torch.device or int, optional): device for which to synchronize.
            It uses the current device, given by :func:`~torch.cuda.current_device`,
            if :attr:`device` is ``None`` (default).
    """
    _lazy_init()
    with torch.cuda.device(device):
        return torch._C._cuda_synchronize()
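
The loop above can be wrapped into a small timing helper. The following is only a sketch under stated assumptions: `benchmark`, its `sync` callback, and the `warmup` / `iters` parameters are hypothetical names, and the workload timed here is a CPU stand-in so the snippet runs without a GPU.

```python
import time


def benchmark(fn, sync=lambda: None, warmup=3, iters=10):
    """Return the mean wall-clock latency of fn() in milliseconds.

    sync is called after every launch to wait for asynchronous work;
    for GPU code it would be something like torch.cuda.synchronize.
    """
    # Warm-up runs are excluded from the measurement (lazy init, caches).
    for _ in range(warmup):
        fn()
        sync()

    start = time.perf_counter()
    for _ in range(iters):
        fn()
        sync()  # without this, only the launch cost would be measured
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / iters


# CPU stand-in workload so the sketch runs anywhere.
mean_ms = benchmark(lambda: sum(range(1000)))
print(f"mean latency: {mean_ms:.3f} ms")
```

For the TensorRT model in the loop above, the call would look roughly like `benchmark(lambda: trt_model(random_data), sync=torch.cuda.synchronize)`.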

Testing in C++

#include <chrono>

auto startTime = std::chrono::high_resolution_clock::now();
context->enqueueV2(&buffers[0], stream, nullptr);
cudaStreamSynchronize(stream);
auto endTime = std::chrono::high_resolution_clock::now();
float totalTime = std::chrono::duration<float, std::milli>(endTime - startTime).count();

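Two points to note:

This is a very simple way to measure. Because CUDA calls are asynchronous, the cudaStreamSynchronize call is required before taking the end timestamp.
For a more accurate number, however, try to keep the host-device synchronization time out of the measurement as much as possible. In an environment where several models run at once, the synchronization will affect the accuracy of the result to some degree.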

For a more accurate measurement:

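cudaEvent_t start, end;
cudaEventCreate(&start);
cudaEventCreate(&end);
cudaEventRecord(start, stream);                    // start marker, enqueued on the same stream
context->enqueueV2(&buffers[0], stream, nullptr);
cudaEventRecord(end, stream);                      // end marker, enqueued after the inference work
cudaEventSynchronize(end);                         // block until the end event has completed
float totalTime;
cudaEventElapsedTime(&totalTime, start, end);      // elapsed time in milliseconds
cudaEventDestroy(start);                           // release the events once the time is read
cudaEventDestroy(end);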
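
The same event-based timing can be sketched from Python, assuming PyTorch with its `torch.cuda.Event` wrapper around CUDA events. The `time_with_cuda_events` helper and the matrix-multiply `workload` are hypothetical names for illustration; on a machine without PyTorch or CUDA the helper returns None instead of a time.

```python
try:
    import torch
    _HAS_CUDA = torch.cuda.is_available()
except ImportError:  # PyTorch not installed
    torch = None
    _HAS_CUDA = False


def time_with_cuda_events(fn):
    """Return the GPU time of fn() in milliseconds, or None without CUDA."""
    if not _HAS_CUDA:
        return None
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()     # start marker on the current stream
    fn()               # asynchronous GPU work to be timed
    end.record()       # end marker on the same stream
    end.synchronize()  # wait only until the end event has completed
    return start.elapsed_time(end)


def workload():
    # Hypothetical GPU workload; only called when CUDA is available.
    x = torch.randn(1024, 1024, device="cuda")
    return x @ x


elapsed_ms = time_with_cuda_events(workload)
print("GPU time (ms):", elapsed_ms)
```

Because both events are recorded on the stream itself, the measured interval covers the GPU-side work between them rather than the host-side launch and wait.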