A Guide to Using NVTX

This is about performance optimization, mainly for monitoring model performance.

Python side

The nvtx package for Python provides bindings to the NVIDIA Tools Extension (NVTX) and is primarily used for annotating code when profiling GPU-accelerated applications. The annotations appear as named ranges and markers on the timeline in NVIDIA Nsight Systems or other NVTX-aware profilers, which makes it easier to see which sections of your code are the performance bottlenecks.

Basic Usage

  1. Installation:
    Install the NVTX library using pip:

    pip install nvtx
    
  2. Annotating Code:
    You can use the annotate() function in two primary ways:

    • As a decorator:

      import nvtx

      @nvtx.annotate(message="my_function", color="blue")
      def my_function():
          pass  # Your code here
      
    • As a context manager:

      with nvtx.annotate(message="my_loop", color="green"):
          for i in range(10):
              pass  # Your code here
      
  3. Markers:
    To mark an instantaneous event, you can use the mark() function. This is useful for highlighting specific events in your execution flow:

    nvtx.mark(message="start_event", color="red")
    
  4. Using Ranges:
    For more complex cases, such as annotating across multiple functions or asynchronous code, you can use start_range() and end_range():

    rng = nvtx.start_range(message="start_range", color="blue")
    # Your code here
    nvtx.end_range(rng)
    
  5. Domains and Categories:
    You can group annotations using domains and categories to organize and filter your profiling data. A domain scopes annotations to one part of your codebase (for example, a single library), while a category gives a finer classification within that domain. Both can be passed to annotate() and mark() through the domain and category arguments.

  6. Visualization:
    Once you’ve annotated your code, run it with a tool like NVIDIA Nsight Systems for profiling:

    nsys profile python your_script.py
    

This generates a detailed timeline of the program’s execution, including the annotated sections. You can visualize this using the Nsight Systems GUI to identify performance hotspots and optimize your code accordingly.

These tools are helpful in identifying memory allocation times, GPU kernel execution times, and general bottlenecks in GPU-accelerated tasks.
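For the "Domains and Categories" item above, here is a minimal sketch of what grouping annotations might look like. The domain name "my_app", the category names "data" and "compute", and the helpers annotated(), load_batch() and compute() are all hypothetical and only for illustration. The fallback to a no-op context manager is my own addition so the script still runs on a machine without the nvtx package; it is not something NVTX requires.

```python
from contextlib import nullcontext

try:
    import nvtx  # pip install nvtx
except ImportError:
    nvtx = None  # assumption: degrade to no-op ranges when nvtx is not installed


def annotated(message, category):
    """Return an NVTX range in the hypothetical "my_app" domain, or a no-op."""
    if nvtx is None:
        return nullcontext()
    # domain and category are keyword arguments of nvtx.annotate()
    return nvtx.annotate(message=message, color="blue",
                         domain="my_app", category=category)


def load_batch(n):
    # Everything inside this block shows up under category "data"
    with annotated("load_batch", category="data"):
        return list(range(n))


def compute(batch):
    # Everything inside this block shows up under category "compute"
    with annotated("compute", category="compute"):
        return sum(x * x for x in batch)


result = compute(load_batch(4))
print(result)
```

When the script is run under nsys profile, the two ranges should be grouped under the "my_app" domain in the timeline, which keeps them apart from annotations emitted by other libraries.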
