Improving GPU Performance by Reducing Instruction Cache Misses

GPUs are specially designed to crunch through massive amounts of data at high speed. They have a large amount of compute resources, called streaming multiprocessors (SMs), and an array of facilities to keep them fed with data: high bandwidth to memory, sizable data caches, and the capability to switch to other teams of workers (warps) without any overhead if an active team has run out of data.

Yet data starvation may still occur, and much of code optimization focuses on that issue. In some cases, SMs are starved not for data, but for instructions. This post presents an investigation of a GPU workload that experiences a slowdown due to instruction cache misses. It describes how to identify this bottleneck, as well as techniques for removing it to improve performance.

Recognizing the problem

The origin of this investigation is an application from the domain of genomics in which many small, independent problems related to aligning small sections of a DNA sample with a reference genome must be solved. The background is the well-known Smith-Waterman algorithm (but that by itself is not important for the discussion).

Running the program on a medium-sized dataset on a powerful NVIDIA H100 Hopper GPU, with 114 SMs, showed good promise. The NVIDIA Nsight Compute tool, which analyzes a program’s execution on the GPU, confirmed that the SMs were quite busy with useful computations, but there was a snag.

So many of the small problems composing the overall workload—each handled by its own thread—could be run on the GPU simultaneously that not all compute resources were fully used all the time. This is expressed as a small and non-integral number of waves.

Work for the GPU is divided into chunks called thread blocks, and one or more can reside on an SM. If some SMs receive fewer thread blocks than others, they will run out of work and must idle while the other SMs continue working.

Filling up all SMs completely with thread blocks constitutes one wave. NVIDIA Nsight Compute dutifully reports the number of waves per SM. If that number happens to be 100.5, it means that not all SMs have the same amount of work to do and that some are forced to idle. However, the impact of the uneven distribution is not substantial, because it is confined to the final half wave out of 100.5.

Most of the time, the load on the SMs is balanced. That situation changes if the number of waves is just 0.5, for example. For a much larger percentage of the time, SMs experience an uneven work distribution, which is called the tail effect.
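
To make the wave count concrete, the following sketch computes waves per SM from a kernel launch's total block count, the SM count, and the number of thread blocks that can reside on an SM at once. All of the numbers are invented for illustration and are not taken from the genomics application.

#include <cstdio>

int main() {
    // Hypothetical launch: none of these numbers come from the genomics workload.
    const int num_sms        = 114;  // SM count of an NVIDIA H100 GPU
    const int blocks_per_sm  = 4;    // resident thread blocks per SM (assumed occupancy)
    const int blocks_in_grid = 730;  // total thread blocks launched (assumed)

    // One wave fills every SM with its full complement of thread blocks.
    const double waves = (double)blocks_in_grid / (num_sms * blocks_per_sm);
    printf("waves per SM: %.2f\n", waves);  // ~1.6: the final partial wave leaves many SMs idle
    return 0;
}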

Addressing the tail effect

This phenomenon is exactly what materialized with the genomics workload. The number of waves was just 1.6. The obvious solution is to give the GPU more work to do (more threads, leading to more warps of 32 threads each), which is usually not a problem.

The original workload was relatively modest, and in a practical environment, larger problems must be completed. However, when the original workload was increased by doubling, tripling, and quadrupling the number of subproblems, performance deteriorated rather than improved. What could cause this outcome?

The combined NVIDIA Nsight Compute report of those four workload sizes sheds light on the situation. In the section called Warp State, which lists the reasons threads cannot make progress, the value for No Instruction stands out for increasing significantly with workload size (Figure 1).

Figure 1. Warp stall reasons for four workload sizes from the combined NVIDIA Nsight Compute report

No Instruction means the SMs could not be fed instructions fast enough from memory. Long Scoreboard indicates that the SMs could not be fed data fast enough from memory. Fetching instructions on time is so critical that the GPU provides a number of stations where instructions can be placed after they have been fetched to keep them close to the SMs. Those stations are called instruction caches, and there are even more levels of them than there are of data caches.

The fact that instruction cache misses evidently increase so quickly with workload size, as shown by the rapid growth of warp stalls due to No Instruction, implies the following:

  • Not all instructions in the busiest part of the code fit in that cache.
  • The need for more different instructions increases as the workload size increases.

The reason for the latter is somewhat subtle. Multiple thread blocks, composed of warps, reside on the SM simultaneously, but not all warps execute simultaneously. The SM is internally divided into four partitions, each of which can generally execute one warp instruction per clock cycle.

When a warp is stalled for any reason, another warp that also resides on the SM can take over. Each warp can execute its own instruction stream independently from other warps.

At the start of the main kernel in this program, warps running on each SM are mostly in sync. They start with the first instruction and keep chugging along. However, they are not explicitly synchronized.

As time goes on and warps take turns idling and executing, they drift further and further apart in terms of the instructions that they execute. This means that a growing set of different instructions must be active as the execution progresses, and this, in turn, means that the instruction cache overflows more frequently. Instruction cache pressure builds, and more misses occur.

Solving the problem

The gradual drifting apart of the warp instruction streams cannot be controlled, except by synchronizing the streams. But synchronization typically reduces performance, because it requires warps to wait for each other when there is no fundamental need.
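
For illustration only, the sketch below shows what synchronizing the streams could look like: a __syncthreads() barrier in the outer loop keeps the warps of a thread block (not of the whole SM) in step, but every warp must wait at the barrier for the slowest one, which is exactly the overhead described above. This is not the approach taken in this post, and the kernel body is a placeholder.

#include <cuda_runtime.h>

// Hypothetical sketch: re-synchronize the warps of a thread block once per
// outer-loop iteration. All threads of the block must reach __syncthreads()
// before any may continue, so faster warps idle while waiting.
__global__ void lockstep_outer_loop(float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;

    for (int i = 0; i < 32; ++i) {       // outer loop (illustrative bound)
        if (tid < n)
            acc += (float)(tid + i);     // placeholder work
        __syncthreads();                 // reached by all threads, not just tid < n
    }
    if (tid < n)
        out[tid] = acc;
}

int main()
{
    const int n = 1024;
    float *out;
    cudaMalloc(&out, n * sizeof(float));
    lockstep_outer_loop<<<4, 256>>>(out, n);
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}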

However, you can attempt to decrease the overall instruction footprint so that spilling from the instruction cache occurs less frequently, and perhaps not at all.

The code in question contains a collection of nested loops, and most of the loops are unrolled. Unrolling improves performance by enabling the compiler to do the following:

  • Reorder (independent) instructions for better scheduling.
  • Eliminate some instructions that can be shared by successive iterations of the loop.
  • Reduce branching.
  • Allocate different registers to the same variable referenced in different iterations of the loop, to avoid having to wait for specific registers to become available.

Unrolling loops brings many benefits, but it does increase the number of instructions. It also tends to increase the number of registers used, which may depress performance, because fewer warps may reside on the SMs simultaneously. This reduced warp occupancy comes with less latency hiding.
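
To make these effects concrete, here is a hand-unrolled version of a simple reduction loop, written only to illustrate what the compiler effectively produces when asked to unroll by a factor of 2; it is not code from the application. Note the halved branch count and the two independent accumulators that can live in separate registers and be scheduled together.

#include <cstdio>

// Original loop: one branch and one counter update per element.
float sum_of_squares(const float *a, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += a[i] * a[i];
    return s;
}

// Hand-unrolled by 2 (n assumed even for brevity): half the branches, and the
// two independent multiply-adds use separate accumulators (registers), so the
// scheduler can overlap them instead of serializing on a single register.
float sum_of_squares_unrolled2(const float *a, int n)
{
    float s0 = 0.0f, s1 = 0.0f;
    for (int i = 0; i < n; i += 2) {
        s0 += a[i]     * a[i];
        s1 += a[i + 1] * a[i + 1];
    }
    return s0 + s1;
}

int main()
{
    const float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    printf("%f %f\n", sum_of_squares(a, 8), sum_of_squares_unrolled2(a, 8));
    return 0;
}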

The two outermost loops of the kernel are the focus. Practical unrolling is best left to the compiler, which has myriad heuristics for generating good code. Instead of unrolling by hand, the user expresses that unrolling is expected to be beneficial by using hints, called pragmas in C/C++, placed just before the top of the loop.

These take the following form:

#pragma unroll X

Where X is blank (the canonical unroll), the compiler is only told that unrolling may be beneficial, but it is not given any suggestion of how many iterations to unroll.

For convenience, we’ve adopted the following notation for unroll factors:

  • 0 = No unroll pragma at all.
  • 1 = An unroll pragma without any number (canonical).
  • n larger than 1 = A positive number that suggests unrolling in groups of n iterations, expressed as follows:

#pragma unroll (n)
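
In source code, each pragma sits directly above the loop it applies to. The kernel below is a minimal, hypothetical sketch showing where the hints for the top-level and second-level loops go; the loop bounds and body are invented and bear no relation to the actual Smith-Waterman kernel.

#include <cuda_runtime.h>

// Hypothetical kernel: only the placement of the unroll pragmas matters here.
__global__ void nested_loops(float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    float acc = 0.0f;
    #pragma unroll                        // factor 1 in the notation above (canonical)
    for (int i = 0; i < 8; ++i) {         // top-level loop
        #pragma unroll 2                  // factor 2: unroll in groups of 2 iterations
        for (int j = 0; j < 16; ++j) {    // second-level loop
            acc += (float)(i * j + tid);  // placeholder work
        }
    }
    out[tid] = acc;
}

int main()
{
    const int n = 1024;
    float *out;
    cudaMalloc(&out, n * sizeof(float));
    nested_loops<<<(n + 255) / 256, 256>>>(out, n);
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}

Varying only these two hints is what the following experiments do; as noted later, the compiler remains free to ignore them.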

The next experiment comprises a suite of runs in which the unroll factor varies between 0 and 4 for each of the two outermost loops in the code, producing a performance figure for each of the four workload sizes. Unrolling by more is not needed, because experiments show that the compiler does not generate different code for higher unroll factors for this particular program. Figure 2 shows the outcome of the suite.

Figure 2. Performance of Smith-Waterman code for different workload sizes and different loop unroll factors

The top horizontal axis shows unroll factors for the outermost loop (top-level). The bottom horizontal axis shows unroll factors for the second-level loop. Each point on any of the four performance curves (higher is better) corresponds to two unroll factors, one for each of the outermost loops as indicated on the horizontal axes.

Figure 2 also shows, for each instance of unroll factors, the size of the executable in units of 500 KB. While the expectation might be to see increasing executable sizes with each higher level of unrolling, that is not consistently the case. Unroll pragmas are hints that may be ignored by the compiler if they are not deemed beneficial.

Measurements corresponding to the initial version of the code (indicated by the ellipse labeled A) are for the canonical unrolling of the top-level loop, and no unrolling of the second-level loop. The anomalous behavior of the code is apparent, where larger workload sizes lead to poorer performance, due to increasing instruction cache misses.

In the next isolated experiment (indicated by the ellipse labeled B), attempted before the full suite of runs, neither of the outermost loops is unrolled. Now the anomalous behavior is gone and larger workload sizes lead to the expected better performance.

However, absolute performance is reduced, especially for the original workload size. Two phenomena, revealed by NVIDIA Nsight Compute, help explain this result. Due to the smaller instruction memory footprint, instruction cache misses have reduced for all sizes of the workload, which can be deduced from the fact that the No Instruction warp stalls (not depicted) have dropped to virtually negligible values. However, the compiler assigned a relatively large number of registers to each thread, such that the number of warps that can reside on the SM is not optimal.
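
As a rough, hypothetical illustration of that occupancy effect (the register counts below are invented, not measured from this kernel), the following sketch shows how per-thread register use caps the number of warps that can reside on an H100 SM, which has 64K 32-bit registers per SM. Register allocation granularity and other occupancy limits are ignored.

#include <cstdio>

int main() {
    const int regs_per_sm   = 65536;           // 32-bit registers per H100 SM
    const int reg_counts[]  = {64, 128, 255};  // assumed registers per thread

    for (int regs_per_thread : reg_counts) {
        int max_threads = regs_per_sm / regs_per_thread;
        int max_warps   = max_threads / 32;    // 32 threads per warp
        printf("%3d regs/thread -> at most %2d resident warps per SM\n",
               regs_per_thread, max_warps);
    }
    return 0;
}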

Doing the full sweep of unroll factors suggests that the experiment in the ellipse labeled C is the proverbial sweet spot. It corresponds to no unrolling of the top-level loop, and unrolling by a factor of 2 of the second-level loop. NVIDIA Nsight Compute still shows negligible values for No Instruction warp stalls (Figure 3) and a reduced number of registers per thread, such that more warps can fit on the SM than in experiment B, leading to more latency hiding.

Figure 3. Warp stall reasons for four workload sizes from the combined NVIDIA Nsight Compute report for optimal unroll factors

While the absolute performance of the smallest workload still lags behind that of experiment A, the difference is small, and larger workloads fare increasingly better, leading to the best average performance across all workload sizes.

Further inspection of the NVIDIA Nsight Compute reports for the three different unrolling scenarios (A, B, and C) elucidates the performance results.

Total instruction memory footprint sizes, as shown by the dashed line in Figure 2, are not accurate measures of instruction cache pressure, because they may include code sections that are only executed a small number of times. It is better to study the aggregate sizes of the “hottest” parts of the code, which can be identified by looking for the maximum values of the Instructions Executed metric in the source view of NVIDIA Nsight Compute.

For scenarios A, B, and C these sizes are 39360, 15680, and 16912, respectively. Clearly, scenarios B and C have much reduced hot instruction memory footprints compared to scenario A, leading to less instruction cache pressure.

Conclusion

Instruction cache misses can cause performance degradation for kernels that have a large instruction footprint, which is often caused by substantial loop unrolling. When the compiler is put in charge of unrolling through pragmas, the heuristics it applies to the code to determine the best actual level of unrolling are necessarily complicated and are not always predictable by the programmer.

It may pay to experiment with different compiler hints regarding loop unrolling to arrive at the optimal code with good warp occupancy and reduced instruction cache misses.

Get started with Nsight Compute today. For more information and tutorials, see Nsight Developer Tools Tutorials.
