The Thrust Library in CUDA

Thrust is a CUDA C++ template library modeled on the Standard Template Library (STL). Through a high-level interface that is fully interoperable with CUDA C, Thrust lets you implement high-performance parallel applications with minimal programming effort.

Thrust provides a rich collection of data-parallel primitives such as scan, sort, and reduction, which can be composed together to implement complex algorithms with concise, readable source code. By describing your computation in terms of these high-level abstractions, you give Thrust the freedom to select the most efficient implementation automatically. As a result, Thrust can be used both for rapid prototyping of CUDA applications, where programmer productivity matters most, and in production, where robustness and absolute performance are crucial.
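For example, the following program generates random integers on the host, copies them to the GPU, then sorts and sums them with thrust::sort and thrust::reduce. This is a minimal sketch (the data size is arbitrary):

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

#include <cstdlib>
#include <iostream>

int main()
{
  // generate 1M random numbers on the host
  thrust::host_vector<int> h_vec(1 << 20);
  thrust::generate(h_vec.begin(), h_vec.end(), [] { return std::rand(); });

  // transfer the data to the device
  thrust::device_vector<int> d_vec = h_vec;

  // sort the data on the device and compute its sum
  thrust::sort(d_vec.begin(), d_vec.end());
  long long sum = thrust::reduce(d_vec.begin(), d_vec.end(), 0LL);

  std::cout << "sum is " << sum << std::endl;
  return 0;
}

Because the iterators come from a device_vector, thrust::sort and thrust::reduce dispatch to the GPU automatically; the same calls on host_vector iterators would run on the CPU.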

This document describes how to develop CUDA applications with Thrust. The tutorial is intended to be accessible even if you have limited C++ or CUDA experience.

Installation

Installing the CUDA Toolkit copies the Thrust header files into your system's standard CUDA include directory. Because Thrust is a header-only template library, no further installation is required before you can start using it.
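Since Thrust ships with the toolkit as headers only, one quick way to confirm it is available is to compile a small program (e.g. with nvcc) that prints the version macros from thrust/version.h. A minimal sketch:

#include <thrust/version.h>

#include <iostream>

int main()
{
  // version macros provided by thrust/version.h
  std::cout << "Thrust v"
            << THRUST_MAJOR_VERSION << "."
            << THRUST_MINOR_VERSION << "."
            << THRUST_SUBMINOR_VERSION << std::endl;
  return 0;
}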

In addition, newer versions of Thrust are still available online from the Thrust project page on GitHub.

Official reference documentation:

Usage
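Thrust interoperates directly with raw CUDA memory: a thrust::device_ptr can wrap a pointer obtained from cudaMalloc so that the memory can be passed to Thrust algorithms: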

#include <thrust/device_ptr.h>
#include <thrust/fill.h>

#include <cuda_runtime.h>

size_t N = 10;

// raw pointer to device memory
int * raw_ptr;
cudaMalloc((void **) &raw_ptr, N * sizeof(int));

// wrap raw pointer with a device_ptr
thrust::device_ptr<int> dev_ptr(raw_ptr);

// use device_ptr in thrust algorithms
thrust::fill(dev_ptr, dev_ptr + N, (int) 0);

// free the raw allocation when it is no longer needed
cudaFree(raw_ptr);
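The reverse direction is also supported: thrust::raw_pointer_cast extracts the underlying raw pointer from a device_ptr or from a device_vector, so memory managed by Thrust can be handed back to CUDA runtime calls or kernels. A small self-contained sketch:

#include <thrust/device_vector.h>
#include <thrust/memory.h>

#include <cuda_runtime.h>

int main()
{
  const size_t N = 10;

  // memory owned and managed by a device_vector
  thrust::device_vector<int> d_vec(N);

  // obtain the raw device pointer and use it with the CUDA runtime
  int * raw = thrust::raw_pointer_cast(d_vec.data());
  cudaMemset(raw, 0, N * sizeof(int));

  return 0;
}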

Stream Synchronization

The example below shows how to execute Thrust device algorithms on an explicit CUDA stream. The program fills a vector with the numbers [0, 1000) using thrust::sequence and then performs an inclusive scan (thrust::inclusive_scan) on them; both algorithms run on the same custom CUDA stream via the CUDA execution policies.

Thrust provides two execution policies that accept a CUDA stream and differ in when/whether they synchronize it:

1. thrust::cuda::par.on(stream): the stream is always synchronized before an algorithm returns. This is the default thrust::device policy when compiling with the CUDA device backend.
2. thrust::cuda::par_nosync.on(stream): the stream is synchronized only when necessary for correctness (for example, to return a result from thrust::reduce), and even then this is a hint that an algorithm's implementation may ignore.

#include <thrust/device_vector.h>
#include <thrust/execution_policy.h> // For thrust::cuda::par and thrust::cuda::par_nosync
#include <thrust/scan.h>
#include <thrust/sequence.h>

#include <cuda_runtime.h>

#include <iostream>

// This example shows how to execute a Thrust device algorithm on an explicit
// CUDA stream. The simple program below fills a vector with the numbers
// [0, 1000) (thrust::sequence) and then performs a scan operation
// (thrust::inclusive_scan) on them. Both algorithms are executed on the same
// custom CUDA stream using the CUDA execution policies.
//
// Thrust provides two execution policies that accept CUDA streams that differ
// in when/if they synchronize the stream:
// 1. thrust::cuda::par.on(stream)
//      - `stream` will *always* be synchronized before an algorithm returns.
//      - This is the default `thrust::device` policy when compiling with the
//        CUDA device backend.
// 2. thrust::cuda::par_nosync.on(stream)
//      - `stream` will only be synchronized when necessary for correctness
//        (e.g., returning a result from `thrust::reduce`). This is a hint that
//        may be ignored by an algorithm's implementation.

int main()
{
  thrust::device_vector<int> d_vec(1000);

  // Create the stream:
  cudaStream_t custom_stream;
  cudaError_t err = cudaStreamCreate(&custom_stream);
  if (err != cudaSuccess)
  {
    std::cerr << "Error creating stream: " << cudaGetErrorString(err) << "\n";
    return 1;
  }

  // Construct a new `nosync` execution policy with the custom stream
  auto nosync_exec_policy = thrust::cuda::par_nosync.on(custom_stream);

  // Fill the vector with sequential data.
  // This will execute using the custom stream and the stream will *not* be
  // synchronized before the function returns, meaning asynchronous work may
  // still be executing after returning and the contents of `d_vec` are
  // undefined. Synchronization is not needed here because the following
  // `inclusive_scan` is executed on the same stream and is therefore guaranteed
  // to be ordered after the `sequence`
  thrust::sequence(nosync_exec_policy, d_vec.begin(), d_vec.end());

  // Construct a new *synchronous* execution policy with the same custom stream
  auto sync_exec_policy = thrust::cuda::par.on(custom_stream);

  // Compute in-place inclusive sum scan of data in the vector.
  // This also executes in the custom stream, but the execution policy ensures
  // the stream is synchronized before the algorithm returns. This guarantees
  // there is no pending asynchronous work and the contents of `d_vec` are
  // immediately accessible.
  thrust::inclusive_scan(sync_exec_policy,
                         d_vec.cbegin(),
                         d_vec.cend(),
                         d_vec.begin());

  // This access is only valid because the stream has been synchronized
  int sum = d_vec.back();

  // Free the stream:
  err = cudaStreamDestroy(custom_stream);
  if (err != cudaSuccess)
  {
    std::cerr << "Error destroying stream: " << cudaGetErrorString(err) << "\n";
    return 1;
  }

  // Print the sum:
  std::cout << "sum is " << sum << std::endl;

  return 0;
}
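Assuming the program above is saved as stream_example.cu (a file name chosen here just for illustration), it can be built and run with nvcc, for example nvcc -o stream_example stream_example.cu. Note that thrust::cuda::par_nosync is only available in relatively recent Thrust releases, so an up-to-date CUDA Toolkit is required.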