CUDA 官方手册 Driver API

imoldpan · 2023 年4 月 5 日 13:14

Context

一个CUDA Context（上下文）类似于一个CPU进程。

在驱动程序API中执行的所有资源和操作都封装在CUDA Context中，系统会在Context被销毁时自动清理这些资源。除了诸如modules and texture or surface references之类的对象外，每个Context都有其自己独特的地址空间。因此，来自不同Context的CUdeviceptr值引用不同的内存位置。

主机线程一次只能有一个设备上下文处于当前状态。当使用cuCtxCreate()创建一个上下文时，它将成为调用主机线程的当前状态。在操作处于某个特定上下文中（大多数与设备枚举或环境管理无关的函数）的CUDA函数返回CUDA_ERROR_INVALID_CONTEXT如果没有有效地将该特定内容设置为线程当前状态。

每个主机线程都有一个当前内容堆栈。cuCtxCreate()将新内容推送到堆栈顶部。可以调用cuCtxPopCurrent()从主机线程分离出该内容。然后该内容就是“浮动”的，并且可以作为任何主机线程的当前内容进行推送. cuCtxPopCurrent()还会恢复以前存在过得现行内容（如果存在）。

对于每个上下文也维护着使用计数器. cuCtxCreate()创建具有1使用计数器值得情况. cuCtxAttach()增加了使用计数器，而cuCtxDetach()则减少了使用计数器。当调用cuCtxDetach()或cuCtxDestroy()时，如果使用计数器为0，则销毁上下文。

驱动程序API与运行时是互操作的，并且可以通过cuDevicePrimaryCtxRetain()从驱动程序API访问由运行时管理的主要上下文（请参阅初始化）。

使用计数有助于在同一上下文中操作第三方编写的代码之间实现互操作性。例如，如果加载了三个库以使用相同的内容，则每个库都会调用cuCtxAttach()来增加使用计数，并在完成对该内容的使用后调用cuCtxDetach()来减少使用计数。对于大多数库，预期应用程序将在加载或初始化库之前创建一个上下文；这样，应用程序就可以根据自己的启发式方法创建上下文，并且该库只需在其手头提供给它进行操作即可。希望创建自己内容并不知道其API客户端是否已经创建过自己内容的图书馆将像以下图示所示那样使用 cuCtxPushCurrent () 和 cuCtxPopCurrent () 。

Module

模块是动态可加载的设备代码和数据包，类似于Windows中的DLL，由nvcc输出（请参见使用NVCC编译）。所有符号的名称，包括函数、全局变量以及纹理或表面引用，在模块范围内维护，因此独立第三方编写的模块可以在同一CUDA上下文中进行互操作。

此代码示例加载一个模块并检索某个内核的handle：

CUmodule cuModule;
cuModuleLoad(&cuModule, "myModule.ptx");
CUfunction myKernel;
cuModuleGetFunction(&myKernel, cuModule, "MyKernel");

This code sample compiles and loads a new module from PTX code and parses compilation errors:

#define BUFFER_SIZE 8192
CUmodule cuModule;
CUjit_option options[3];
void* values[3];
char* PTXCode = "some PTX code";
char error_log[BUFFER_SIZE];
int err;
options[0] = CU_JIT_ERROR_LOG_BUFFER;
values[0]  = (void*)error_log;
options[1] = CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES;
values[1]  = (void*)BUFFER_SIZE;
options[2] = CU_JIT_TARGET_FROM_CUCONTEXT;
values[2]  = 0;
err = cuModuleLoadDataEx(&cuModule, PTXCode, 3, options, values);
if (err != CUDA_SUCCESS)
    printf("Link error:\n%s\n", error_log);

This code sample compiles, links, and loads a new module from multiple PTX codes and parses link and compilation errors:

#define BUFFER_SIZE 8192
CUmodule cuModule;
CUjit_option options[6];
void* values[6];
float walltime;
char error_log[BUFFER_SIZE], info_log[BUFFER_SIZE];
char* PTXCode0 = "some PTX code";
char* PTXCode1 = "some other PTX code";
CUlinkState linkState;
int err;
void* cubin;
size_t cubinSize;
options[0] = CU_JIT_WALL_TIME;
values[0] = (void*)&walltime;
options[1] = CU_JIT_INFO_LOG_BUFFER;
values[1] = (void*)info_log;
options[2] = CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES;
values[2] = (void*)BUFFER_SIZE;
options[3] = CU_JIT_ERROR_LOG_BUFFER;
values[3] = (void*)error_log;
options[4] = CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES;
values[4] = (void*)BUFFER_SIZE;
options[5] = CU_JIT_LOG_VERBOSE;
values[5] = (void*)1;
cuLinkCreate(6, options, values, &linkState);
err = cuLinkAddData(linkState, CU_JIT_INPUT_PTX,
                    (void*)PTXCode0, strlen(PTXCode0) + 1, 0, 0, 0, 0);
if (err != CUDA_SUCCESS)
    printf("Link error:\n%s\n", error_log);
err = cuLinkAddData(linkState, CU_JIT_INPUT_PTX,
                    (void*)PTXCode1, strlen(PTXCode1) + 1, 0, 0, 0, 0);
if (err != CUDA_SUCCESS)
    printf("Link error:\n%s\n", error_log);
cuLinkComplete(linkState, &cubin, &cubinSize);
printf("Link completed in %fms. Linker Output:\n%s\n", walltime, info_log);
cuModuleLoadData(cuModule, cubin);
cuLinkDestroy(linkState);

Kernel Execution

cuLaunchKernel() launches a kernel with a given execution configuration.

Parameters are passed either as an array of pointers (next to last parameter of cuLaunchKernel()) where the nth pointer corresponds to the nth parameter and points to a region of memory from which the parameter is copied, or as one of the extra options (last parameter of cuLaunchKernel()).

When parameters are passed as an extra option (the CU_LAUNCH_PARAM_BUFFER_POINTER option), they are passed as a pointer to a single buffer where parameters are assumed to be properly offset with respect to each other by matching the alignment requirement for each parameter type in device code.

Alignment requirements in device code for the built-in vector types are listed in Table 4. For all other basic types, the alignment requirement in device code matches the alignment requirement in host code and can therefore be obtained using __alignof(). The only exception is when the host compiler aligns double and long long (and long on a 64-bit system) on a one-word boundary instead of a two-word boundary (for example, using gcc’s compilation flag -mno-align-double) since in device code these types are always aligned on a two-word boundary.

CUdeviceptr is an integer, but represents a pointer, so its alignment requirement is __alignof(void*).

The following code sample uses a macro (ALIGN_UP()) to adjust the offset of each parameter to meet its alignment requirement and another macro (ADD_TO_PARAM_BUFFER()) to add each parameter to the parameter buffer passed to the CU_LAUNCH_PARAM_BUFFER_POINTER option.

cuLaunchKernel() 通过给定的执行配置启动一个内核。

参数可以作为指针数组（cuLaunchKernel() 的倒数第二个参数）传递，其中第n个指针对应于第n个参数，并指向从中复制该参数的存储器区域；或者作为额外选项之一（cuLaunchKernel() 的最后一个参数）传递。

当将参数作为额外选项（CU_LAUNCH_PARAM_BUFFER_POINTER 选项）传递时，它们被视为指向单个缓冲区的指针，在该缓冲区中假定各个参数相对于彼此正确地偏移，以匹配设备代码中每种参数类型的对齐要求。

内置矢量类型在设备代码中的对齐要求列在表4 中。对于所有其他基本类型，在设备代码中的对齐要求与主机代码中的对齐要求相匹配，因此可以使用 __alignof() 来获取。唯一例外情况是当主机编译器将 double 和 long long(以及64位系统上的 long) 对准到一个字边界而不是两个字边界时(例如使用gcc编译标志 -mno-align-double) ，因为在设备代码中这些类型始终按两个字边界对齐。

CUdeviceptr 是一个整数，但表示一个指针，因此其对齐要求为 __alignof(void*)。

以下代码示例使用宏(ALIGN_UP())来调整每个参数的偏移量以满足其对齐要求，并使用另一个宏(ADD_TO_PARAM_BUFFER())将每个参数添加到传递给 CU_LAUNCH_PARAM_BUFFER_POINTER 选项的参数缓冲区中。

#define ALIGN_UP(offset, alignment) \
      (offset) = ((offset) + (alignment) - 1) & ~((alignment) - 1)

char paramBuffer[1024];
size_t paramBufferSize = 0;

#define ADD_TO_PARAM_BUFFER(value, alignment)                   \
    do {                                                        \
        paramBufferSize = ALIGN_UP(paramBufferSize, alignment); \
        memcpy(paramBuffer + paramBufferSize,                   \
               &(value), sizeof(value));                        \
        paramBufferSize += sizeof(value);                       \
    } while (0)

int i;
ADD_TO_PARAM_BUFFER(i, __alignof(i));
float4 f4;
ADD_TO_PARAM_BUFFER(f4, 16); // float4's alignment is 16
char c;
ADD_TO_PARAM_BUFFER(c, __alignof(c));
float f;
ADD_TO_PARAM_BUFFER(f, __alignof(f));
CUdeviceptr devPtr;
ADD_TO_PARAM_BUFFER(devPtr, __alignof(devPtr));
float2 f2;
ADD_TO_PARAM_BUFFER(f2, 8); // float2's alignment is 8

void* extra[] = {
    CU_LAUNCH_PARAM_BUFFER_POINTER, paramBuffer,
    CU_LAUNCH_PARAM_BUFFER_SIZE,    &paramBufferSize,
    CU_LAUNCH_PARAM_END
};
cuLaunchKernel(cuFunction,
               blockWidth, blockHeight, blockDepth,
               gridWidth, gridHeight, gridDepth,
               0, 0, 0, extra);

结构体的对齐要求等于其各个字段中对齐要求的最大值。包含内置向量类型、CUdeviceptr 或非对齐 double 和 long long 的结构体在设备代码和主机代码中的对齐需求可能不同。这样的结构体也可能有不同的填充方式。例如，下面这个结构体在主机代码中没有任何填充，但是在设备代码中，在字段 f4 之后会填充 12 字节，因为该字段需要 16 字节对齐。

typedef struct {
    float  f;
    float4 f4;
} myStruct;

Interoperability between Runtime and Driver APIs

一个应用程序可以将运行时 API 代码与驱动程序 API 代码混合使用。

如果通过驱动程序 API 创建并设置了上下文，则后续的运行时调用将会选择该上下文，而不是创建一个新的上下文。

如果初始化了运行时（如CUDA Runtime中所述），则可以使用cuCtxGetCurrent()来检索在初始化期间创建的上下文。此上下文可由后续的驱动程序 API 调用使用。

从运行时隐式创建的上下文称为主要上下文（请参见初始化）。它可以通过Primary Context Management函数从驱动程序 API 进行管理。

设备内存可以使用任一API进行分配和释放。 CUdeviceptr 可以转换为常规指针，反之亦然：

CUdeviceptr devPtr;
float* d_data;

// Allocation using driver API
cuMemAlloc(&devPtr, size);
d_data = (float*)devPtr;

// Allocation using runtime API
cudaMalloc(&d_data, size);
devPtr = (CUdeviceptr)d_data;

特别地，这意味着使用驱动程序API编写的应用程序可以调用使用运行时API编写的库（例如 cuFFT、cuBLAS 等）。

参考手册中设备和版本管理部分中所有函数都可互换地使用。

imoldpan · 2023 年4 月 5 日 14:26

参考：

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html?highlight=context#context