Starting with CUDA 10.1, NVIDIA introduced a new *library* called cuBLASLt, as an alternative to the cuBLAS *function* `cublasGemmEx` (and related `gemm` routines). The reasons for using cuBLASLt include, but are not limited to:

- To be able to specify the CUDA stream, and cuBLAS workspace memory, on a per-function-call basis.
- To make use of certain NVIDIA-provided epilogues fused into the matrix multiplication kernel.
- To make use of integer tensor cores on sufficiently modern GPUs.

On the other hand, there’s one big reason for *not* using cuBLASLt: it is *significantly* more complicated to use than `cublasGemmEx`.

To start, a `cublasLtHandle_t` is required - this can come from `cublasLtCreate`, or an existing `cublasHandle_t` can be cast to `cublasLtHandle_t`. However, lots of setup needs to be performed before `cublasLtMatmul` can be used to actually compute a matrix multiplication. The first piece of setup is to initialise a `cublasLtMatmulDesc_t` object describing some of the attributes of the matrix multiplication. Such an object is created by `cublasLtMatmulDescCreate`, followed by zero or more calls to `cublasLtMatmulDescSetAttribute`. Some of the common attributes on this object are:

Attribute | Default Value |
---|---|
`CUBLASLT_MATMUL_DESC_COMPUTE_TYPE` | `cublasLtMatmulDescCreate` parameter |
`CUBLASLT_MATMUL_DESC_SCALE_TYPE` | `cublasLtMatmulDescCreate` parameter |
`CUBLASLT_MATMUL_DESC_POINTER_MODE` | `CUBLASLT_POINTER_MODE_HOST` |
`CUBLASLT_MATMUL_DESC_TRANSA` | `CUBLAS_OP_N` |
`CUBLASLT_MATMUL_DESC_TRANSB` | `CUBLAS_OP_N` |
`CUBLASLT_MATMUL_DESC_TRANSC` | `CUBLAS_OP_N` |
`CUBLASLT_MATMUL_DESC_EPILOGUE` | `CUBLASLT_EPILOGUE_DEFAULT` (none) |
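
As a minimal sketch (assuming CUDA 11 or later, where `cublasLtMatmulDescCreate` takes a `cublasComputeType_t` plus a separate scale type), creating a descriptor for an FP32 multiplication with a fused ReLU epilogue might look like the following; the helper name is illustrative and error checking is elided:

```cpp
#include <cublasLt.h>

// Hypothetical helper: build a matmul descriptor for FP32 computation with a
// fused ReLU epilogue. The transpose attributes are left at their CUBLAS_OP_N
// defaults from the table above.
cublasLtMatmulDesc_t make_matmul_desc() {
  cublasLtMatmulDesc_t desc;
  cublasLtMatmulDescCreate(&desc, CUBLAS_COMPUTE_32F, CUDA_R_32F);
  cublasLtEpilogue_t epilogue = CUBLASLT_EPILOGUE_RELU;
  cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_EPILOGUE,
                                 &epilogue, sizeof(epilogue));
  return desc;
}
```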

Next up, a `cublasLtMatrixLayout_t` object needs to be initialised for each of the three matrix shapes involved in the matrix multiplication. Such an object is created by `cublasLtMatrixLayoutCreate`, followed by zero or more calls to `cublasLtMatrixLayoutSetAttribute`. Some of the common attributes on this object are:

Attribute | Default Value |
---|---|
`CUBLASLT_MATRIX_LAYOUT_TYPE` | `cublasLtMatrixLayoutCreate` parameter |
`CUBLASLT_MATRIX_LAYOUT_ROWS` | `cublasLtMatrixLayoutCreate` parameter |
`CUBLASLT_MATRIX_LAYOUT_COLS` | `cublasLtMatrixLayoutCreate` parameter |
`CUBLASLT_MATRIX_LAYOUT_LD` | `cublasLtMatrixLayoutCreate` parameter |
`CUBLASLT_MATRIX_LAYOUT_ORDER` | `CUBLASLT_ORDER_COL` |
`CUBLASLT_MATRIX_LAYOUT_BATCH_COUNT` | `1` (not batched) |
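
Continuing the sketch for a plain column-major `D = A @ B` with `C == D` (so three layouts suffice, and the `CUBLASLT_ORDER_COL` default applies), layout creation might look like this; names are illustrative and error checking is elided:

```cpp
#include <cublasLt.h>
#include <cstdint>

// Hypothetical helper: layouts for m-by-k A, k-by-n B, and m-by-n C/D, all
// FP32, column-major, with leading dimension equal to the row count.
void make_layouts(uint64_t m, uint64_t n, uint64_t k,
                  cublasLtMatrixLayout_t* a, cublasLtMatrixLayout_t* b,
                  cublasLtMatrixLayout_t* c) {
  cublasLtMatrixLayoutCreate(a, CUDA_R_32F, m, k, m);
  cublasLtMatrixLayoutCreate(b, CUDA_R_32F, k, n, k);
  cublasLtMatrixLayoutCreate(c, CUDA_R_32F, m, n, m);
}
```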

The third thing which `cublasLtMatmul` requires is a `cublasLtMatmulAlgo_t`, but such a thing isn’t created directly. Instead, a `cublasLtMatmulPreference_t` object is initialised, and `cublasLtMatmulAlgoGetHeuristic` is used to take all of the previously created objects and spit out a list of potential `cublasLtMatmulAlgo_t` objects. A `cublasLtMatmulPreference_t` object is created by `cublasLtMatmulPreferenceCreate`, followed by zero or more calls to `cublasLtMatmulPreferenceSetAttribute`. Some of the common attributes on this object are:

Attribute | Default Value |
---|---|
`CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES` | `0` |
`CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_A_BYTES` | `256` |
`CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_B_BYTES` | `256` |
`CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_C_BYTES` | `256` |
`CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_D_BYTES` | `256` |
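
In practice a preference object often only needs the workspace limit raised from its default of `0`; a sketch (names illustrative, error checking elided):

```cpp
#include <cublasLt.h>
#include <cstddef>

// Hypothetical helper: allow the heuristic to consider algorithms needing up
// to workspace_bytes (e.g. 4 MiB) of scratch memory. The actual workspace
// buffer is passed later, to cublasLtMatmul itself.
cublasLtMatmulPreference_t make_preference(size_t workspace_bytes) {
  cublasLtMatmulPreference_t pref;
  cublasLtMatmulPreferenceCreate(&pref);
  cublasLtMatmulPreferenceSetAttribute(
      pref, CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES,
      &workspace_bytes, sizeof(workspace_bytes));
  return pref;
}
```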

With all these objects in hand, `cublasLtMatmulAlgoGetHeuristic` can be called to populate an array of `cublasLtMatmulHeuristicResult_t` objects. Once populated, the `algo` field is a ready-to-use `cublasLtMatmulAlgo_t`. This field is relatively opaque, but some information about it can be obtained if desired. This information comes in three places: firstly there are other fields on `cublasLtMatmulHeuristicResult_t` (namely `wavesCount` and `workspaceSize`), secondly read-only attributes can be queried using `cublasLtMatmulAlgoCapGetAttribute` (e.g. `CUBLASLT_ALGO_CAP_NUMERICAL_IMPL_FLAGS` and `CUBLASLT_ALGO_CAP_MIN_ALIGNMENT_A_BYTES` through `CUBLASLT_ALGO_CAP_MIN_ALIGNMENT_D_BYTES`), and thirdly read-write attributes can be queried using `cublasLtMatmulAlgoConfigGetAttribute`.
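
Putting the pieces together, a sketch of the heuristic query (passing the same layout for both C and D, matching the `C == D` case; names illustrative, error checking elided):

```cpp
#include <cublasLt.h>

// Hypothetical helper: ask for up to 8 candidate algorithms and return the
// first (best-ranked) result. n_results == 0 would mean that no algorithm
// supports this particular configuration.
cublasLtMatmulHeuristicResult_t pick_algo(
    cublasLtHandle_t lt, cublasLtMatmulDesc_t desc,
    cublasLtMatrixLayout_t a, cublasLtMatrixLayout_t b,
    cublasLtMatrixLayout_t c, cublasLtMatmulPreference_t pref) {
  cublasLtMatmulHeuristicResult_t results[8];
  int n_results = 0;
  cublasLtMatmulAlgoGetHeuristic(lt, desc, a, b, /*Cdesc=*/c, /*Ddesc=*/c,
                                 pref, 8, results, &n_results);
  return results[0];  // A real caller should check n_results > 0 first.
}
```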

With a `cublasLtMatmulAlgo_t` object chosen (typically the front of the array passed to `cublasLtMatmulAlgoGetHeuristic`), `cublasLtMatmul` can finally be used. This function computes `D = alpha * (A @ B) + beta * C`, which when `C == D` is equivalent to `cublasGemmEx`.
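
A sketch of the final call, with `C` and `D` aliased to recover the `cublasGemmEx` behaviour (names illustrative, error checking elided):

```cpp
#include <cublasLt.h>

// Hypothetical helper: compute C = alpha * (A @ B) + beta * C. The workspace
// must be at least result.workspaceSize bytes, and alpha/beta are read from
// host memory because the descriptor's pointer mode was left at its
// CUBLASLT_POINTER_MODE_HOST default.
void run_matmul(cublasLtHandle_t lt, cublasLtMatmulDesc_t desc,
                const float* a, cublasLtMatrixLayout_t a_layout,
                const float* b, cublasLtMatrixLayout_t b_layout,
                float* c, cublasLtMatrixLayout_t c_layout,
                const cublasLtMatmulHeuristicResult_t& result,
                void* workspace, size_t workspace_bytes,
                cudaStream_t stream) {
  float alpha = 1.0f, beta = 0.0f;
  cublasLtMatmul(lt, desc, &alpha,
                 a, a_layout, b, b_layout, &beta,
                 c, c_layout,   // C: the input to the beta term
                 c, c_layout,   // D: the output, aliasing C
                 &result.algo, workspace, workspace_bytes, stream);
}
```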

That concludes basic usage notes, but if `CUBLAS_COMPUTE_32I` (or `CUBLAS_COMPUTE_32I_PEDANTIC`) is being used, then there’s another whole chapter of usage notes. This chapter starts by noting that the list of supported configurations for integer matrix multiplication is (at least currently) very limited:

`CUDA_R_32I` destination, computation using legacy CUDA cores:

- `scaleType` must be `CUDA_R_32I`, but only `0` or `1` supported
- A matrix must be `CUDA_R_8I` with either `CUBLASLT_ORDER_COL` or `CUBLASLT_ORDER_ROW`
- B matrix must be `CUDA_R_8I` with either `CUBLASLT_ORDER_COL` or `CUBLASLT_ORDER_ROW`
- C matrix must be `CUDA_R_32I` with either `CUBLASLT_ORDER_COL` or `CUBLASLT_ORDER_ROW`
- `CUBLASLT_MATMUL_DESC_EPILOGUE` must be `CUBLASLT_EPILOGUE_DEFAULT`

`CUDA_R_8I` destination, computation using legacy CUDA cores:

- `scaleType` must be `CUDA_R_32F`
- A matrix must be `CUDA_R_8I` with `CUBLAS_OP_T`
- B matrix must be `CUDA_R_8I` with `CUBLAS_OP_N`
- C matrix must be `CUDA_R_8I`
- `CUBLASLT_MATMUL_DESC_EPILOGUE` must be `CUBLASLT_EPILOGUE_DEFAULT`

`CUDA_R_32I` destination, computation using integer tensor cores:

- `scaleType` must be `CUDA_R_32I`, but only `0` or `1` supported
- A matrix must be `CUDA_R_8I` with `CUBLAS_OP_N` and `CUBLASLT_ORDER_COL32`
- B matrix must be `CUDA_R_8I` with `CUBLAS_OP_T` and `CUBLASLT_ORDER_COL4_4R2_8C` (Turing or Ampere) or `CUBLASLT_ORDER_COL32_2R_4R4` (Ampere)
- C matrix must be `CUDA_R_32I` with `CUBLAS_OP_N` and `CUBLASLT_ORDER_COL32`
- `CUBLASLT_MATMUL_DESC_EPILOGUE` must be `CUBLASLT_EPILOGUE_DEFAULT`

`CUDA_R_8I` destination, computation using integer tensor cores:

- `scaleType` must be `CUDA_R_32F`
- A matrix must be `CUDA_R_8I` with `CUBLAS_OP_N` and `CUBLASLT_ORDER_COL32`
- B matrix must be `CUDA_R_8I` with `CUBLAS_OP_T` and `CUBLASLT_ORDER_COL4_4R2_8C` (Turing or Ampere) or `CUBLASLT_ORDER_COL32_2R_4R4` (Ampere)
- C matrix must be `CUDA_R_8I` with `CUBLAS_OP_N` and `CUBLASLT_ORDER_COL32`
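
To make the third configuration concrete, here is a hedged sketch of its layout objects, loosely following NVIDIA's `LtIgemmTensor` sample. The leading-dimension formulas for the exotic orders (`32 * m` for `CUBLASLT_ORDER_COL32`, `32 * roundUp(n, 8)` for `CUBLASLT_ORDER_COL4_4R2_8C`) are taken from that sample, the `CUBLAS_OP_T` on B is set on the matmul descriptor rather than here, and error checking is elided:

```cpp
#include <cublasLt.h>
#include <cstdint>

static int64_t round_up(int64_t x, int64_t multiple) {
  return ((x + multiple - 1) / multiple) * multiple;
}

// Hypothetical helper: op(A) is m-by-k, B is stored n-by-k (used with
// CUBLAS_OP_T so that op(B) is k-by-n), and C is m-by-n.
void make_int8_tensor_core_layouts(int64_t m, int64_t n, int64_t k,
                                   cublasLtMatrixLayout_t* a,
                                   cublasLtMatrixLayout_t* b,
                                   cublasLtMatrixLayout_t* c) {
  cublasLtOrder_t col32 = CUBLASLT_ORDER_COL32;
  cublasLtOrder_t col4_4r2_8c = CUBLASLT_ORDER_COL4_4R2_8C;

  cublasLtMatrixLayoutCreate(a, CUDA_R_8I, m, k, 32 * m);
  cublasLtMatrixLayoutSetAttribute(*a, CUBLASLT_MATRIX_LAYOUT_ORDER,
                                   &col32, sizeof(col32));

  cublasLtMatrixLayoutCreate(b, CUDA_R_8I, n, k, 32 * round_up(n, 8));
  cublasLtMatrixLayoutSetAttribute(*b, CUBLASLT_MATRIX_LAYOUT_ORDER,
                                   &col4_4r2_8c, sizeof(col4_4r2_8c));

  cublasLtMatrixLayoutCreate(c, CUDA_R_32I, m, n, 32 * m);
  cublasLtMatrixLayoutSetAttribute(*c, CUBLASLT_MATRIX_LAYOUT_ORDER,
                                   &col32, sizeof(col32));
}
```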

Of particular note are the strange `CUBLASLT_MATRIX_LAYOUT_ORDER` values required for using integer tensor cores. In C/C++ terminology, `CUBLASLT_ORDER_COL` can be thought of as a two-dimensional array indexed as `[I][J]`, with the `[J]` part packed densely, and the leading dimension specifying how to advance by one in the `[I]` part. In the same terminology, `CUBLASLT_ORDER_COL32` can be thought of as a three-dimensional array indexed as `[I/32][J][I%32]`, with the `[J][I%32]` part packed densely, and the leading dimension specifying how to advance by one in the `[I/32]` part.
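
In code, the offset arithmetic implied by these descriptions is roughly the following, where `I` is the column index and `J` the row index (so `CUBLASLT_ORDER_COL` is ordinary column-major storage):

```cpp
#include <cstdint>

// CUBLASLT_ORDER_COL: [I][J], with [J] packed densely.
int64_t offset_col(int64_t i, int64_t j, int64_t ld) {
  return i * ld + j;
}

// CUBLASLT_ORDER_COL32: [I/32][J][I%32], with [J][I%32] packed densely.
// ld here is typically 32 * (number of rows).
int64_t offset_col32(int64_t i, int64_t j, int64_t ld) {
  return (i / 32) * ld + j * 32 + (i % 32);
}
```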

The `CUBLASLT_ORDER_COL4_4R2_8C` and `CUBLASLT_ORDER_COL32_2R_4R4` layouts are even more exotic. Rather than trying to explain their layout, it is best to consider them to be completely opaque. Thankfully, the `cublasLtMatrixTransform` function is provided to convert between layouts, so a matrix can be constructed using a known simple layout (such as `CUBLASLT_ORDER_COL` or `CUBLASLT_ORDER_ROW`) and then converted to `CUBLASLT_ORDER_COL4_4R2_8C` or `CUBLASLT_ORDER_COL32_2R_4R4` using `cublasLtMatrixTransform`. To use `cublasLtMatrixTransform`, a `cublasLtMatrixTransformDesc_t` object is required. Such an object is created by `cublasLtMatrixTransformDescCreate`, followed by zero or more calls to `cublasLtMatrixTransformDescSetAttribute`. Some of the common attributes on this object are:

Attribute | Default Value |
---|---|
`CUBLASLT_MATRIX_TRANSFORM_DESC_SCALE_TYPE` | `cublasLtMatrixTransformDescCreate` parameter |
`CUBLASLT_MATRIX_TRANSFORM_DESC_POINTER_MODE` | `CUBLASLT_POINTER_MODE_HOST` |
`CUBLASLT_MATRIX_TRANSFORM_DESC_TRANSA` | `CUBLAS_OP_N` |
`CUBLASLT_MATRIX_TRANSFORM_DESC_TRANSB` | `CUBLAS_OP_N` |
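
A sketch of converting an int8 matrix between two layouts (for example from `CUBLASLT_ORDER_ROW` into `CUBLASLT_ORDER_COL4_4R2_8C`); following NVIDIA's samples, the unused `B` input can be passed as `NULL` when `beta` is zero. Names are illustrative and error checking is elided:

```cpp
#include <cublasLt.h>
#include <cstdint>

// Hypothetical helper: dst = 1.0 * src, where src_layout and dst_layout
// describe the same rows/cols but differ in CUBLASLT_MATRIX_LAYOUT_ORDER
// (and hence in leading dimension).
void convert_layout(cublasLtHandle_t lt,
                    const int8_t* src, cublasLtMatrixLayout_t src_layout,
                    int8_t* dst, cublasLtMatrixLayout_t dst_layout,
                    cudaStream_t stream) {
  cublasLtMatrixTransformDesc_t desc;
  cublasLtMatrixTransformDescCreate(&desc, CUDA_R_32F);
  float alpha = 1.0f, beta = 0.0f;
  cublasLtMatrixTransform(lt, desc, &alpha, src, src_layout,
                          &beta, /*B=*/nullptr, /*Bdesc=*/nullptr,
                          dst, dst_layout, stream);
  cublasLtMatrixTransformDescDestroy(desc);
}
```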

`cublasLtMatrixTransform` can also be used to convert in or out of `CUBLASLT_ORDER_COL32`, but said conversions obviously come with a time cost, so it is better to keep matrices in the `CUBLASLT_ORDER_COL32` format and perform all operations on them in that layout. This might mean rewriting a bunch of CUDA kernels to understand the layout, if said kernels care about the two-dimensional structure of the matrix.