- 输入输出通道可以被8整除（FP16）或者被4整除（TF32）的时候，可以更好地利用TensorCore。 For the first convolutional layer in most CNNs where the input tensor consists of 3-channel images, padding to 4 channels is sufficient if a stride of 2 is used; see Channels In And Out.
- batch size以及输入输出通道最好被64整除或者被256整除（最好了），可以enable efficient tiling and reduce overhead; see Quantization Effects.
- Larger values for size-related parameters (batch size, input and output height and width, and the number of input and output channels) can improve parallelization. As with fully-connected layers, this speeds up an operation’s efficiency, but does not reduce its absolute duration; see How Convolution Parameters Affect Performance and subsections.
- NVIDIA® libraries offer a set of different convolution algorithms with different performance behaviors, dependent on the convolution’s parameters. When the size of the input processed by the network is the same in each iteration, autotuning is an efficient method to ensure the selection of the ideal algorithm for each convolution in the network. For TensorFlow, autotuning is enabled by default. For PyTorch, enable autotuning by adding
torch.backends.cudnn.benchmark = Trueto your code.
- Choose tensor layouts in memory to avoid transposing input and output data. There are two major conventions, each named for the order of dimensions: NHWC and NCHW. We recommend using the NHWC format where possible. Additional details, including framework support, can be found in Tensor Layouts In Memory: NCHW vs NHWC.