Performance of matrix-vector multiplication in CUDA

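I found some code for matrix-vector multiplication in CUDA in an earlier topic: Matrix-vector multiplication in CUDA: benchmarking & performance. First, I would like to know why the author did not use shared memory for dA (the matrix).

Second, why is column-major ordering faster than row-major ordering?

Here is the code: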

template<typename T> 
__global__ void matvec_kernel(const T * __restrict__ dA, const T * __restrict__ dx, T * __restrict__ dy, const unsigned int nRows, const unsigned int nCols) 
{ 
    const unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x; 

    __shared__ T x_shared[BLOCK_SIZE]; 

    T y_val = 0.0; 

    #pragma unroll 
    for (unsigned int m = 0; m < ((nCols + BLOCK_SIZE - 1) / BLOCK_SIZE); ++m)
    {
        if ((m * BLOCK_SIZE + threadIdx.x) < nCols)
            x_shared[threadIdx.x] = dx[threadIdx.x + m * BLOCK_SIZE];
        else
            x_shared[threadIdx.x] = 0.f;
        __syncthreads();

        #pragma unroll
        for (unsigned int e = 0; e < BLOCK_SIZE; ++e) {
            // --- Column-major ordering - faster
            y_val += dA[tid + (e + BLOCK_SIZE * m) * nRows] * x_shared[e];
            // --- Row-major ordering - slower
            //y_val += dA[tid * nCols + (e + BLOCK_SIZE * m)] * x_shared[e];
        }

        __syncthreads();
    }

    if (tid < nRows) dy[tid] = y_val;

}

I have been thinking about these two questions for a day now, which is why I am asking here.

Thanks a lot!

Source

2016-07-28 Titouan Parcollet

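The shared memory here is used as a cache. Each element of the vector is read many times (every thread in a block reads every element of the x_shared tile), whereas each element of the matrix is read only once, by a single thread, during the computation. That is why the code caches only the vector and not the matrix.

Column-major is faster because, when the matrix is read, the threads are arranged along the matrix columns. In the column-major line, neighbouring threads read dA[tid + k * nRows], so consecutive tid values touch consecutive addresses and the accesses of a warp can be served as coalesced global memory access. In the row-major line, neighbouring threads read dA[tid * nCols + k], which are nCols elements apart, so the accesses cannot be coalesced. If the matrix is row-major, the CUDA kernel should be implemented differently to achieve maximum performance.

Source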

2016-07-28 17:59:13 kangshiyin

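Perfect answer! Thanks a lot!

So, to get maximum performance with row-major ordering, I would need to use threadIdx.y and nRows instead of threadIdx.x and nCols (in the matrix-reading stage)?

@TitouanParcollet No. It would be quite different from the kernel above. The kernel above uses one *thread* per matrix row, which is actually not optimal in terms of performance unless the matrix is very large. For a row-major matrix, you could use one *thread block* per matrix row and compute the row sum with a parallel reduction. – kangshiyin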
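
A minimal sketch of the block-per-row idea from the comment above, for illustration only. The names matvec_rowmajor_kernel and matvec_rowmajor, the block size of 256, and the float-only host wrapper are assumptions made for this example and do not come from the original code. Error checking on the CUDA calls is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Assumed block size; must be a power of two for the tree reduction below.
constexpr unsigned int BLOCK_SIZE = 256;

// One thread block per matrix row; A is row-major: A[row * nCols + col].
template <typename T>
__global__ void matvec_rowmajor_kernel(const T* __restrict__ dA,
                                       const T* __restrict__ dx,
                                       T* __restrict__ dy,
                                       const unsigned int nRows,
                                       const unsigned int nCols)
{
    __shared__ T partial[BLOCK_SIZE];

    const unsigned int row = blockIdx.x;
    if (row >= nRows) return;  // uniform for the whole block, so safe before __syncthreads()

    // Consecutive threads read consecutive elements of the row,
    // so the global loads from dA are coalesced.
    T sum = 0;
    for (unsigned int col = threadIdx.x; col < nCols; col += BLOCK_SIZE)
        sum += dA[static_cast<std::size_t>(row) * nCols + col] * dx[col];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Parallel (tree) reduction of the per-thread partial sums in shared memory.
    for (unsigned int s = BLOCK_SIZE / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0) dy[row] = partial[0];
}

// Hypothetical host wrapper: copies the data to the device, launches one block
// per row, and copies the result back. No error checking (sketch only).
std::vector<float> matvec_rowmajor(const std::vector<float>& A,
                                   const std::vector<float>& x,
                                   unsigned int nRows, unsigned int nCols)
{
    float *dA = nullptr, *dx = nullptr, *dy = nullptr;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dx, x.size() * sizeof(float));
    cudaMalloc(&dy, nRows * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, x.data(), x.size() * sizeof(float), cudaMemcpyHostToDevice);

    matvec_rowmajor_kernel<float><<<nRows, BLOCK_SIZE>>>(dA, dx, dy, nRows, nCols);

    std::vector<float> y(nRows);
    cudaMemcpy(y.data(), dy, nRows * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dx);
    cudaFree(dy);
    return y;
}
```

Compared with the one-thread-per-row kernel in the question, this launches nRows blocks instead of roughly nRows / BLOCK_SIZE, so it can keep the GPU busy even when the matrix has few rows. The vector dx is read directly from global memory here; whether caching it in shared memory helps would have to be measured.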
