Time taken to copy a matrix to the host increases with how many times the matrix was used

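I am benchmarking GPU matrix multiplication with PyCUDA, CUDAMat, and Numba, and I have run into behavior that I cannot explain.
I time three steps independently: sending the two matrices to device memory, computing the dot product, and copying the result back to host memory.
The dot product step is benchmarked in a loop, because my application will perform many multiplications before sending the result back.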

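As I increase the number of loop iterations, the dot product time increases linearly, as expected. What I cannot understand is that the time taken to send the final result back to host memory also increases linearly with the loop count, even though only a single matrix is copied back. The size of the result is constant no matter how many matrix multiplications the loop performs, so this makes no sense. It behaves as if returning the final result required returning every intermediate result from each step of the loop.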

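One interesting thing to note is that the increase in time has a peak. Once I go above roughly 1000 dot products in the loop, the time taken to copy the final result back stops growing. Another is that if I reinitialize the matrix that holds the result inside the dot product loop, this behavior stops, and the copy-back time is the same no matter how many multiplications are performed.
For example: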

for i in range(1000): 
    gc = gpuarray.empty((MATRIX_SIZE, MATRIX_SIZE), np.float32) 
    matrixmul(ga, gb, gc, grid=(MATRIX_SIZE // TILE_SIZE, MATRIX_SIZE // TILE_SIZE), block=(TILE_SIZE, TILE_SIZE, 1)) 
result = gc.get()

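One last thing to note is that this happens with both PyCUDA and Numba, but not with CUDAMat. With CUDAMat I can do a million multiplications and retrieving the final result still takes the same amount of time. CUDAMat has a built-in matrix multiplication, which may be why, but for PyCUDA and Numba I am using the matrix multiplication code provided in their own documentation.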

Here is my PyCUDA code:

from __future__ import division 
import numpy as np 
from pycuda import driver, compiler, gpuarray, tools 
import time 
import pycuda.autoinit 

kernel_code_template = """ 
__global__ void MatrixMulKernel(float *A, float *B, float *C) 
{ 

    const int wA = %(MATRIX_SIZE)s; 
    const int wB = %(MATRIX_SIZE)s; 

    // Block index 
    const int bx = blockIdx.x; 
    const int by = blockIdx.y; 

    // Thread index 
    const int tx = threadIdx.x; 
    const int ty = threadIdx.y; 

    // Index of the first sub-matrix of A processed by the block 
    const int aBegin = wA * %(BLOCK_SIZE)s * by; 
    // Index of the last sub-matrix of A processed by the block 
    const int aEnd = aBegin + wA - 1; 
    // Step size used to iterate through the sub-matrices of A 
    const int aStep = %(BLOCK_SIZE)s; 

    // Index of the first sub-matrix of B processed by the block 
    const int bBegin = %(BLOCK_SIZE)s * bx; 
    // Step size used to iterate through the sub-matrices of B 
    const int bStep = %(BLOCK_SIZE)s * wB; 

    // The element of the block sub-matrix that is computed 
    // by the thread 
    float Csub = 0; 
    // Loop over all the sub-matrices of A and B required to 
    // compute the block sub-matrix 
    for (int a = aBegin, b = bBegin; 
     a <= aEnd; 
     a += aStep, b += bStep) 
    { 
     // Shared memory for the sub-matrix of A 
     __shared__ float As[%(BLOCK_SIZE)s][%(BLOCK_SIZE)s]; 
     // Shared memory for the sub-matrix of B 
     __shared__ float Bs[%(BLOCK_SIZE)s][%(BLOCK_SIZE)s]; 

     // Load the matrices from global memory to shared memory 
     // each thread loads one element of each matrix 
     As[ty][tx] = A[a + wA * ty + tx]; 
     Bs[ty][tx] = B[b + wB * ty + tx]; 
     // Synchronize to make sure the matrices are loaded 
     __syncthreads(); 

     // Multiply the two matrices together; 
     // each thread computes one element 
     // of the block sub-matrix 
     for (int k = 0; k < %(BLOCK_SIZE)s; ++k) 
     Csub += As[ty][k] * Bs[k][tx]; 

     // Synchronize to make sure that the preceding 
     // computation is done before loading two new 
     // sub-matrices of A and B in the next iteration 
     __syncthreads(); 
    } 

    // Write the block sub-matrix to global memory; 
    // each thread writes one element 
    const int c = wB * %(BLOCK_SIZE)s * by + %(BLOCK_SIZE)s * bx; 
    C[c + wB * ty + tx] = Csub; 
} 
""" 


MATRIX_SIZE = 512 
TILE_SIZE = 8 
BLOCK_SIZE = TILE_SIZE 
np.random.seed(100) 
a_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32) 
b_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32) 

kernel_code = kernel_code_template % { 
    'MATRIX_SIZE': MATRIX_SIZE, 
    'BLOCK_SIZE': BLOCK_SIZE, 
} 
mod = compiler.SourceModule(kernel_code) 
matrixmul = mod.get_function("MatrixMulKernel") 


#copy to device memory 
total = time.clock() 
ga = gpuarray.to_gpu(a_cpu) 
gb = gpuarray.to_gpu(b_cpu) 
gc = gpuarray.empty((MATRIX_SIZE, MATRIX_SIZE), np.float32) 
copy_to = time.clock() - total 

#matrix multiplication 
mult = time.clock() 
for i in range(1000): 
    matrixmul(ga, gb, gc, grid=(MATRIX_SIZE // TILE_SIZE, MATRIX_SIZE // TILE_SIZE), block=(TILE_SIZE, TILE_SIZE, 1)) 
mult = time.clock() - mult 

#copy result back to host memory 
copy_from = time.clock() 
res = gc.get() 
copy_from = time.clock() - copy_from 
total = time.clock() - total 

#print out times for all 3 steps and the total time taken 
print(copy_to) 
print(mult) 
print(copy_from) 
print(total)

Source

2017-09-01 Frobot

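I had thought of something like that, but did not know how to search for it. This works great. If you want to post it as an answer, I will accept it. – Frobot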

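GPU kernel launches are asynchronous. This means that the measurement you think you are taking around the for loop (the time spent doing the multiplications) is not really that. It is only the time it takes to issue the kernel launches into a queue.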

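The actual kernel execution time gets "absorbed" into your final measurement of the device->host copy time, because the D->H copy forces all queued kernels to complete before it begins, and it blocks the CPU thread while it waits.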

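Regarding the "peak" behavior: once you have launched enough kernels into the queue, launching eventually stops being asynchronous and starts blocking the CPU thread, so your "execution time" measurement begins to rise. That explains why the copy time levels off at a peak.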
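The effect can be reproduced without a GPU. The following is a CPU-only analogy, not PyCUDA code: `FakeStream`, `launch`, and `benchmark` are hypothetical names invented for illustration, and a worker thread with `time.sleep` stands in for the device executing kernels. It assumes only that launches return immediately and that fetching the result waits for all queued work.

```python
import queue
import threading
import time


class FakeStream:
    """CPU-only stand-in for a GPU stream (hypothetical, for illustration)."""

    def __init__(self):
        self._work = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        # Worker thread: executes queued "kernels" one after another.
        while True:
            seconds = self._work.get()
            time.sleep(seconds)  # pretend the kernel is running
            self._work.task_done()

    def launch(self, seconds):
        # Returns immediately, like an asynchronous kernel launch.
        self._work.put(seconds)

    def synchronize(self):
        # Blocks until every queued "kernel" has finished.
        self._work.join()

    def get(self):
        # A device->host copy has to wait for pending work first.
        self.synchronize()
        return "result"


def benchmark(n_kernels, sync_before_stop, kernel_seconds=0.002):
    """Return (mult_time, copy_time) measured the way the question does."""
    stream = FakeStream()

    start = time.perf_counter()
    for _ in range(n_kernels):
        stream.launch(kernel_seconds)
    if sync_before_stop:
        stream.synchronize()
    mult = time.perf_counter() - start

    start = time.perf_counter()
    stream.get()
    copy_from = time.perf_counter() - start
    return mult, copy_from


if __name__ == "__main__":
    print("no sync:  ", benchmark(50, sync_before_stop=False))
    print("with sync:", benchmark(50, sync_before_stop=True))
```

Without the synchronize call, almost all of the elapsed time shows up in the copy measurement. With it, the time moves to the multiplication measurement and the copy becomes nearly free.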

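To "fix" this, insert a call to PyCUDA's driver.Context.synchronize() immediately after your for loop and before this line:

mult = time.clock() - mult

With that in place, you will see the measured execution time grow as you increase the loop count, while the D->H copy time stays constant.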
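One way to make this harder to forget is to wrap the timing in a small helper that always synchronizes before stopping the clock. This is only a sketch: `timed` is a hypothetical helper, not part of PyCUDA or Numba, and the synchronize callable is passed in so the same code works with either library. It uses `time.perf_counter` because `time.clock` was removed in Python 3.8.

```python
import time


def timed(work, synchronize=None):
    """Return wall-clock seconds taken by work().

    If synchronize is given, it is called before the clock stops so that
    asynchronously queued device work is included in the measurement.
    (Hypothetical helper, not part of any GPU library.)
    """
    start = time.perf_counter()
    work()
    if synchronize is not None:
        synchronize()  # wait for queued device work to finish
    return time.perf_counter() - start
```

With the names from the question, usage would look roughly like `mult = timed(run_kernels, driver.Context.synchronize)` for PyCUDA, or `timed(run_kernels, numba.cuda.synchronize)` for Numba, where `run_kernels` is assumed to be a function containing the for loop.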

Source

2017-09-01 00:51:18

And for anyone using Numba with the same problem, you can call numba.cuda.synchronize(). – Frobot
