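I am benchmarking GPU matrix multiplication with PyCUDA, CUDAMat, and Numba, and I have run into some behavior that I cannot explain.
I time three steps independently: sending the two matrices to device memory, computing the dot product, and copying the result back to host memory.
The dot product step is benchmarked in a loop, because my application will perform many multiplications before sending the result back. In short, the time taken to copy the matrix back to the host grows with the number of times the matrix was used.
As I increase the number of loop iterations, the dot product time increases linearly, as expected. What I cannot understand is that the time needed to send the final result back to host memory also increases linearly with the loop count, even though only one matrix is copied back. The size of the result is constant no matter how many multiplications the loop performs, so this makes no sense. It behaves as if returning the final result required returning every intermediate result from each step of the loop.
One interesting thing to note is that the increase has a ceiling: once I go beyond roughly 1000 dot products in the loop, the time needed to copy the final result back reaches a peak. Another is that if I reinitialize the matrix that holds the result inside the dot product loop, the behavior stops, and the copy-back time is the same no matter how many multiplications are performed.
For example: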
for i in range(1000):
    gc = gpuarray.empty((MATRIX_SIZE, MATRIX_SIZE), np.float32)
    matrixmul(ga, gb, gc, grid=(MATRIX_SIZE // TILE_SIZE, MATRIX_SIZE // TILE_SIZE), block=(TILE_SIZE, TILE_SIZE, 1))
result = gc.get()
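A final thing to note is that this happens with both PyCUDA and Numba, but not with CUDAMat. With CUDAMat I can do a million multiplications and retrieving the final result still takes the same amount of time. CUDAMat has a built-in matrix multiplication, which may be why; for PyCUDA and Numba I am using the matrix multiplication code provided in their own documentation.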
Here is my PyCUDA code:
from __future__ import division
import numpy as np
from pycuda import driver, compiler, gpuarray, tools
import time
import pycuda.autoinit
kernel_code_template = """
__global__ void MatrixMulKernel(float *A, float *B, float *C)
{
    const int wA = %(MATRIX_SIZE)s;
    const int wB = %(MATRIX_SIZE)s;

    // Block index
    const int bx = blockIdx.x;
    const int by = blockIdx.y;

    // Thread index
    const int tx = threadIdx.x;
    const int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    const int aBegin = wA * %(BLOCK_SIZE)s * by;
    // Index of the last sub-matrix of A processed by the block
    const int aEnd = aBegin + wA - 1;
    // Step size used to iterate through the sub-matrices of A
    const int aStep = %(BLOCK_SIZE)s;

    // Index of the first sub-matrix of B processed by the block
    const int bBegin = %(BLOCK_SIZE)s * bx;
    // Step size used to iterate through the sub-matrices of B
    const int bStep = %(BLOCK_SIZE)s * wB;

    // The element of the block sub-matrix that is computed
    // by the thread
    float Csub = 0;

    // Loop over all the sub-matrices of A and B required to
    // compute the block sub-matrix
    for (int a = aBegin, b = bBegin;
         a <= aEnd;
         a += aStep, b += bStep)
    {
        // Shared memory for the sub-matrix of A
        __shared__ float As[%(BLOCK_SIZE)s][%(BLOCK_SIZE)s];
        // Shared memory for the sub-matrix of B
        __shared__ float Bs[%(BLOCK_SIZE)s][%(BLOCK_SIZE)s];

        // Load the matrices from global memory to shared memory
        // each thread loads one element of each matrix
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];

        // Synchronize to make sure the matrices are loaded
        __syncthreads();

        // Multiply the two matrices together;
        // each thread computes one element
        // of the block sub-matrix
        for (int k = 0; k < %(BLOCK_SIZE)s; ++k)
            Csub += As[ty][k] * Bs[k][tx];

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write the block sub-matrix to global memory;
    // each thread writes one element
    const int c = wB * %(BLOCK_SIZE)s * by + %(BLOCK_SIZE)s * bx;
    C[c + wB * ty + tx] = Csub;
}
"""
MATRIX_SIZE = 512
TILE_SIZE = 8
BLOCK_SIZE = TILE_SIZE
np.random.seed(100)
a_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32)
b_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32)
kernel_code = kernel_code_template % {
    'MATRIX_SIZE': MATRIX_SIZE,
    'BLOCK_SIZE': BLOCK_SIZE,
}
mod = compiler.SourceModule(kernel_code)
matrixmul = mod.get_function("MatrixMulKernel")
#copy to device memory
total = time.clock()
ga = gpuarray.to_gpu(a_cpu)
gb = gpuarray.to_gpu(b_cpu)
gc = gpuarray.empty((MATRIX_SIZE, MATRIX_SIZE), np.float32)
copy_to = time.clock() - total
#matrix multiplication
mult = time.clock()
for i in range(1000):
    matrixmul(ga, gb, gc, grid=(MATRIX_SIZE // TILE_SIZE, MATRIX_SIZE // TILE_SIZE), block=(TILE_SIZE, TILE_SIZE, 1))
mult = time.clock() - mult
#copy result back to host memory
copy_from = time.clock()
res = gc.get()
copy_from = time.clock() - copy_from
total = time.clock() - total
#print out times for all 3 steps and the total time taken
print(copy_to)
print(mult)
print(copy_from)
print(total)
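A likely explanation, offered as a sketch rather than a confirmed diagnosis: CUDA kernel launches are asynchronous, so the loop above mostly queues kernels and returns, and gc.get() then has to wait for every queued kernel to finish before it can copy. The wait gets billed to the copy step. One way to check this is to synchronize before stopping each timer. The helper below is hypothetical (time_step is not part of PyCUDA); it takes the synchronize call as a parameter, and in PyCUDA that would be driver.Context.synchronize. It uses time.perf_counter because time.clock was removed in Python 3.8.

```python
import time


def time_step(step, synchronize):
    """Time step() so that pending asynchronous work is billed correctly.

    step:        zero-argument callable that performs the work to be timed.
    synchronize: zero-argument callable that blocks until the device is idle
                 (assumption: with PyCUDA this is driver.Context.synchronize).
    Returns (result_of_step, elapsed_seconds).
    """
    synchronize()                  # drain earlier work so it is not billed here
    start = time.perf_counter()
    result = step()
    synchronize()                  # wait for work that step() only queued
    return result, time.perf_counter() - start


# Hypothetical usage with the names from the question (needs a CUDA device):
#
#   sync = driver.Context.synchronize
#
#   def run_loop():
#       for i in range(1000):
#           matrixmul(ga, gb, gc,
#                     grid=(MATRIX_SIZE // TILE_SIZE, MATRIX_SIZE // TILE_SIZE),
#                     block=(TILE_SIZE, TILE_SIZE, 1))
#
#   _, mult = time_step(run_loop, sync)
#   res, copy_from = time_step(gc.get, sync)
```

If the explanation is right, copy_from measured this way should stay roughly constant while mult absorbs the full kernel time. Reallocating gc inside the loop probably hides the effect for a similar reason: freeing device memory forces a synchronization, which moves the waiting into the loop timing.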
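I had considered something like that, but I didn't know how to search for it. That works well. If you want to post it as an answer, I'll accept it. – Frobot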
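The ceiling at around 1000 iterations mentioned in the question would be consistent with asynchronous kernel launches going into a queue of limited depth: once the queue is full, each launch blocks until a slot frees up, so the extra waiting moves into the loop and the copy-back wait stops growing. The toy model below only illustrates that idea. It does not touch the GPU, and every name and number in it (simulate_launches, kernel_ms, launch_ms, queue_depth=1000) is an assumption picked to resemble the observation, not a documented CUDA value.

```python
def simulate_launches(n_launches, kernel_ms=1.0, launch_ms=0.01, queue_depth=1000):
    """Toy model of a host queuing kernels on a device with a bounded queue.

    Returns (loop_ms, copy_wait_ms): the host time spent in the launch loop,
    and how long a blocking copy issued right after the loop would wait.
    """
    host = 0.0      # host clock in milliseconds
    finish = []     # completion time of each kernel on the device
    for i in range(n_launches):
        host += launch_ms                       # cost of queuing one launch
        if i >= queue_depth:
            # Queue is full: block until kernel i - queue_depth has completed.
            host = max(host, finish[i - queue_depth])
        # The device runs kernels one after another, in submission order.
        start = max(host, finish[-1]) if finish else host
        finish.append(start + kernel_ms)
    copy_wait = max(0.0, finish[-1] - host) if finish else 0.0
    return host, copy_wait


if __name__ == "__main__":
    for n in (100, 500, 1000, 5000, 20000):
        loop_ms, wait_ms = simulate_launches(n)
        print(n, round(loop_ms, 2), round(wait_ms, 2))
```

In this model the copy-back wait grows linearly while the loop count is below queue_depth and then levels off at about queue_depth * kernel_ms, which matches the shape described in the question.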