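Parallel reduction and finding the index in CUDA
I have an array of 20K values, and I am reducing it over 50 blocks with 400 threads each: num_blocks = 50, block_size = 400.
My code looks like this: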
getmax<<<num_blocks, block_size>>>(d_in, d_out1, d_indices);

__global__ void getmax(float *in1, float *out1, int *index)
{
    // Declare arrays to be in shared memory.
    __shared__ float max[threads];
    int nTotalThreads = blockDim.x; // Total number of active threads
    float temp;
    float max_val;
    int max_index;
    int arrayIndex;
    // Calculate which element this thread reads from memory
    arrayIndex = gridDim.x*blockDim.x*blockIdx.y + blockDim.x*blockIdx.x + threadIdx.x;
    max[threadIdx.x] = in1[arrayIndex];
    max_val = max[threadIdx.x];
    max_index = blockDim.x*blockIdx.x + threadIdx.x;
    __syncthreads();
    while (nTotalThreads > 1)
    {
        int halfPoint = (nTotalThreads >> 1);
        if (threadIdx.x < halfPoint)
        {
            temp = max[threadIdx.x + halfPoint];
            if (temp > max[threadIdx.x])
            {
                max[threadIdx.x] = temp;
                max_val = max[threadIdx.x];
            }
        }
        __syncthreads();
        nTotalThreads = (nTotalThreads >> 1); // divide by two.
    }
    if (threadIdx.x == 0)
    {
        out1[num_blocks*blockIdx.y + blockIdx.x] = max[threadIdx.x];
    }
    if (max[blockIdx.x] == max_val)
    {
        index[blockIdx.x] = max_index;
    }
}
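The problem here is that at some point "nTotalThreads" is not exactly a power of 2, which results in garbage values for the index. The array out1 gives me the maximum value in each block, and that part is correct and verified. The index values, however, are wrong. For example, the maximum in the first block occurs at index = 40, but the kernel reports an index of 15. Similarly, the maximum in the second block is at index 440, but the kernel reports 416.
Any suggestions?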
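One possible direction, offered as a sketch rather than a verified fix for the code above: in the posted kernel `max_index` is never updated when a larger value is copied in, so whichever thread ends up holding the maximum reports its own position, and halving an odd `nTotalThreads` (400 → ... → 25 → 12) silently drops an element. The version below keeps a second shared array of indices that moves together with the values, and rounds the half point up so odd counts are handled. The names `getmax_with_index`, `argmax_gpu`, and `MAX_THREADS` are hypothetical, the layout is assumed to be a 1D grid, and CUDA error checking is omitted for brevity.

```cuda
#include <cassert>
#include <cfloat>
#include <vector>
#include <cuda_runtime.h>

// Assumed upper bound on the block size (hypothetical constant).
constexpr int MAX_THREADS = 1024;

// Each block writes its maximum value and the global index of that value.
__global__ void getmax_with_index(const float *in, float *out_val,
                                  int *out_idx, int n)
{
    __shared__ float s_val[MAX_THREADS];
    __shared__ int   s_idx[MAX_THREADS];

    int tid = threadIdx.x;
    int gid = blockDim.x * blockIdx.x + tid;

    // Threads past the end of the input are padded with a value that cannot win
    // (assuming the data contains no NaN or -infinity).
    s_val[tid] = (gid < n) ? in[gid] : -FLT_MAX;
    s_idx[tid] = (gid < n) ? gid : -1;
    __syncthreads();

    int active = blockDim.x;
    while (active > 1)
    {
        int half = (active + 1) / 2; // round up so the middle element survives
        if (tid + half < active && s_val[tid + half] > s_val[tid])
        {
            // Move the value and its index together.
            s_val[tid] = s_val[tid + half];
            s_idx[tid] = s_idx[tid + half];
        }
        __syncthreads();
        active = half;
    }

    if (tid == 0)
    {
        out_val[blockIdx.x] = s_val[0];
        out_idx[blockIdx.x] = s_idx[0];
    }
}

// Host helper: runs the kernel, then finishes the reduction over the
// per-block results on the CPU. Returns the index of the overall maximum.
int argmax_gpu(const std::vector<float> &h_in, int block_size)
{
    assert(!h_in.empty() && block_size > 0 && block_size <= MAX_THREADS);
    int n = static_cast<int>(h_in.size());
    int num_blocks = (n + block_size - 1) / block_size;

    float *d_in = nullptr, *d_val = nullptr;
    int *d_idx = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_val, num_blocks * sizeof(float));
    cudaMalloc(&d_idx, num_blocks * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    getmax_with_index<<<num_blocks, block_size>>>(d_in, d_val, d_idx, n);

    std::vector<float> h_val(num_blocks);
    std::vector<int> h_idx(num_blocks);
    cudaMemcpy(h_val.data(), d_val, num_blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_idx.data(), d_idx, num_blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_val);
    cudaFree(d_idx);

    int best = 0;
    for (int b = 1; b < num_blocks; ++b)
        if (h_val[b] > h_val[best]) best = b;
    return h_idx[best];
}
```

With the sizes from the question, `argmax_gpu(values, 400)` launches 50 blocks over the 20,000 elements. Finishing the last 50 candidates on the host is a simplification; a second kernel launch over the per-block results would work as well.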
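Many common patterns such as parallel reduction are already implemented in highly optimized libraries for CUDA (like Thrust or CUDPP). Have you looked at those libraries for your task? – jeff7 2010-10-08 11:41:45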
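As a rough illustration of the library route mentioned in this comment, the sketch below uses Thrust's `thrust::max_element`, which returns an iterator to the largest element, so the index falls out as an iterator difference. The wrapper name `argmax_thrust` is hypothetical, and the input is assumed to start in host memory.

```cuda
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/extrema.h>

// Returns the index of the largest element, computed on the device.
// Assumes h_in is non-empty.
int argmax_thrust(const std::vector<float> &h_in)
{
    // Copy the host data to the device.
    thrust::device_vector<float> d_in(h_in.begin(), h_in.end());

    // max_element returns an iterator pointing at the largest element.
    thrust::device_vector<float>::iterator it =
        thrust::max_element(d_in.begin(), d_in.end());

    // The distance from the start is the index; *it would give the value.
    return static_cast<int>(it - d_in.begin());
}
```

This avoids hand-written shared-memory code entirely, so the block size and power-of-two concerns do not arise at the user level.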
Why 400 threads per block, if you don't mind my asking? – jmilloy 2011-02-24 15:43:54