Understanding the CUDA SHFL instruction

I have read the Shuffle Tips and Tricks paper, but I don't understand exactly how to apply it to some tricky code I inherited:
```cuda
extern __shared__ unsigned int lpSharedMem[];

int tid = threadIdx.x;
lpSharedMem[tid] = startValue;
volatile unsigned int *srt = lpSharedMem;

// ...various stuff

srt[tid] = min(srt[tid], srt[tid+32]);
srt[tid] = min(srt[tid], srt[tid+16]);
srt[tid] = min(srt[tid], srt[tid+8]);
srt[tid] = min(srt[tid], srt[tid+4]);
srt[tid] = min(srt[tid], srt[tid+2]);
srt[tid] = min(srt[tid], srt[tid+1]);
__syncthreads();
```
Even setting the CUDA specifics aside, this code is hazy to me, but looking at this implementation I found:
```cuda
__device__ inline int min_warp(int val) {
    val = min(val, __shfl_xor(val, 16));
    val = min(val, __shfl_xor(val, 8));
    val = min(val, __shfl_xor(val, 4));
    val = min(val, __shfl_xor(val, 2));
    val = min(val, __shfl_xor(val, 1));
    return __shfl(val, 0);
}
```
This code would presumably be called as:

```cuda
int minVal = min_warp(startValue);
```
So, can I replace my rather awkward volatile shared-memory code with the version above? I can't really follow what is going on, though; could someone confirm whether I'm right, and explain what exactly happens inside min_warp()?
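Update: if the shuffle version is indeed the right replacement, I assume it would need the CUDA 9+ *_sync intrinsics on current toolkits. My untested adaptation (the leading 0xffffffff is the full-warp participation mask, so this assumes all 32 lanes are active):

```cuda
// Untested adaptation of min_warp() to the CUDA 9+ sync shuffle
// intrinsics; assumes all 32 lanes of the warp participate.
__device__ inline int min_warp_sync(int val) {
    for (int mask = 16; mask >= 1; mask >>= 1)
        val = min(val, __shfl_xor_sync(0xffffffff, val, mask));
    return val;  // after the butterfly, every lane already holds the minimum
}
```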
Take a look at this: https://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/ – Hopobcn