Simple addition example: shared memory version of reduction executes slower than global memory version

I have implemented two versions of add. The concept of the addition is identical in both; the only difference is that the first code below uses global memory, while the second uses shared memory. As mentioned in several places, the shared memory version is supposed to be faster, but in my case the global memory version is faster. Please tell me where I am going wrong. Note: I have a GPU of compute capability 2.1, so shared memory has 32 banks. Since I only use 16 integers in the example, my code should not have any bank conflicts. Please let me know whether this is correct.
Global memory version:
#include <stdio.h>

__global__ void reductionGlobal(int* in, int sizeArray, int offset) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < sizeArray) {
        if (tid % (offset * 2) == 0) {
            in[tid] += in[tid + offset];
        }
    }
}
int main() {
    int size = 16; // size of the input array
    int cidata[] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
    int* gidata;
    cudaMalloc((void**)&gidata, size * sizeof(int));
    cudaMemcpy(gidata, cidata, size * sizeof(int), cudaMemcpyHostToDevice);
    int offset = 1;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    while (offset < size) {
        // use kernel launches to synchronize between different blocks;
        // __syncthreads() only synchronizes within a block
        reductionGlobal<<<4, 4>>>(gidata, size, offset);
        offset *= 2;
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop);
    printf("time is %f ms\n", elapsedTime);
    int* output = (int*)malloc(size * sizeof(int));
    cudaMemcpy(output, gidata, size * sizeof(int), cudaMemcpyDeviceToHost);
    printf("The sum of the array using only global memory is %d\n", output[0]);
    cudaFree(gidata); // release device and host buffers
    free(output);
    getchar();
    return 0;
}
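
None of the CUDA runtime calls above are checked for errors, so a failed copy or launch would go unnoticed and the measured time would be meaningless. A minimal error-checking sketch (the CHECK macro is my own naming, not part of the original code):

```cuda
#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error %s at %s:%d\n",           \
                    cudaGetErrorString(err), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Usage: wrap every runtime call, and check kernel launches explicitly:
//   CHECK(cudaMemcpy(gidata, cidata, size * sizeof(int), cudaMemcpyHostToDevice));
//   reductionGlobal<<<4, 4>>>(gidata, size, offset);
//   CHECK(cudaGetLastError());        // launch-configuration errors
//   CHECK(cudaDeviceSynchronize());   // errors raised during execution
```

Checking `cudaGetLastError()` after each launch would, for example, immediately reveal an invalid dynamic shared memory size rather than silently producing a wrong sum.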
Shared memory version:
#include <stdio.h>

__global__ void computeAddShared(int *in, int *out, int sizeInput) {
    extern __shared__ int temp[]; // must be int: the buffer holds int partial sums
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int ltid = threadIdx.x;
    temp[ltid] = 0;
    while (tid < sizeInput) {
        temp[ltid] += in[tid];
        tid += gridDim.x * blockDim.x; // to handle an array of any size
    }
    __syncthreads();
    int offset = 1;
    while (offset < blockDim.x) {
        if (ltid % (offset * 2) == 0) {
            temp[ltid] = temp[ltid] + temp[ltid + offset];
        }
        __syncthreads();
        offset *= 2;
    }
    if (ltid == 0) {
        out[blockIdx.x] = temp[0];
    }
}
int main() {
    int size = 16; // size of present input array; changes after every loop iteration
    int cidata[] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
    int* gidata;
    int* godata;
    cudaMalloc((void**)&gidata, size * sizeof(int));
    cudaMemcpy(gidata, cidata, size * sizeof(int), cudaMemcpyHostToDevice);
    int TPB = 4;
    int blocks = 10; // to get things kicked off
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    while (blocks != 1) {
        if (size < TPB) {
            TPB = size; // size is a power of two
        }
        blocks = (size + TPB - 1) / TPB;
        cudaMalloc((void**)&godata, blocks * sizeof(int));
        // the dynamic shared memory size is given in bytes, not elements
        computeAddShared<<<blocks, TPB, TPB * sizeof(int)>>>(gidata, godata, size);
        cudaFree(gidata);
        gidata = godata;
        size = blocks;
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop);
    printf("time is %f ms\n", elapsedTime);
    int *output = (int*)malloc(sizeof(int));
    cudaMemcpy(output, gidata, sizeof(int), cudaMemcpyDeviceToHost);
    // godata and gidata point to the same buffer, so free it only once
    cudaFree(gidata);
    printf("The sum of the array is %d\n", output[0]);
    free(output);
    getchar();
    return 0;
}
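
Two things about this benchmark are worth noting. First, with only 16 integers the total runtime is dominated by kernel-launch and allocation overhead, so neither version really measures memory behavior. Second, the `ltid % (offset * 2) == 0` test leaves the active threads scattered across each warp, causing divergence at every step. NVIDIA's reduction examples recommend sequential addressing instead, where the stride is halved each step and active threads stay contiguous. A sketch of that variant (illustrative, not the original poster's kernel):

```cuda
// Block-level reduction with sequential addressing. With `if (ltid < offset)`
// the active threads are contiguous, so whole warps retire early instead of
// diverging on a modulo test.
__global__ void computeAddSharedSeq(int *in, int *out, int sizeInput) {
    extern __shared__ int temp[];
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int ltid = threadIdx.x;
    temp[ltid] = 0;
    while (tid < sizeInput) {
        temp[ltid] += in[tid];
        tid += gridDim.x * blockDim.x; // grid-stride loop over the input
    }
    __syncthreads();
    // Halve the stride each step instead of doubling it.
    for (int offset = blockDim.x / 2; offset > 0; offset >>= 1) {
        if (ltid < offset) {
            temp[ltid] += temp[ltid + offset];
        }
        __syncthreads();
    }
    if (ltid == 0) {
        out[blockIdx.x] = temp[0];
    }
}
```

This assumes `blockDim.x` is a power of two, which holds for the launch configuration used above.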
I have also been chasing the fastest performance and have played with many approaches: global, page-locked global, texture, shared, constant, and registers... Global memory is my favorite. For dot product I can hit 4 teraFlops on a single ASUS GTX260 216 Matrix edition. You need to design your kernels so that memory accesses are coalesced; coalesced global memory access is the fastest. – Prafulla
The cache hierarchy could simply be working well. Try configuring 16 KB of L1 and 48 KB of shared memory for a second run. You could also disable the L1 cache and compare the results. – pQB
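
On Fermi (cc 2.x) devices the L1/shared split that pQB mentions can be selected from the host at runtime; a minimal sketch of the relevant calls:

```cuda
// Per-kernel preference: 48 KB shared / 16 KB L1 for computeAddShared.
cudaFuncSetCacheConfig(computeAddShared, cudaFuncCachePreferShared);

// Or a device-wide preference: 48 KB L1 / 16 KB shared.
cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);

// Global-memory loads can also bypass L1 entirely at compile time with:
//   nvcc -Xptxas -dlcm=cg ...
```

These preferences are hints; the driver may ignore them if a kernel's shared memory requirement conflicts with the requested split.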
@pQB: How do I disable the L1 cache? – Programmer