
CUDA memory allocation and access problem

I am trying to learn CUDA. I have some basic experience with MPI, so I figured I would start with some very simple vector operations. I am attempting to write a parallelized dot product. Either I am failing to allocate/write memory on the CUDA device, or I am not getting it back to the host correctly (cudaMemcpy()).

    /* Code for a CUDA test project doing a basic dot product with doubles */
    #include <stdio.h>
    #include <cuda.h>

    __global__ void GPU_parallelDotProduct(double *array_a, double *array_b, double *dot){
      dot[0] += array_a[threadIdx.x] * array_b[threadIdx.x];
    }

    __global__ void GPU_parallelSetupVector(double *vector, int dim, int incrSize, int start){
      if(threadIdx.x < dim){
        vector[threadIdx.x] = start + threadIdx.x * incrSize;
      }
    }

    __host__ void CPU_serialDot(double *first, double *second, double *dot, int dim){
      for(int i=0; i<dim; ++i){
        dot[0] += first[i] * second[i];
      }
    }

    __host__ void CPU_serialSetupVector(double *vector, int dim, int incrSize, int start){
      for(int i=0; i<dim; ++i){
        vector[i] = start + i * incrSize;
      }
    }

    int main(){
      //define array size to be used
      //int i,j;
      int VECTOR_LENGTH = 8;
      int ELEMENT_SIZE  = sizeof(double);

      //arrays for dot product
      //host
      double *array_a = (double*) malloc(VECTOR_LENGTH * ELEMENT_SIZE);
      double *array_b = (double*) malloc(VECTOR_LENGTH * ELEMENT_SIZE);
      double *dev_dot_product = (double*) malloc(ELEMENT_SIZE);
      double host_dot_product = 0.0;

      //fill with values
      CPU_serialSetupVector(array_a, VECTOR_LENGTH, 1, 0);
      CPU_serialSetupVector(array_b, VECTOR_LENGTH, 1, 0);
      //host dot
      CPU_serialDot(array_a, array_b, &host_dot_product, VECTOR_LENGTH);

      //device
      double *dev_array_a;
      double *dev_array_b;
      double *dev_dot;

      //allocate cuda memory
      cudaMalloc((void**)&dev_array_a, ELEMENT_SIZE * VECTOR_LENGTH);
      cudaMalloc((void**)&dev_array_b, ELEMENT_SIZE * VECTOR_LENGTH);
      cudaMalloc((void**)&dev_dot,     ELEMENT_SIZE);

      //copy from host to device
      cudaMemcpy(dev_array_a, array_a, ELEMENT_SIZE * VECTOR_LENGTH, cudaMemcpyHostToDevice);
      cudaMemcpy(dev_array_b, array_b, ELEMENT_SIZE * VECTOR_LENGTH, cudaMemcpyHostToDevice);
      cudaMemcpy(dev_dot, &dev_dot_product, ELEMENT_SIZE, cudaMemcpyHostToDevice);

      //init vectors
      //GPU_parallelSetupVector<<<1, VECTOR_LENGTH>>>(dev_array_a, VECTOR_LENGTH, 1, 0);
      //GPU_parallelSetupVector<<<1, VECTOR_LENGTH>>>(dev_array_b, VECTOR_LENGTH, 1, 0);
      //GPU_parallelSetupVector<<<1, 1>>>(dev_dot, VECTOR_LENGTH, 0, 0);
      //perform CUDA dot product
      GPU_parallelDotProduct<<<1, VECTOR_LENGTH>>>(dev_array_a, dev_array_b, dev_dot);

      //get computed product back to the machine
      cudaMemcpy(dev_dot, dev_dot_product, ELEMENT_SIZE, cudaMemcpyDeviceToHost);

      FILE *output = fopen("test_dotProduct_1.txt", "w");
      fprintf(output, "HOST CALCULATION: %f \n", host_dot_product);
      fprintf(output, "DEV CALCULATION: %f \n", dev_dot_product[0]);
      fprintf(output, "PRINTING DEV ARRAY VALS: ARRAY A\n");
      for(int i=0; i<VECTOR_LENGTH; ++i){
        fprintf(output, "value %i: %f\n", i, dev_array_a[i]);
      }

      free(array_a);
      free(array_b);
      cudaFree(dev_array_a);
      cudaFree(dev_array_b);
      cudaFree(dev_dot);

      return(0);
    }

Here is an example of the output:

    HOST CALCULATION: 140.000000 
    DEV CALCULATION: 0.000000 
    PRINTING DEV ARRAY VALS: ARRAY A 
    value 0: -0.000000 
    value 1: 387096841637590350000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.000000 
    value 2: -9188929998371095800000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.000000 
    value 3: 242247762331550610000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.000000 
    value 4: -5628111589595087500000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.000000 
    value 5: 395077289052074410000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.000000 
    value 6: 0.000000 
    value 7: -13925691551991564000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.000000 

Answers


It is a good idea to check the status of CUDA runtime calls like cudaMalloc, cudaMemcpy, and kernel launches. You can do the following after each such call, or wrap it in a macro of some sort and wrap your CUDA runtime calls in the macro.

    if (cudaSuccess != cudaGetLastError())
      printf("Error!\n");

Now, I am not sure if this is your problem, but doing this may make it obvious.
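
As an illustration only (not part of the original answer), a minimal sketch of such a macro could look like the following; the name CUDA_CHECK and the reporting format are invented for this example:

    // Hypothetical helper: wraps a CUDA runtime call, checks the returned
    // status, and prints the error string on failure.
    #define CUDA_CHECK(call)                                              \
      do {                                                                \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
          fprintf(stderr, "CUDA error at %s:%d: %s\n",                    \
                  __FILE__, __LINE__, cudaGetErrorString(err));           \
        }                                                                 \
      } while (0)

    // Example usage with the allocations and copies from the question:
    CUDA_CHECK(cudaMalloc((void**)&dev_array_a, ELEMENT_SIZE * VECTOR_LENGTH));
    CUDA_CHECK(cudaMemcpy(dev_array_a, array_a, ELEMENT_SIZE * VECTOR_LENGTH, cudaMemcpyHostToDevice));

Kernel launches do not return a status themselves, so for those the cudaGetLastError() check shown above is still the right tool.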


I implemented the code you posted. It reports an error on every CUDA call. Am I missing something in setting up CUDA or my card? – Joe 2012-01-18 22:48:34


Which version of the CUDA driver and compiler are you using? Get the latest from http://developer.nvidia.com/cuda-downloads – keveman 2012-01-18 23:35:41


There are two problems I can see:

  1. Your GPU dot product contains a memory race here:

    dot[0] += array_a[threadIdx.x] * array_b[threadIdx.x]; 
    

     This is unsafe - every thread in the block will try to write/overwrite the same memory location with its result. The programming model makes no guarantees about what happens when multiple threads attempt to write different values to the same piece of memory. One possible race-free restructuring is sketched after this answer.

  2. When you print out the vector, your code is trying to access a device memory location directly from the host. I am surprised that code doesn't produce a segfault or protection fault. dev_array_a cannot be accessed directly by the host; it is a pointer in GPU memory. If you want to inspect the contents of dev_array_a, you must first copy it from the device to a valid host location with a device-to-host cudaMemcpy.

The suggestion in the other answer about error checking is also a very good point. Every API call returns a status; you should check the status of all the calls you make to confirm that no errors or faults occur at runtime.
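
To make the first point concrete, one possible way to restructure the kernel is to have each thread write only its own partial product into shared memory, reduce within the block, and let a single thread write the final value. This is a sketch only, assuming a single block whose size is a power of two (matching the <<<1, 8>>> launch in the question); the kernel name is invented for illustration:

    // Hypothetical race-free variant of the question's kernel (a sketch).
    __global__ void GPU_parallelDotProductSafe(double *array_a, double *array_b, double *dot){
      extern __shared__ double partial[];           // one slot per thread
      int tid = threadIdx.x;
      partial[tid] = array_a[tid] * array_b[tid];   // each thread writes its own slot
      __syncthreads();

      // Tree reduction: halve the number of active threads each step.
      for(int stride = blockDim.x / 2; stride > 0; stride /= 2){
        if(tid < stride){
          partial[tid] += partial[tid + stride];
        }
        __syncthreads();
      }

      // Exactly one thread writes the result, so there is no write race.
      if(tid == 0){
        dot[0] = partial[0];
      }
    }

    // Launch with dynamic shared memory sized to one double per thread, then copy
    // the result (and any device array you want to print) back to host memory
    // before reading it on the host, which also addresses the second point:
    //   GPU_parallelDotProductSafe<<<1, VECTOR_LENGTH, VECTOR_LENGTH * ELEMENT_SIZE>>>(dev_array_a, dev_array_b, dev_dot);
    //   cudaMemcpy(dev_dot_product, dev_dot, ELEMENT_SIZE, cudaMemcpyDeviceToHost);
    //   cudaMemcpy(array_a, dev_array_a, ELEMENT_SIZE * VECTOR_LENGTH, cudaMemcpyDeviceToHost);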


Getting the latest version is always a good idea. Yes, I realize that now; I have to be more careful. I was thinking this might be the problem. Is there something like MPI_Reduce() for CUDA? Or is it better to write each value into a third array and then condense that third array? Now I wonder whether that would even be faster, since I am back to linear time. – Joe 2012-01-18 05:01:38


The SDK contains a very useful reduction example and a white paper worth reading. Alternatively, the Thrust template library, which ships with recent versions of the CUDA toolkit, has a C++ implementation of parallel reduction that works with an STL-like vector class; it hides most of the device memory management and would reduce your example to roughly a dozen lines of code. – talonmies 2012-01-18 05:07:09
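
For reference, a Thrust version of the whole example might look roughly like the sketch below. It assumes a CUDA toolkit that ships with Thrust; thrust::device_vector, thrust::sequence, and thrust::inner_product are standard Thrust components, while the variable names are invented here:

    #include <thrust/device_vector.h>
    #include <thrust/sequence.h>
    #include <thrust/inner_product.h>
    #include <cstdio>

    int main(){
      const int N = 8;

      // Device vectors filled with 0, 1, ..., N-1, matching the question's setup.
      thrust::device_vector<double> a(N), b(N);
      thrust::sequence(a.begin(), a.end());
      thrust::sequence(b.begin(), b.end());

      // The parallel reduction runs on the device; device memory management is
      // handled by device_vector, and the result comes back as a plain double.
      double dot = thrust::inner_product(a.begin(), a.end(), b.begin(), 0.0);
      printf("DEV CALCULATION: %f\n", dot);
      return 0;
    }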