使用共享内存在cuda内核中应用高斯掩模

-1

我正在尝试完成udacity“并行编程介绍”课程的作业，并且我被困在第二个任务中，它基本上将高斯模糊蒙版应用于使用CUDA的图像。我想通过利用共享内存来有效地完成此操作。我的想法是解决“在边界问题的像素”问题，以便启动比块中像素的实际数量多的线程：例如，如果我将输入图像分成16x16大小的活动像素块和I有一个9x9大小的面具，那么我的实际块尺寸将为（对于x和y）：16 + 2 *（9/2）= 24。这样，我在一个块中启动24个线程，以便“ “线程将仅用于将像素从输入img加载到共享内存，而”内部“线程则对应于实际执行计算的活动像素（另外还会在共享内存中进行缓存）。使用共享内存在cuda内核中应用高斯掩模

由于某种原因，它不起作用。从附加代码中可以看到，我可以将像素缓存到共享内存中，但是在计算过程中出现了一些错误，并且附上我得到的糟糕结果的图像。

   __global__ void gaussian_blur(const unsigned char* const inputChannel, 
       unsigned char* const outputChannel, 
       int numRows, int numCols, 
       const float* const filter, const int filterWidth) 
       { 

int filter_radius = (int)(filterWidth/2); //getting the filter "radius" 

int x = blockDim.x*blockIdx.x+threadIdx.x; 
int y = blockDim.y*blockIdx.y+threadIdx.y; 

if(x>=(numCols+filter_radius) || y>=(numRows+filter_radius)) 
    return; 

int px = x-filter_radius; 
int py = y-filter_radius; 

//clamping 

if(px<0) px = 0; 
if(py<0) py = 0; 
//if(px>=numCols) px = numCols-1; 
// if(py>=numRows) py = numRows-1; 

__shared__ unsigned char tile[(16+8)*(16+8)]; //16 active pixels + 2*filter_radius 

tile[threadIdx.y*24+threadIdx.x] = inputChannel[py*numCols+px]; 

__syncthreads(); 

//Here everything is working fine: if I do 
// outputChannel[py*numCols+px] = tile[threadIdx.y*24+threadIdx.x]; 
//then I am able to see the perfect reconstruction of the input image. 

//caching the filter 
__shared__ float t_filter[81]; //9x9 conv mask 

if(threadIdx.x==0 && threadIdx.y==0) 
{ 
    for(int i=0; i<81; i++) 
     t_filter[i] = filter[i]; 
} 

__syncthreads(); 


//I am checking the threadIdx of the threads and I am performing the mask computation 
//only to those threads that are pointing to active pixels: 
//i.e. all the threads whose id is greater or equal to the filter radius, 
//but smaller than the whole block of active pixels will perform the computation. 
//filter_radius = filterWidth/2 = 9/2 = 4 
//blockDim.x or y = 16 + filterWidth*2 = 16+8 = 24 
//active pixel index limit = filter_radius+16 = 4+16 = 20 
//is that correct? 


if( 
    threadIdx.y>=filter_radius && threadIdx.x>=filter_radius && 
    threadIdx.x < 20 && threadIdx.y < 20 
) 
{ 

    float value = 0.0; 

    for(int i=-filter_radius; i<=filter_radius; i++) 
     for(int j=-filter_radius; j<=filter_radius; j++) 
     { 
      int fx = i+filter_radius; 
      int fy = j+filter_radius; 

      int ty = threadIdx.y+i; 
      int tx = threadIdx.x+j; 

      value += ((float)tile[ty*24+tx])*t_filter[fy*filterWidth+fx]; 
     } 
    outputChannel[py*numCols+px] = (unsigned char) value; 
}

输出图像：http://i.stack.imgur.com/EMu5M.png

编辑：添加内核调用：

int filter_radius = (int) (filterWidth/2); 
    blockSize.x = 16 + 2*filter_radius; 
    blockSize.y = 16 + 2*filter_radius; 
    gridSize.x = numCols/16+1; 
    gridSize.y = numRows/16+1; 

    printf("\n grx %d gry %d \n", blockSize.x, blockSize.y); 

    gaussian_blur<<<gridSize, blockSize>>>(d_red, d_redBlurred, numRows,numCols, d_filter, filterWidth); 
    gaussian_blur<<<gridSize, blockSize>>>(d_green, d_greenBlurred, numRows,numCols, d_filter, filterWidth); 
    gaussian_blur<<<gridSize, blockSize>>>(d_blue, d_blueBlurred, numRows,numCols, d_filter, filterWidth); 

    cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError()); 

    blockSize.x = 32; gridSize.x = numCols/32+1; 
    blockSize.y = 32; gridSize.y = numRows/32+1; 

    // Now we recombine your results. We take care of launching this kernel for you. 
    // 
    // NOTE: This kernel launch depends on the gridSize and blockSize variables, 
    // which you must set yourself. 
    recombineChannels<<<gridSize, blockSize>>>(d_redBlurred, 
              d_greenBlurred, 
              d_blueBlurred, 
              d_outputImageRGBA, 
              numRows, 
              numCols); 
    cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());

编辑二：

所有其他必要的，以便编译和运行可以在这里找到代码： https://github.com/udacity/cs344/tree/master/Problem%20Sets/Problem%20Set%202 以上内核应该在student_func.cu文件中编码。

来源

2015-10-05 alef0

从[这里]（http://stackoverflow.com/help/on-topic）：“求调试帮助（问题：”为什么不是这个代码工作？“）必须包括理想的行为，特定的问题或错误，以及在问题本身中重现问题所需的最短代码，没有明确问题陈述的问题对其他读者无益，参见：[如何创建最小，完整且可验证的示例（MCVE）]（http://stackoverflow.com/help/mcve）“。 CUDA内核本身不是MCVE。最好，你的MCVE应该是独立的，并且不应该要求OpenCV或其他框架，或独立的数据文件。 –

对不起，因为我在这里浏览了一些CUDA的问题，没有一个显示整个事情。其中一些可能会显示内核调用本身，但我很确定在处理图像时，它们都没有提供自己的函数来读取和输出图像文件，因此避免使用OpenCV或其他框架。我正在添加内核调用并发布到编译所需的其他文件的链接。我认为这应该足够了。至于这个代码应该做什么，我认为这是很好解释。 – alef0

在您的实现中，每个块永远不会计算边界（在边缘的一个滤镜半径内）像素的模糊。这意味着你希望你的块重叠，以便覆盖边界。如果你看一下x指数的网域，每个块

int x = blockDim.x*blockIdx.x+threadIdx.x;

给特定内核执行上面我们会有

blockIdx.x = 0: x = [0,23] 
blockIdx.x = 1: x = [24,46] 
... etc

正如你可以看到每块会考虑你的形象的一个独特之处，但是你已经告诉每个块不要在边界上计算。这意味着每个块的边界从计算中被忽略（因此图像中的黑色网格）。

你需要的东西来计算你的指数一样

int x = (blockDim.x-2*filter_radius)*blockIdx.x+threadIdx.x;

使块重叠。现在，我们对我们的x指数域看起来像

blockIdx.x = 0: x = [0,23] 
blockIdx.x = 1: x = [16,39] 
... etc

来源

2015-10-06 00:16:15 inJeans

使用共享内存在cuda内核中应用高斯掩模

回答

相关问题