Is synchronization required when a CUDA kernel's launch parameters depend on a previous kernel?

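I have a lot of the following pattern in my code. Basically, it amounts to a first kernel that filters a large dataset, where the selected entries returned will be very sparse, followed by a second kernel that performs a much more involved computation on the much-reduced dataset.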

It seems that the cudaStreamSynchronize is almost superfluous, but I cannot see any way around it.

Is there an alternative pattern that avoids the synchronize between the kernels?
Would CUDA dynamic parallelism help in any way?

Example code:

/* Pseudocode. Won't compile. */ 
/* Please ignore silly mistakes/syntax and inefficient/incorrect simplifications */ 

__global__ void bar(const float * dataIn, float * dataOut, unsigned int * counter_ptr) 
{ 
    < do some computation > 
    if (bConditionalComputedAboveIsTrue) 
    { 
     const unsigned int ind = atomicInc(counter_ptr, (unsigned int)(-1)); 
     dataOut[ ind ] = resultOfAboveComputation; 
    } 
} 

int foo(float * d_datain, float* d_tempbuffer, float* d_output, cudaStream_t stream ){  
    /* Initialize a counter that will be updated by the bar kernel */ 
    unsigned int * counter_ptr; 
    cudaMalloc(&counter_ptr, sizeof(unsigned int)); //< Create a Counter 
    cudaMemsetAsync(counter_ptr, 0, sizeof(unsigned int), stream); //<Initially Set the Counter to 0 
    dim3 threadsInit(16,16,1); 
    dim3 gridInit(256, 1, 1); 
    /* Launch the Filtering Kernel. This will update the value in counter_ptr*/ 
    bar<<< gridInit, threadsInit, 0, stream >>>(d_datain, d_tempbuffer, counter_ptr); 
    /* Download the count and synchronize the stream */ 
    unsigned int count; 
    cudaMemcpyAsync(&count, counter_ptr, sizeof(unsigned int), cudaMemcpyDeviceToHost, stream); 
    cudaStreamSynchronize(stream); //< Is there any way around this synchronize? 
    /* Compute the grid parameters and launch a second kernel */ 
    dim3 bazThreads(128,1,1); 
    dim3 bazGrid(count/128 + 1, 1, 1); //< Here I use the counter modified in the prior kernel to set the grid parameters 
    baz<<< bazGrid, bazThreads, 0, stream >>>(d_tempbuffer, d_output); 
    /* cleanup */ 
    cudaFree(counter_ptr); 
    return 0; /* foo is declared int, so return a status */ 
}

Source

2017-06-04 Samuel Martin

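Instead of varying the number of blocks in the second kernel, you could use a fixed block count and have the blocks adapt to the amount of work they do.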

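E.g. launch more blocks than you expect to need and have them exit early if there is no work for them. Or launch just enough blocks to fill the device and have each of them loop over the work. Grid-stride loops are a good way of doing this.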
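A minimal sketch of the grid-stride variant, under some assumptions that are not in the question: baz gets an extra counter_ptr argument so it can read the count straight from device memory, the "computation" is a placeholder multiply, and fixedGridSize and processedElements are hypothetical helpers made up for this example. Error checking is left out for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Second-stage kernel (hypothetical signature): the element count is read
// from device memory, so the host never has to download it.
__global__ void baz(const float *dataIn, float *dataOut,
                    const unsigned int *counter_ptr)
{
    const unsigned int count  = *counter_ptr;           // written by the earlier kernel
    const unsigned int stride = gridDim.x * blockDim.x; // total threads in the grid
    // Grid-stride loop: each thread handles i, i + stride, i + 2*stride, ...
    // Threads with no work simply skip the loop and exit.
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < count; i += stride)
        dataOut[i] = 2.0f * dataIn[i];                  // placeholder computation
}

// Hypothetical helper: a fixed grid size that depends only on the device,
// not on the filter result.
int fixedGridSize(int blocksPerSM)
{
    int device = 0, numSMs = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
    return numSMs * blocksPerSM;
}

// Demo driver: pretends the filter kernel produced `count` entries
// (count must not exceed capacity) and returns how many outputs were written.
unsigned int processedElements(unsigned int count)
{
    const unsigned int capacity = 1024;                 // size of the temporary buffer
    std::vector<float> h_in(capacity, 1.0f), h_out(capacity, 0.0f);
    float *d_in = nullptr, *d_out = nullptr;
    unsigned int *d_count = nullptr;
    cudaMalloc(&d_in, capacity * sizeof(float));
    cudaMalloc(&d_out, capacity * sizeof(float));
    cudaMalloc(&d_count, sizeof(unsigned int));
    cudaMemcpy(d_in, h_in.data(), capacity * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, capacity * sizeof(float));
    cudaMemcpy(d_count, &count, sizeof(unsigned int), cudaMemcpyHostToDevice);

    // No host-side knowledge of count is needed to configure this launch.
    baz<<<fixedGridSize(4), 128>>>(d_in, d_out, d_count);

    cudaMemcpy(h_out.data(), d_out, capacity * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_out); cudaFree(d_count);

    unsigned int written = 0;
    for (float v : h_out) written += (v != 0.0f);
    return written;
}

int main()
{
    printf("elements processed: %u\n", processedElements(10));
    return 0;
}
```

In the question's foo this would presumably become baz<<< fixedGridSize(4), 128, 0, stream >>>(d_tempbuffer, d_output, counter_ptr), with the cudaMemcpyAsync and cudaStreamSynchronize dropped. Stream ordering already guarantees that bar has finished writing the counter before baz starts. The counter just has to stay allocated until baz is done.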

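There is also the option of using dynamic parallelism to move the kernel launch itself (and thus the decision about the grid size) onto the device.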
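A rough sketch of the dynamic parallelism route, again with assumptions of my own: launchBaz is a hypothetical single-thread kernel that reads the counter and launches baz from the device, baz takes the count by value, the computation is a placeholder, and processedElements is only a demo driver. Dynamic parallelism needs compute capability 3.5 or higher and relocatable device code, so the build line would look something like nvcc -rdc=true -arch=sm_70 example.cu -lcudadevrt (adjust -arch to your GPU).

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Second-stage kernel (hypothetical signature): one thread per selected entry.
__global__ void baz(const float *dataIn, float *dataOut, unsigned int count)
{
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        dataOut[i] = 2.0f * dataIn[i];                  // placeholder computation
}

// Hypothetical launcher kernel, run with a single thread. It sizes the child
// grid from the counter on the device, so the host never needs the value.
__global__ void launchBaz(const float *dataIn, float *dataOut,
                          const unsigned int *counter_ptr)
{
    const unsigned int count = *counter_ptr;
    if (count == 0) return;                             // a zero-block grid is not a valid launch
    const unsigned int threads = 128;
    const unsigned int blocks  = (count + threads - 1) / threads;
    baz<<<blocks, threads>>>(dataIn, dataOut, count);   // device-side launch
}

// Demo driver: pretends the filter kernel produced `count` entries
// (count must not exceed capacity) and returns how many outputs were written.
unsigned int processedElements(unsigned int count)
{
    const unsigned int capacity = 1024;                 // size of the temporary buffer
    std::vector<float> h_in(capacity, 1.0f), h_out(capacity, 0.0f);
    float *d_in = nullptr, *d_out = nullptr;
    unsigned int *d_count = nullptr;
    cudaMalloc(&d_in, capacity * sizeof(float));
    cudaMalloc(&d_out, capacity * sizeof(float));
    cudaMalloc(&d_count, sizeof(unsigned int));
    cudaMemcpy(d_in, h_in.data(), capacity * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, capacity * sizeof(float));
    cudaMemcpy(d_count, &count, sizeof(unsigned int), cudaMemcpyHostToDevice);

    // The host launch configuration is constant: one block, one thread.
    launchBaz<<<1, 1>>>(d_in, d_out, d_count);

    // The parent grid is not complete until its child grid has finished,
    // so this copy sees the results written by baz.
    cudaMemcpy(h_out.data(), d_out, capacity * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_out); cudaFree(d_count);

    unsigned int written = 0;
    for (float v : h_out) written += (v != 0.0f);
    return written;
}

int main()
{
    printf("elements processed: %u\n", processedElements(300));
    return 0;
}
```

Whether this beats the fixed-grid approach probably depends on the workload: device-side launches have their own overhead, so it is worth measuring both.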

Source

2017-06-04 19:38:48 tera
