OpenCL start-to-end profiling time is longer than the actual duration
I wrote an OpenCL program and I execute my kernels like this:
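// One command queue, kernel and event per GPU
for (int i = 0; i < commandQueues.length; i++) {
    // The wait list holds one event (userEvent), so the wait-list count must be 1
    clEnqueueNDRangeKernel(commandQueues[i], kernel[i], 1, null,
        global_work_size, local_work_size, 1, new cl_event[]{userEvent}, events[i]);
    clFlush(commandQueues[i]);
}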
long before = System.nanoTime();
// Set userEvent to CL_COMPLETE so that all kernels can start executing
clSetUserEventStatus(userEvent, CL_COMPLETE);
// Wait until the work is finished on all command queues
clWaitForEvents(events.length, events);
long after = System.nanoTime();
float totalDurationMs = (after - before)/1e6f;
// ...profiling each event with CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END...
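For reference, the profiling step elided above usually comes down to subtracting two device timestamps. The OpenCL specification defines CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END as device time counters in nanoseconds, so a minimal sketch of the conversion could look like the following. The class name, the method name and the sample values are hypothetical and are not taken from my actual code; in the real program the two values would come from clGetEventProfilingInfo.

```java
public class ProfilingDuration {
    // Converts two device timestamps (in nanoseconds) into elapsed milliseconds.
    static double elapsedMs(long startNs, long endNs) {
        return (endNs - startNs) / 1e6;
    }

    public static void main(String[] args) {
        // Hypothetical counter values, for illustration only.
        long startNs = 1_000_000L;
        long endNs = 3_500_000L;
        System.out.println("Duration on device: " + elapsedMs(startNs, endNs) + " ms");
    }
}
```

Note that these timestamps come from the device clock, while System.nanoTime() reads the host clock, so the two measurements are taken with different timers.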
The userEvent ensures that the kernels start running at the same time. Source: [Reima's Answer]: How do I know if the kernels are executing concurrently?
And I get this result from a system with 2 Tesla K20M GPUs in it:
Total duration :37.800076ms
Duration on device 1 of 2: 38.037186
Duration on device 2 of 2: 37.85744
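To put a number on the discrepancy, here is a small sketch that subtracts the host-side total from each device-side duration. It assumes the per-device durations are in milliseconds like the total (the output above does not print a unit for them), and the class and method names are made up for illustration.

```java
public class DurationGap {
    // Difference between a device-side duration and the host-side total, in ms.
    static double gapMs(double deviceMs, double hostTotalMs) {
        return deviceMs - hostTotalMs;
    }

    public static void main(String[] args) {
        double hostTotalMs = 37.800076;           // measured with System.nanoTime()
        double[] deviceMs = {38.037186, 37.85744}; // assumed to be milliseconds
        for (int i = 0; i < deviceMs.length; i++) {
            System.out.printf("Device %d exceeds host total by %.6f ms%n",
                    i + 1, gapMs(deviceMs[i], hostTotalMs));
        }
    }
}
```

With the figures above, the gap is roughly 0.24 ms for the first device and 0.06 ms for the second.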
Can someone explain to me why the start-to-end profiling time is longer than the total duration?
Thanks