为什么我的程序在我的CPU设备上运行得比在我的GPU设备上快得多？

我对OpenCL相当陌生。我在大学时学到了一点点，有一点我的意思是我的图形教授在一天中教会了我们关于GPGPU和OpenCL的知识（而其他人则专注于着色器和OpenGL等等）。为什么我的程序在我的CPU设备上运行得比在我的GPU设备上快得多？

我拿了一个示例程序，并将其改为与我希望它运行的计算一起工作。但是，我的程序在CPU上运行速度比我的GPU快得多，我试图理解为什么。

我的程序需要一个输入浮点数组，并有两个输出数组。在单线程情况下，它有三个参数。输入数组的大小为：samplesPerTrace tracesIn sizeof（float），并且输出数组的大小为：samplesPerTrace tracesOut sizeof（float）。

我的测试用例一直使用参数25000 2500 250，因为平均而言，我将使用的数组大小（可能略高于平均值）。这些值是随机填写的。

这是OpenCL在内核上构建和运行的源代码;

const char* M_AND_S_OPENCL_SOURCE_TEXT = 
"__kernel void sumAllCL(__global const float prestackTraces[],\n" 
" __global float stackTracesOut[],\n" 
" __global float powerTracesOut[], const unsigned int nTracesOut, const unsigned int nTracesIn,\n" 
" const unsigned int samplesPerTrace) {\n" 
"\n" 
" unsigned int k = get_global_id(0);\n" // Thread ID 
"\n" 
" unsigned int kTimesIn = k * nTracesIn;\n" // Store repeat ints 
" unsigned int kTimesSamples = k * samplesPerTrace;\n" 
"\n" 
" for (int j = 0; j < ?  ; j++) {\n" // ? position to be replaced (nTracesOut)" 
"\n" 
" int jTimesSamplesPT = j * samplesPerTrace;\n" 
"\n" 
" for (int i = 0; i < #  ; i++) {\n" // # position to be replaced() 
"\n" 
"  int valueIndex = i + jTimesSamplesPT;\n" 
"  float value = prestackTraces[valueIndex];\n" 
"\n" 
"  stackTracesOut[i + kTimesSamples] += value;\n" 
"  powerTracesOut[i + kTimesSamples] += (value * value);\n" 
"\n" 
" }\n" 
" }\n" 
"}\n";

请注意，和＃在运行时用固定数字替换，我这样做是因为我认为它会帮助编译器展开rl

使用上述参数（25000 2500 250〜10 < 1或2>）， CPU约0.6秒完成程序和我的GPU约40秒完成。这是一个更大的差异。 Fyi，我一直在搞第四个参数来看看哪个值运行得更快，这就是〜10的含义。

我的显卡是微星Radeon R9 390X 8GB，名字叫夏威夷。当我有OpenCL的打印输出约我的两个设备的信息，这是我所得到的：

OpenCL Platform 0: AMD Accelerated Parallel Processing 
----- OpenCL Device # 0: Hawaii----- 
Gflops: 47.520000 
Max Clock Frequency: 1080 
Max Compute Units: 44 
Max Work Group Size: 256 
    MEMORY... 
Total Memory of Device: 8.000G (CL_DEVICE_GLOBAL_MEM_SIZE) 
Local Memory of Device: 32.000K (CL_DEVICE_LOCAL_MEM_SIZE) 
Max Memory Object Allocation: 3.999G (CL_DEVICE_MAX_MEM_ALLOC_SIZE) 
Cache Size: 16.000K (CL_DEVICE_GLOBAL_MEM_CACHE_SIZE) 
Cacheline Size: 64 bytes (CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE) 
    VERSIONS... 
Device Vendor: Advanced Micro Devices, Inc. 
Device Version: OpenCL 2.0 AMD-APP (2117.13) 
Driver Version: 2117.13 (VM) 
Device OpenCL Version: OpenCL C 2.0 
----- OpenCL Device # 1: Intel(R) Core(TM) i7-6700K CPU ? 4.00GHz----- 
Gflops: 32.064000 
Max Clock Frequency: 4008 
Max Compute Units: 8 
Max Work Group Size: 1024 
    MEMORY... 
Total Memory of Device: 15.967G (CL_DEVICE_GLOBAL_MEM_SIZE) 
Local Memory of Device: 32.000K (CL_DEVICE_LOCAL_MEM_SIZE) 
Max Memory Object Allocation: 3.1028G (CL_DEVICE_MAX_MEM_ALLOC_SIZE) 
Cache Size: 32.000K (CL_DEVICE_GLOBAL_MEM_CACHE_SIZE) 
Cacheline Size: 64 bytes (CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE) 
    VERSIONS... 
Device Vendor: GenuineIntel 
Device Version: OpenCL 1.2 AMD-APP (2117.13) 
Driver Version: 2117.13 (sse2,avx) 
Device OpenCL Version: OpenCL C 1.2

下面是相关的OpenCL代码。我会发布一个完整的最小可验证的完整示例，但它使我超过了字符数限制。

/* 
* Prints the given int (numToInsert) at location inside chars. 
*/ 
void PrintIntInStr(char* chars, int location, int numToInsert) { 

    std::stringstream strs; 
    strs << numToInsert; 
    std::string temp_str = strs.str(); 
    char const* numToChars = temp_str.c_str(); 

    int numberLength = strlen(numToChars); 

    int w; 
    for (w = 0; w < numberLength; w++) { 
    chars[location + w] = numToChars[w]; 
    } 
} 

/* 
* Initialize fastest OpenCL device. 
*/ 
int InitOpenCL(int verbose, cl_int deviceType) { 

    cl_uint Nplat; 
    cl_int err; 
    char name[1024]; 
    int MaxGflops = -1; 

    cl_platform_id winnerPlatform = 0; 

    // Reset (TODO) 
    _deviceID = NULL; 
    _context = NULL; 
    _queue = NULL; 

    // Get platforms 
    cl_platform_id platforms[4]; 
    if (clGetPlatformIDs(4, platforms, &Nplat)) Fatal("Cannot get number of OpenCL platforms\n"); 
    else if (Nplat<1) Fatal("No OpenCL platforms found\n"); 

    // Loop over platforms 
    for (unsigned int platform = 0; platform < Nplat; platform++) { 

    if (clGetPlatformInfo(platforms[platform], CL_PLATFORM_NAME, sizeof(name), name, NULL)) Fatal("Cannot get OpenCL platform name\n"); 
    if (verbose) printf("OpenCL Platform %d: %s\n", platform, name); 

    // Get GPU device IDs 
    cl_uint Ndev; 
    cl_device_id id[4]; 
    if (clGetDeviceIDs(platforms[platform], deviceType, 4, id, &Ndev)) 
     Fatal("Cannot get number of OpenCL devices: %d\n", platform); 
    else if (Ndev < 1) Fatal("No OpenCL devices found.\n"); 

    // Find the fastest device 
    for (unsigned int devId = 0; devId < Ndev; devId++) { 

     // Print informatio about the device 
     cl_uint compUnits, freq, cacheLineSize; 
     cl_ulong memSize, maxAlloc, localMemSize, globalCacheSize; 
     size_t maxWorkGrps; 
     char deviceVendor[50]; 
     char deviceVersion[50]; 
     char driverVersion[50]; 
     char deviceOpenCLVersion[50]; 

     // Computing Power... 
     if (clGetDeviceInfo(id[devId], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(compUnits), &compUnits, NULL)) Fatal("Cannot get OpenCL device units\n"); 
     if (clGetDeviceInfo(id[devId], CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(freq), &freq, NULL)) Fatal("Cannot get OpenCL device frequency\n"); 
     if (clGetDeviceInfo(id[devId], CL_DEVICE_NAME, sizeof(name), name, NULL)) Fatal("Cannot get OpenCL device name\n"); 
     if (clGetDeviceInfo(id[devId], CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(maxWorkGrps), &maxWorkGrps, NULL)) Fatal("Cannot get OpenCL max work group size\n"); 
     // Memory... 
     if (clGetDeviceInfo(id[devId], CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(memSize), &memSize, NULL)) Fatal("Cannot get OpenCL memory size.\n"); 
     if (clGetDeviceInfo(id[devId], CL_DEVICE_LOCAL_MEM_SIZE, sizeof(localMemSize), &localMemSize, NULL)) localMemSize = 0; 
     if (clGetDeviceInfo(id[devId], CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof(maxAlloc), &maxAlloc, NULL)) Fatal("Cannot get OpenCL memory size.\n"); 
     if (clGetDeviceInfo(id[devId], CL_DEVICE_GLOBAL_MEM_CACHE_SIZE, sizeof(globalCacheSize), &globalCacheSize, NULL)) globalCacheSize = 0; 
     if (clGetDeviceInfo(id[devId], CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE, sizeof(cacheLineSize), &cacheLineSize, NULL)) cacheLineSize = 0; 
     // Versions... 
     clGetDeviceInfo(id[devId], CL_DEVICE_VENDOR, sizeof(deviceVendor), deviceVendor, NULL); 
     clGetDeviceInfo(id[devId], CL_DEVICE_VERSION, sizeof(deviceVersion), deviceVersion, NULL); 
     clGetDeviceInfo(id[devId], CL_DRIVER_VERSION, sizeof(driverVersion), driverVersion, NULL); 
     clGetDeviceInfo(id[devId], CL_DEVICE_OPENCL_C_VERSION, sizeof(deviceOpenCLVersion), deviceOpenCLVersion, NULL); 

     int Gflops = compUnits * freq; 

     if (verbose) printf(" ----- OpenCL Device # %d: %s-----\n" 
     "Gflops: %f\n" 
     "Max Clock Frequency: %d\n" 
     "Max Compute Units: %d\n" 
     "Max Work Group Size: %zu\n" 
     " MEMORY...\n" 
     "Total Memory of Device: %s (CL_DEVICE_GLOBAL_MEM_SIZE)\n" 
     "Local Memory of Device: %s (CL_DEVICE_LOCAL_MEM_SIZE)\n" 
     "Max Memory Object Allocation: %s (CL_DEVICE_MAX_MEM_ALLOC_SIZE)\n" 
     "Cache Size: %s (CL_DEVICE_GLOBAL_MEM_CACHE_SIZE)\n" 
     "Cacheline Size: %s (CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE)\n" 
     " VERSIONS...\n" 
     "Device Vendor: %s\n" 
     "Device Version: %s\n" 
     "Driver Version: %s\n" 
     "Device OpenCL Version: %s\n", 
     devId, 
     name, 
     (1e-3 * Gflops), 
     freq, 
     compUnits, 
     maxWorkGrps, 
     byteConverter((unsigned long)memSize), 
     byteConverter((unsigned long)localMemSize), 
     byteConverter((unsigned long)maxAlloc), 
     byteConverter((unsigned long)globalCacheSize), 
     byteConverter((unsigned long)cacheLineSize), 
     deviceVendor, 
     deviceVersion, 
     driverVersion, 
     deviceOpenCLVersion); 

     if(Gflops > MaxGflops) 
     { 
     _deviceID = id[devId]; 
     MaxGflops = Gflops; 

     winnerPlatform = platforms[platform]; 
     } 
    } 
    } 

    // Print fastest device info (TODO: don't get name twice) 
    if (clGetDeviceInfo(_deviceID, CL_DEVICE_NAME, sizeof(name), name, NULL)) Fatal("Cannot get OpenCL device name\n"); 
    printf("\n Selected Fastest Open CL Device: %s (#%lu)\n", name, (unsigned long)_deviceID); 

    // Check thread count 
    size_t mwgs; 
    if (clGetDeviceInfo(_deviceID, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(mwgs), &mwgs, NULL)) 
    Fatal("Cannot get OpenCL max work group size\n"); 

    // Create OpenCL context for fastest device 
    cl_context_properties cps[3] = 
    { 
    CL_CONTEXT_PLATFORM, 
    (cl_context_properties)winnerPlatform, 
    (cl_context_properties)0 
    }; 
    _context = clCreateContextFromType(cps, deviceType, NULL, NULL, &err); 
    if (!_context || err) Fatal("Cannot create OpenCL Context\n"); 

    // Properties for create command queue; currently nothing 
    // cl_command_queue_properties *propers; 
    cl_command_queue_properties prop = 0; 
    //prop |= CL_QUEUE_PROFILING_ENABLE; 
    //prop |= CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE; 
    // propers = &prop; 

    _queue = clCreateCommandQueueWithProperties(_context, _deviceID, &prop, &err); // Create OpenCL command queue for fastest device 
    // _queue = clCreateCommandQueue(_context, _deviceID, &prop, &err); 
    if (!_queue || err) { 
    if (err == CL_INVALID_CONTEXT) Fatal("Cannot create OpenCL command cue: CL_INVALID_CONTEXT\n"); 
    else if (err == CL_INVALID_DEVICE) Fatal("Cannot create OpenCL command cue: CL_INVALID_DEVICE\n"); 
    else if (err == CL_INVALID_VALUE) Fatal("Cannot create OpenCL command cue: CL_INVALID_VALUE\n"); 
    else if (err == CL_INVALID_QUEUE_PROPERTIES) Fatal("Cannot create OpenCL command cue: CL_INVALID_QUEUE_PROPERTIES\n"); 
    else if (err == CL_OUT_OF_RESOURCES) Fatal("Cannot create OpenCL command cue: CL_OUT_OF_RESOURCES\n"); 
    else if (err == CL_OUT_OF_HOST_MEMORY) Fatal("Cannot create OpenCL command cue: CL_OUT_OF_HOST_MEMORY\n"); 
    else if (!_queue) Fatal("Cannot create OpenCL command cue: !queue\n"); 
    else Fatal("Cannot create OpenCL command cue: ?????\n"); 
    } 

    if (_VERBOSE) printf("Init complete.\n"); 

    return mwgs; 
} 

/* 
* Modify the source text to fit this run. 
*/ 
char* ModifySourceText(unsigned int nTracesIn, unsigned int samplesPerT) { 

    size_t sourceSize = strlen(M_AND_S_OPENCL_SOURCE_TEXT) + 1; 
    char* moveStackSourceCode = new char[sourceSize]; 
    strncpy(moveStackSourceCode, M_AND_S_OPENCL_SOURCE_TEXT, sourceSize); 
    moveStackSourceCode[sourceSize] = '\0'; 

    // Print out the locations of the characters where we should insert other text if asked to do so 
    if (_FIND_INSERT_LOCATIONS) { 
    size_t z; 
    for (z = 0; z < sourceSize; z++) { 
     if (moveStackSourceCode[z] == '@') { 
     printf("Found @ at position %zu\n", z); 
     break; 
     } 
    } 
    for (z = 0; z < sourceSize; z++) { 
     if (moveStackSourceCode[z] == '#') { 
     printf("Found # at position %zu\n", z); 
     break; 
     } 
    } 
    } 

    // Insert the digit that for loops go to inside of the source 
    PrintIntInStr(moveStackSourceCode, INSERT_LOCATION_1, nTracesIn); 
    PrintIntInStr(moveStackSourceCode, INSERT_LOCATION_2, samplesPerT); 

    // Print the modified source code if verbose 
    if (_FIND_INSERT_LOCATIONS) { 
    printf("\n GPU Source Code: \n"); 
    printf("%s\n", moveStackSourceCode); 
    } 

    return moveStackSourceCode; 
} 

/* 
* Wait for event and then release it. 
*/ 
static void WaitForEventAndRelease(cl_event *event) { 

    printf("WaitForEventAndRelease()\n"); 

    cl_int status = CL_SUCCESS; 

    status = clWaitForEvents(1, event); 
    if (status) Fatal("clWaitForEvents Failed with Error Code"); 

    printf("About to release event...\n"); 

    status = clReleaseEvent(*event); 
    if (status) Fatal("clReleaseEvent Failed with Error Code"); 
} 


// Runs the program via open CL 
static double RunOpenCL(float prestackTracesArray[], float stackTracesOut1DArray[], float powerTracesOut1DArray[], 
    unsigned int nTracesOut, unsigned int nTracesIn, unsigned int samplesPerT, 
    size_t inXsamples, size_t outXsamples, 
    unsigned int localThreadCount) 
{ 

    cl_int err; 

    // Get the source code 
    char* modifiedGpuSource = ModifySourceText(nTracesIn, samplesPerT); 

    // Allocate device memory 
    // CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR | CL_MEM_USE_PERSISTENT_MEM_AMD (?) 
    // Input... 
    cl_mem prestackTracesCL = clCreateBuffer(_context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, 
    inXsamples * sizeof(cl_float), prestackTracesArray, &err); 
    if (err) FatalBufferCreation("Prestack traces", err); 
    // Output... TODO: How do we know that the output is zeroed out? 
    cl_mem stackTracesOutCL = clCreateBuffer(_context, CL_MEM_WRITE_ONLY, 
    outXsamples * sizeof(cl_float), NULL, &err); 
    if (err) FatalBufferCreation("Stack traces", err); 
    cl_mem powerTracesOutCL = clCreateBuffer(_context, CL_MEM_WRITE_ONLY, 
    outXsamples * sizeof(cl_float), NULL, &err); 
    if (err) FatalBufferCreation("Power traces", err); 

    // Compile the source code 
    char* gpuSourceText[1]; 
    gpuSourceText[0] = modifiedGpuSource; 
    size_t sourceLength[1]; 
    sourceLength[0] = strlen(modifiedGpuSource); 
    cl_program moveoutAndStackCLProgram = clCreateProgramWithSource(_context, 1, (const char**)gpuSourceText, 
    (const size_t*)sourceLength, &err); 
    if (err != CL_SUCCESS) { 
    if (err == CL_INVALID_CONTEXT) Fatal("Cannot create program: CL_INVALID_CONTEXT\n"); 
    else if (err == CL_INVALID_VALUE) Fatal("Cannot create program: CL_INVALID_VALUE\n"); 
    else if (err == CL_OUT_OF_HOST_MEMORY) Fatal("Cannot create program: CL_OUT_OF_HOST_MEMORY\n"); 
    else Fatal("Cannot create program_S %d\n", err); 
    } 

    // Build the program 
    cl_int buildCode = clBuildProgram(moveoutAndStackCLProgram, 0, NULL, NULL, NULL, NULL); 
    if (buildCode != CL_SUCCESS) { 
    // Attempt to get compile errors 
    char log[1048576]; 
    if (clGetProgramBuildInfo(moveoutAndStackCLProgram, _deviceID, CL_PROGRAM_BUILD_LOG, sizeof(log), log, NULL)) { 
     log[0] = '\0'; // Failed to get the log file 
    } 

    if (buildCode == CL_INVALID_PROGRAM) Fatal("Cannot build program: CL_INVALID_PROGRAM\n%s", log); 
    else if (buildCode == CL_INVALID_VALUE) Fatal("Cannot build program: CL_INVALID_VALUE\n%s", log); 
    else if (buildCode == CL_INVALID_DEVICE) Fatal("Cannot build program: CL_INVALID_DEVICE\n%s", log); 
    else if (buildCode == CL_INVALID_BINARY) Fatal("Cannot build program: CL_INVALID_BINARY\n%s", log); 
    else if (buildCode == CL_INVALID_BUILD_OPTIONS) Fatal("Cannot build program: CL_INVALID_BUILD\n_OPTIONS\n%s", log); 
    else if (buildCode == CL_INVALID_OPERATION) Fatal("Cannot build program: CL_INVALID_OPERATION\n%s", log); 
    else if (buildCode == CL_COMPILER_NOT_AVAILABLE) Fatal("Cannot build program: CL_COMPILER_NOT_AVAILABLE\n%s", log); 
    else if (buildCode == CL_BUILD_PROGRAM_FAILURE) Fatal("Cannot build program: CL_BUILD_PROGRAM_FAILURE\n%s", log); 
    else if (buildCode == CL_INVALID_OPERATION) Fatal("Cannot build program: CL_INVALID_OPERATION\n%s", log); 
    else if (buildCode == CL_OUT_OF_HOST_MEMORY) Fatal("Cannot build program: CL_OUT_OF_HOST_MEMORY\n%s", log); 
    else Fatal("Cannot build program: %d\n%s", buildCode, log); 
    } 

    // Compile the source code & build the kernel 
    cl_kernel kernel = clCreateKernel(moveoutAndStackCLProgram, "sumAllCL", &err); 
    if (err) { 
    if (err == CL_INVALID_PROGRAM) Fatal("Cannot create kernel: CL_INVALID_PROGRAM\n"); 
    else if (err == CL_INVALID_PROGRAM_EXECUTABLE) Fatal("Cannot create kernel: CL_INVALID_PROGRAM_EXECUTABLE\n"); 
    else if (err == CL_INVALID_KERNEL_NAME) Fatal("Cannot create kernel: CL_INVALID_KERNEL_NAME\n"); 
    else if (err == CL_INVALID_KERNEL_DEFINITION) Fatal("Cannot create kernel: CL_INVALID_KERNEL_DEFINITION\n"); 
    else if (err == CL_INVALID_VALUE) Fatal("Cannot create kernel: CL_INVALID_VALUE\n"); 
    else if (err == CL_OUT_OF_HOST_MEMORY) Fatal("Cannot create kernel: CL_OUT_OF_HOST_MEMOR\n"); 
    else Fatal("Cannot create kernel: %d\n", err); 
    } 

    // Set program parameters 
    cl_int returnValArgSet; 
    returnValArgSet = clSetKernelArg(kernel, 0, sizeof(cl_mem), &prestackTracesCL); 
    if (returnValArgSet != CL_SUCCESS) FatalSetArgs("prestackTracesCL", returnValArgSet); 
    returnValArgSet = clSetKernelArg(kernel, 1, sizeof(cl_mem), &stackTracesOutCL); 
    if (returnValArgSet != CL_SUCCESS) FatalSetArgs("stackTracesOutCL", returnValArgSet); 
    returnValArgSet = clSetKernelArg(kernel, 2, sizeof(cl_mem), &powerTracesOutCL); 
    if (returnValArgSet != CL_SUCCESS) FatalSetArgs("powerTracesOutCL", returnValArgSet); 
    returnValArgSet = clSetKernelArg(kernel, 3, sizeof(unsigned int), &nTracesOut); 
    if (returnValArgSet != CL_SUCCESS) FatalSetArgs("nTracesOut", returnValArgSet); 
    returnValArgSet = clSetKernelArg(kernel, 4, sizeof(unsigned int), &nTracesIn); 
    if (returnValArgSet != CL_SUCCESS) FatalSetArgs("nTracesIn", returnValArgSet); 
    returnValArgSet = clSetKernelArg(kernel, 5, sizeof(unsigned int), &samplesPerT); 
    if (returnValArgSet != CL_SUCCESS) FatalSetArgs("samplesPerT", returnValArgSet); 

    // TODO: verbose 
    printf("About to run Kernel...\n"); 

    // Start timer TODO: move? 
    double runTime = GetTime(); 

    // Run the kernel (& also set the number of threads) 
    cl_event runEvent; 
    size_t Global[1] = { nTracesOut }; 
    size_t Local[1] = { localThreadCount }; 
    if (localThreadCount > 0) err = clEnqueueNDRangeKernel(_queue, kernel, 1, NULL, Global, Local, 0, NULL, &runEvent); 
    else err = clEnqueueNDRangeKernel(_queue, kernel, 1, NULL, Global, NULL, 0, NULL, &runEvent); 
    if (err) { 
    if (err == CL_INVALID_PROGRAM_EXECUTABLE) { 
     Fatal("Cannot run Kernel: No successfully built program executable available.\n"); 
    } else if (err == CL_INVALID_COMMAND_QUEUE) { 
     Fatal("Cannot run Kernel: Command_queue is not a valid command-queue.\n"); 
    } else if (err == CL_INVALID_KERNEL) { 
     Fatal("Cannot run Kernel: Kernel is not a valid kernel object.\n"); 
    } else if (err == CL_INVALID_CONTEXT) { 
     Fatal("Cannot run Kernel: Context associated with command_queue and kernel is not the same or if " 
     "the context associated with command_queue and events in event_wait_list are not the same.\n"); 
    } else if (err == CL_INVALID_KERNEL_ARGS) { 
     Fatal("Cannot run Kernel: Kernel argument values have not been specified.\n"); 
    } else if (err == CL_INVALID_WORK_DIMENSION) { 
     Fatal("Cannot run Kernel: work_dim is not a valid value (must be between 1 and 3).\n"); 
    } else if (err == CL_INVALID_WORK_GROUP_SIZE) { 
     Fatal("Cannot run Kernel: local_work_size is specified and number of work-items specified by global_work_size " 
     "is not evenly divisable by size of work-group given by local_work_size or does not match the " 
     "work-group size specified for kernel using the __attribute__((reqd_work_group_size(X, Y, Z))) " 
     "qualifier in program source.\n"); 
    } else if (err == CL_INVALID_WORK_ITEM_SIZE) { 
     Fatal("Cannot run Kernel: If the number of work-items specified in any of local_work_size[0], ... " 
     "local_work_size[work_dim - 1] is greater than the corresponding values specified " 
     "by CL_DEVICE_MAX_WORK_ITEM_SIZES[0], .... CL_DEVICE_MAX_WORK_ITEM_SIZES[work_dim - 1]. .\n"); 
    } else if (err == CL_INVALID_GLOBAL_OFFSET) { 
     Fatal("Cannot run Kernel: Global_work_offset is not NULL.\n"); 
    } else if (err == CL_OUT_OF_RESOURCES) { 
     Fatal("Cannot run Kernel: CL_OUT_OF_RESOURCES.\n"); 
    } else if (err == CL_MEM_OBJECT_ALLOCATION_FAILURE) { 
     Fatal("Cannot run Kernel: Failure to allocate memory for data store associated with image or buffer " 
     "objects specified as arguments to kernel.\n"); 
    } else if (err == CL_INVALID_EVENT_WAIT_LIST) { 
     Fatal("Cannot run Kernel: event_wait_list is NULL and num_events_in_wait_list > 0, or event_wait_list " 
     "is not NULL and num_events_in_wait_list is 0, or if event objects in event_wait_list " 
     "are not valid events..\n"); 
    } else if (err == CL_OUT_OF_HOST_MEMORY) { 
     Fatal("Cannot run Kernel: Failure to allocate resources required by the OpenCL implementation on the host.\n"); 
    } else { 
     Fatal("Cannot run Kernel: Unknown Error. (clEnqueueNDRangeKernel)"); 
    } 
    } 

    // Flush the program & wait for the program to finish executing 
    if (clFlush(_queue)) printf("Flush Fail (Run)"); 
    WaitForEventAndRelease(&runEvent); 

    // Copy the end result back to CPU memory side 
    if (clEnqueueReadBuffer(_queue, stackTracesOutCL, CL_TRUE, 0, outXsamples * sizeof(cl_float), stackTracesOut1DArray, 0, NULL, NULL)) 
    Fatal("Cannot copy stackTracesOutCL from device to host\n"); 
    if (clEnqueueReadBuffer(_queue, powerTracesOutCL, CL_TRUE, 0, outXsamples * sizeof(cl_float), powerTracesOut1DArray, 0, NULL, NULL)) 
    Fatal("Cannot copy powerTracesOutCL from device to host\n"); 

    // Release kernel and program 
    if (clReleaseKernel(kernel)) Fatal("Cannot release kernel\n"); 
    if (clReleaseProgram(moveoutAndStackCLProgram)) Fatal("Cannot release program\n"); 

    // Free device memory 
    clReleaseMemObject(prestackTracesCL); 
    clReleaseMemObject(stackTracesOutCL); 
    clReleaseMemObject(powerTracesOutCL); 

    // Release the context and queue 
    clReleaseCommandQueue(_queue); 
    clReleaseContext(_context); 

    // Return the time it took to run this program 
    return runTime; 
} 

double RunProg(unsigned int samplesPerTrace, unsigned int nTracesIn, unsigned int nTracesOut, 
    unsigned int localThreadCount, unsigned int deviceType) { 

    // Stores sizes of the various arrays 
    size_t tracesInxSample = nTracesIn * samplesPerTrace; 
    size_t tracesOutxSample = nTracesOut * samplesPerTrace; 

    // Allocate arrays 
    float* prestackTraces1D = (float*)malloc(tracesInxSample * sizeof(float)); 
    float* stackTracesOut1Dgpu = (float*)calloc(tracesOutxSample, sizeof(float)); // output; zero-out 
    float* powerTracesOut1Dgpu = (float*)calloc(tracesOutxSample, sizeof(float)); // output; zero-out 

    // Count how much memory all of this is 
    if (_VERBOSE) 
    { 
    // Make sure it is consistent with above allocation 
    unsigned long allocatedMemory = 0; 
    allocatedMemory += tracesInxSample * sizeof(float); 
    allocatedMemory += tracesOutxSample * sizeof(float); 
    allocatedMemory += tracesOutxSample * sizeof(float); 

    printf("TOTAL MEMORY ALLOCATED: %s\n", byteConverter(allocatedMemory)); 
    printf("Input Array Sizes: %s\n", byteConverter((unsigned int)(tracesInxSample * sizeof(float)))); 
    printf("Output Array Sizes: %s\n", byteConverter((unsigned int)(tracesOutxSample * sizeof(float)))); 
    } 

    // Fill in array with randoms 
    RandomFillArray(prestackTraces1D, (unsigned int)tracesInxSample); 

    // Init OpenCL using the desired device type 
    double preInitTime = GetTime(); 
    int maxWorkGroupSize; 
    if (deviceType == 0) maxWorkGroupSize = InitOpenCL(_VERBOSE, CL_DEVICE_TYPE_ALL); 
    else if (deviceType == 1) maxWorkGroupSize = InitOpenCL(_VERBOSE, CL_DEVICE_TYPE_GPU); 
    else maxWorkGroupSize = InitOpenCL(_VERBOSE, CL_DEVICE_TYPE_CPU); 
    printf("Max work size for the device is: %d\n", maxWorkGroupSize); 

    // --- ACTUAL TEST --- 
    // Run OpenCL 
    double startTime = GetTime(); 
    double runTime = RunOpenCL(prestackTraces1D, stackTracesOut1Dgpu, powerTracesOut1Dgpu, // arrays 
    nTracesOut, nTracesIn, samplesPerTrace, // ints 
    tracesInxSample, tracesOutxSample, 
    localThreadCount); // samples 

    // Display run time 
    double endTime = GetTime(); 
    printf("Elapsed Time:  %fsecs\n", (endTime - runTime)); 
    printf("     %fsecs (Before Function Call)\n", (endTime - startTime)); 
    printf("     %fsecs (Including Init)\n\n", (endTime - preInitTime)); 

    // Free the 1D arrays 
    free(powerTracesOut1Dgpu); 
    free(stackTracesOut1Dgpu); 
    free(prestackTraces1D); 

    return (endTime - startTime); 
}

我首先想到的，为什么它的运行，所以在我的GPU比我的CPU慢得多的是，也许是因为我在显卡衬套这么多的数据的任何运行之前。也许更好的实现会涉及到多次运行中的工作负载分解，以便代码可以在更多数据被汇总时执行（我认为这是一件事情）。但是现在我认为这几乎肯定是错误的，因为正如我所说我基于一个示例编写了该程序，并且该示例执行了矩阵乘法，并且该示例在GPU上的运行速度比我的CPU快得多。我真的不知道有什么不同。

来源

2016-10-10 danglingPointer

请发布一个最简单的例子 – Jovasa

在总共320个内核的5个计算单元的gpu上，需要1秒钟的时间？= 10和＃= 250以及参数25000 2500 250。你的GPU有44个计算单元。 –

这里有一个简单的例子：https：//forums.khronos.org/showthread。php/13242-why-my-program-run-our-my-CPU-device-than-my-GPU-device – danglingPointer

问题在于缓存;我从阵列中读了很多东西。所以我写了一个版本，在写入数组之前尽可能多地写入局部变量，现在它在GPU上运行速度更快。

来源

2016-11-08 18:22:56 danglingPointer

也是数据量太小的软件，当你拥有大量数据时，GPU会发挥最好的效果 – Mgetz

为什么我的程序在我的CPU设备上运行得比在我的GPU设备上快得多？

回答

相关问题