2016-12-28

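OpenACC: creating data at runtime inside a kernel

I have a task that I want to accelerate with OpenACC. I need to allocate memory dynamically inside the kernel computation. I built a simpler demo of it, shown below.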

#include <iostream> 

using namespace std; 

#pragma acc routine seq 
int *routine(int init) { 
    int *ptr; 
    #pragma acc data create(ptr[:10]) 
    for (int i = 0; i < 10; ++i) { 
     ptr[i] = init + i; 
    } 
    return ptr; 
} 

void print_array(int *arr) { 
    for (int i = 0; i < 10; ++i) { 
     cout << arr[i] << " "; 
    } 
    cout << endl; 
} 

int main(void) { 
    int *arrs[5]; 

#pragma acc kernels 
    for (int i = 0; i < 5; ++i) { 
     arrs[i] = routine(i); 
    } 

    for (int i = 0; i < 5; ++i) { 
     print_array(arrs[i]); 
    } 
    return 0; 
} 

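In this demo, I try to call a routine at runtime from inside the kernels construct. The routine is meant to create some data on the GPU and write some values into it.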

The code compiles, but it fails at runtime as shown below.

create_and_copyout$ pgc++ -o test main.cc -acc -Minfo=accel 
routine(int): 
     6, Generating acc routine seq 
main: 
    23, Generating implicit copyout(arrs[:]) 
    26, Accelerator restriction: size of the GPU copy of arrs is unknown 
     Loop is parallelizable 
     Generating implicit copy(arrs[:][:]) 
     Accelerator kernel generated 
     Generating Tesla code 
     26, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */ 
create_and_copyout$ ./test 
call to cuStreamSynchronize returned error 715: Illegal instruction 

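I'm wondering what I should do to accomplish this (allocating memory dynamically while a kernels construct is executing). I would really appreciate any help.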

Answer


This is untested and probably very slow, but it might do what you need.

#include <cstdlib>   // malloc
#include <openacc.h> // acc_async_test, acc_deviceptr

int main() { 
    const int num = 20; 
    int a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 0}; 
    int* sizes = (int *)malloc(num * sizeof(int)); 
    int *ptrs[num]; 
    int* temp, *temp2; 
    int sum; 
    int* finished = (int *)malloc(num * sizeof(int)); 
    for (int x = 0; x < num; ++x){ 
     finished[x] = 0; 
    } 
    #pragma acc kernels copyin(a[0:10]) copyout(ptrs[:num][:1]) async(num*2+1) 
    { 
     #pragma acc loop private(temp) 
     for (int i = 0; i < num; ++i){ 
      #pragma acc loop seq async(i) 
      for (int j = 0; j < 1; ++j){ 
       temp = ptrs[i]; 
       sizes[i] = ... 
      } 
      // spin until the host has published a device pointer for slot i 
      while (ptrs[i] != x); 
      ptrs[i] = routine(a, sizes[i]); 
     } 
    } 

    while (true){ 
     sum = 0; 
     for (int x = 0; x < num; ++x){ 
      sum += finished[x]; 
     } 
     if (sum == num){ 
      break; 
     } 
     for (int x = 0; x < num; ++x){ 
      if (acc_async_test(x) != 0 && finished[x] == 0){ 
       finished[x] = 1; 
       #pragma acc update host(sizes[x:1]) 
       temp = (int *)malloc(sizes[x] * sizeof(int)); 
       #pragma acc enter data copyin(temp[0:sizes[x]]) 
       temp2 = (int *)acc_deviceptr(temp); 
       ptrs[x] = temp2; 
       #pragma acc update device(ptrs[x:1][0:1]) 
      } 
     } 
    } 
} 
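
If an upper bound on the size of each result is known in advance, a simpler alternative may be to avoid allocating inside the kernel altogether. The sketch below is not the approach above but a swapped-in technique: one flat buffer is created before the compute region, and each iteration fills its own slice. `fill_row` and `fill_all` are hypothetical names, and `fill_row` is an assumed stand-in for the question's `routine` that writes into a caller-provided row instead of returning a new pointer.

```cpp
#include <vector>

// Hypothetical stand-in for the question's routine(): it does not allocate,
// it fills a caller-provided row of `len` ints with init, init+1, ...
#pragma acc routine seq
void fill_row(int *row, int len, int init) {
    for (int j = 0; j < len; ++j) {
        row[j] = init + j;
    }
}

// Allocates one flat rows*len buffer on the host. The copyout clause creates
// the matching device buffer before the loop runs and copies it back after.
std::vector<int> fill_all(int rows, int len) {
    std::vector<int> flat(rows * len);
    int *data = flat.data();

    #pragma acc parallel loop copyout(data[0:rows*len])
    for (int i = 0; i < rows; ++i) {
        // Row i owns the slice [i*len, (i+1)*len), so iterations never overlap.
        fill_row(data + i * len, len, i);
    }
    return flat;
}
```

Calling `fill_all(5, 10)` corresponds to the demo in the question: row `i` starts at index `i * 10` of the returned vector. If the rows really have different sizes, this only works when a reasonable maximum size can be assumed.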

Thanks, Kyle. However, performance is critical. I'm considering discarding some of the results and keeping only the best one.
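
One way the "keep only the best" idea could look, as a rough sketch under assumptions: the kernel stores a single fixed-size score per candidate, so nothing has to be allocated on the device, and the host then picks the winner. `score_of` and `best_candidate` are hypothetical names, the scoring formula is a placeholder, and "best" is assumed to mean the lowest score.

```cpp
#include <vector>

// Hypothetical scoring function (placeholder formula); lower is better here.
#pragma acc routine seq
int score_of(int candidate) {
    int d = candidate - 3;
    return d * d;
}

// Returns the index of the best-scoring candidate in [0, n), for n >= 1.
int best_candidate(int n) {
    std::vector<int> scores(n);
    int *s = scores.data();

    // Pass 1 (device): one int per candidate, so the buffer size is known up front.
    #pragma acc parallel loop copyout(s[0:n])
    for (int i = 0; i < n; ++i) {
        s[i] = score_of(i);
    }

    // Pass 2 (host): pick the winner; ties keep the lowest index.
    int best = 0;
    for (int i = 1; i < n; ++i) {
        if (s[i] < s[best]) best = i;
    }
    return best;
}
```

Only the winning candidate's full result would then need to be built, and its size can be determined on the host before any device buffer is created.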
