2017-06-21
2

I am trying to get some code to run on a GPU using OpenMP, but I am not succeeding. In my code I perform a matrix multiplication using for loops: once marked with an OpenMP pragma and once without. (That way I can compare the execution times.) After the first loop I call omp_get_num_devices() (this is my main test to see whether I am actually connecting to a GPU). No matter what I try, omp_get_num_devices() always returns 0. How do I use the GPUs that OpenMP is supposed to give me access to?

The machine I am using has two NVIDIA Tesla K40m GPUs. CUDA 7.0 and CUDA 7.5 are available on the machine as modules, with the CUDA 7.5 module normally active. gcc 4.9.3, 5.1.0, and 7.1.0 are all available as modules, with the gcc 7.1.0 module normally active. I compile my code with $ g++ -fopenmp -omptargets=nvptx64sm_35-nvidia-linux ParallelExperimenting.cpp -o ParallelExperimenting. I have successfully parallelized OpenMP code on the CPU, but not on the GPU.

My main goal is to get omp_get_num_devices() to return 2 as proof that I can detect and use the GPUs with OpenMP. Any help here would be greatly appreciated.

Here is the code I am using to check whether or not the GPU is being used correctly:

#include <omp.h>
#include <fstream>
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <time.h>
#include <iomanip>
#include <iostream>
using namespace std; 

// Inputs and result for the first (GPU-section) multiplication
double A [501][501];
double B [501][501];
double C [501][501][501];
// Inputs and result for the second (CPU-section) multiplication
double D [501][501];
double E [501][501];
double F [501][501][501];
double dummyvar; // accumulator that keeps the compiler from eliminating the loops
int Mapped [501]; // per-iteration omp_is_initial_device() results (currently unused)

int main() { 
    int i, j, k, l, N, StallerGPU, StallerCPU; 

    // Matrix dimension
    N = 500;

    // Variables used merely to make the execution take longer and to
    // exaggerate the difference in performance between the first and
    // second calculations
    StallerGPU = 200; 
    StallerCPU = 200; 

    std::cout << " N = " << N << "\n"; 
    // generate matrix to be used in first calculation 
    for (i=0; i<N; i++) { 
     for (k=0; k<N; k++) { 
      if (i == k) { 
       A[i][k] = i+1; 
      } else { 
       A[i][k] = i * k/N; 
      } 
     } 
    } 
    // generate other matrix to be used for the first calculation 
    for (k=0; k<N; k++) { 
     for (j=0; j<N; j++) { 
      B[k][j] = 2*(N-1)-k-j; 
     } 
    } 

// Slightly adjusted matrices for second calculation 
    for (i=0; i<N; i++) { 
     for (k=0; k<N; k++) { 
      if (i == k) { 
       D[i][k] = i+2; 
      } else { 
       D[i][k] = i * k/N - 1; 
      } 
     } 
    } 

    for (k=0; k<N; k++) { 
     for (j=0; j<N; j++) { 
      E[k][j] = 2*(N+1)-k-j; 
     } 
    } 

    dummyvar = 0; 

    // Run the multiplication in parallel using the GPU

    double diff; 
    time_t time1; 
    time1 = time(NULL); // CPU time counter 
    cout << endl << " GPU section begins at " << ctime(&time1) << endl; 

     // This pragma is frequently changed to try different tags 
     #pragma omp for collapse(4) private(i, j, k, l) 

     for (i=0; i<N; i++) { 
//   Mapped[i] = omp_is_initial_device(); 
      for (j=0; j<N; j++) { 
       for (k=0; k<N; k++) { 
        for(l = 0; l < StallerGPU; l++) { 
         C[i][j][k] = A[i][k] * B[k][j] ; 
         dummyvar += A[i][k] * B[k][j] * (l + 1); 
        } 
       } 
//   cout << " i " << i << endl; 
      } 
     } 


    // Record the time it took to run the multiplication
    time_t time2 = time(NULL); 
    cout << " number of devices: " << omp_get_num_devices() << endl; 
    cout << " dummy variable: " << dummyvar << endl; 

    float cpumin = difftime(time2,time1); 
    diff = difftime(time2,time1); 
    cout << " stopping at delta GPU time: " << cpumin << endl; 
    cout << " terminating at " << ctime(&time2) << endl; 
    cout << " GPU time elasped " << diff << " s" << endl; 
    cout << endl; 

    dummyvar = 0; 
    time_t time3 = time(NULL); 
    cout << endl << " CPU section begins at " << ctime(&time3) << endl; 
// #pragma omp single 
    for (i=0; i<N; i++) { 
     for (j=0; j<N; j++) { 
      for (k=0; k<N; k++) { 
       for (int l=0; l<StallerCPU; l++) { 
        F[i][j][k] = D[i][k] * E[k][j]; 
        dummyvar += D[i][k] * E[k][j] * (l - 1); 
       } 
      } 
     } 
    } 
    // the sum to complete the matrix calculation is left out here, but would 
    // only be used to check if the result of the calculation is correct 

    time_t time4 = time(NULL); 
    cpumin = difftime(time4,time3); 
    diff = difftime(time4,time3); 
    cout << " dummy variable: " << dummyvar << endl; 
    cout << " stopping at delta CPU time: " << cpumin << endl; 
    cout << " terminating at " << ctime(&time4) << endl; 
    cout << " CPU time elasped " << diff << " s" << endl; 
    //Compare the time it took to confirm that we actually used GPUs to parallelize. 
} 

Here are the results of running the deviceQuery CUDA sample:

./deviceQuery Starting... 

CUDA Device Query (Runtime API) version (CUDART static linking) 

Detected 2 CUDA Capable device(s) 

Device 0: "Tesla K40m" 
    CUDA Driver Version/Runtime Version   7.5/7.5 
    CUDA Capability Major/Minor version number: 3.5 
    Total amount of global memory:     11520 MBytes (12079136768 bytes) 
    (15) Multiprocessors, (192) CUDA Cores/MP:  2880 CUDA Cores 
    GPU Max Clock rate:       745 MHz (0.75 GHz) 
    Memory Clock rate:        3004 Mhz 
    Memory Bus Width:        384-bit 
    L2 Cache Size:         1572864 bytes 
    Maximum Texture Dimension Size (x,y,z)   1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096) 
    Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers 
    Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers 
    Total amount of constant memory:    65536 bytes 
    Total amount of shared memory per block:  49152 bytes 
    Total number of registers available per block: 65536 
    Warp size:          32 
    Maximum number of threads per multiprocessor: 2048 
    Maximum number of threads per block:   1024 
    Max dimension size of a thread block (x,y,z): (1024, 1024, 64) 
    Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) 
    Maximum memory pitch:       2147483647 bytes 
    Texture alignment:        512 bytes 
    Concurrent copy and kernel execution:   Yes with 2 copy engine(s) 
    Run time limit on kernels:      No 
    Integrated GPU sharing Host Memory:   No 
    Support host page-locked memory mapping:  Yes 
    Alignment requirement for Surfaces:   Yes 
    Device has ECC support:      Enabled 
    Device supports Unified Addressing (UVA):  Yes 
    Device PCI Domain ID/Bus ID/location ID: 0/130/0 
    Compute Mode: 
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) > 

Device 1: "Tesla K40m" 
    CUDA Driver Version/Runtime Version   7.5/7.5 
    CUDA Capability Major/Minor version number: 3.5 
    Total amount of global memory:     11520 MBytes (12079136768 bytes) 
    (15) Multiprocessors, (192) CUDA Cores/MP:  2880 CUDA Cores 
    GPU Max Clock rate:       745 MHz (0.75 GHz) 
    Memory Clock rate:        3004 Mhz 
    Memory Bus Width:        384-bit 
    L2 Cache Size:         1572864 bytes 
    Maximum Texture Dimension Size (x,y,z)   1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096) 
    Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers 
    Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers 
    Total amount of constant memory:    65536 bytes 
    Total amount of shared memory per block:  49152 bytes 
    Total number of registers available per block: 65536 
    Warp size:          32 
    Maximum number of threads per multiprocessor: 2048 
    Maximum number of threads per block:   1024 
    Max dimension size of a thread block (x,y,z): (1024, 1024, 64) 
    Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) 
    Maximum memory pitch:       2147483647 bytes 
    Texture alignment:        512 bytes 
    Concurrent copy and kernel execution:   Yes with 2 copy engine(s) 
    Run time limit on kernels:      No 
    Integrated GPU sharing Host Memory:   No 
    Support host page-locked memory mapping:  Yes 
    Alignment requirement for Surfaces:   Yes 
    Device has ECC support:      Enabled 
    Device supports Unified Addressing (UVA):  Yes 
    Device PCI Domain ID/Bus ID/location ID: 0/131/0 
    Compute Mode: 
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) > 
> Peer access from Tesla K40m (GPU0) -> Tesla K40m (GPU1) : Yes 
> Peer access from Tesla K40m (GPU1) -> Tesla K40m (GPU0) : Yes 

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 2, Device0 = Tesla K40m, Device1 = Tesla K40m 
Result = PASS 
+0

Could you upload a minimal working example that shows what you are trying to do? – Richard

+0

Welcome to Stack Overflow! Unfortunately your post is missing a [mcve]. Please visit the [help center](http://stackoverflow.com/help) and read [How do I ask a good question?](http://stackoverflow.com/help/how-to-ask). –

+0

I have added my test code. – Josiah

Answers

1

GCC 4.9.3 and 5.1.0 definitely do not support OpenMP offloading to the GPU. GCC 7.1.0 does support it, but it has to be built with special configuration options, as described here.
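
For reference, here is a minimal smoke test to run once gcc has been rebuilt with offloading enabled (a sketch: the compile line g++ -fopenmp -foffload=nvptx-none smoke.cpp -o smoke and the configure option --enable-offload-targets=nvptx-none are my assumptions about a typical NVPTX setup):

#include <cstdio>
#include <omp.h>

int main() {
    // Number of non-host devices visible to the OpenMP runtime;
    // this should report 2 on a machine with two usable K40m GPUs.
    std::printf("devices: %d\n", omp_get_num_devices());

    int on_host = 1;
    // map(tofrom:) makes the value written on the device visible
    // on the host again after the target region ends.
    #pragma omp target map(tofrom: on_host)
    {
        on_host = omp_is_initial_device(); // 0 when running on the GPU
    }
    std::printf("ran on the %s\n", on_host ? "host" : "device");
    return 0;
}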

+0

That solved my problem! Thank you so much!!! – Josiah

0

I may be pointing in the wrong direction, but I want to help, because I was once stuck in a strange situation when using a GPU.

You need to be in the Linux "video" group in order to use the GPU; otherwise, every result returned from the GPU will be 0.

So I would suggest you run the CUDA sample code to check whether you are stuck in the same situation I once was.

It is strange, and I am not sure I have described it correctly. Hope it helps.


According to this: https://wiki.gentoo.org/wiki/NVidia/nvidia-drivers

Any user(s) needing access to the video card will need to be added to the video group.
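
On most Linux distributions this is done with something along the lines of 'sudo usermod -a -G video <username>' (a sketch; the exact group name can vary by distribution and driver), followed by logging out and back in so the new group membership takes effect.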

+0

Why don't you provide the link as a reference? This sounds like a good answer, and it would sound even better with one. – Thecave3

+0

http://support.amd.com/en-us/kb-articles/Pages/AMDGPU-PRO-Install.aspx –

+0

I will add the results of running the deviceQuery CUDA sample code. – Josiah

2

I may be wrong, but I think the posted code needs a couple of corrections (perhaps you are already aware of them). To actually run on a GPU target with OpenMP, you need to replace:

#pragma omp for collapse(4) private(i, j, k, l) 

with:

#pragma omp target teams distribute parallel for collapse(4) private(i, j, k, l) 
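
(A hedged aside beyond the pragma itself: since dummyvar is accumulated across iterations, the offloaded loop would typically also need a reduction(+: dummyvar) clause, and the global arrays an explicit map clause or a declare target directive, for the offloaded result to be correct.)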

You can verify whether the kernel actually runs on the GPU by profiling your executable with 'nvprof'; it should show kernels being executed on the GPU. You can also change the number of teams and threads in the target region with the 'num_teams' and 'thread_limit' clauses, and you should see corresponding changes in your profile.
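
As an illustration, here is a minimal self-contained sketch of both clauses (the function name 'scale' and the values 15 and 128 are arbitrary picks for this example, loosely matching the 15 multiprocessors of a K40m):

#include <omp.h>

// Doubles each element of x on the device, requesting at most 15 teams
// of at most 128 threads each; the runtime is free to use fewer.
void scale(int n, double *x) {
    #pragma omp target teams distribute parallel for num_teams(15) thread_limit(128) map(tofrom: x[0:n])
    for (int i = 0; i < n; i++) {
        x[i] *= 2.0;
    }
}

Changing 'num_teams' or 'thread_limit' here should show up as a corresponding change in the kernel launch configuration reported by 'nvprof'.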

To programmatically check whether the target region actually runs on the target device, I use the 'omp_is_initial_device()' call, which returns 0 when called from the accelerator. Here is an example:

int A[1] = {-1}; 
#pragma omp target 
{ 
    // Returns 0 when this region actually executes on the accelerator.
    // A one-element array is used because an array referenced in a target
    // region is mapped tofrom by default, so the value written on the
    // device is copied back to the host (in OpenMP 4.5 a plain scalar is
    // firstprivate by default, so a write to it would not come back).
    A[0] = omp_is_initial_device(); 
} 

if (!A[0]) { 
    printf("Able to use offloading!\n"); 
} 
+0

I tried profiling it with 'nvprof' as you suggested. After the program finished executing, I got the error '======== Warning: No CUDA application was profiled, exiting'. And when I add 'omp_is_initial_device()', it returns 1 every time. – Josiah

+0

That seems to strongly suggest that your kernel is running on the CPU. As Ilya mentioned, you may need to rebuild gcc with GPU offloading support. –

+0

Why do you need to use a one-element array rather than just a plain integer? I tried your code and it only works with an array, but I don't understand why. –