Execution time issue in CUDA benchmarks

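I am trying to profile some CUDA Rodinia benchmarks in terms of their SM and memory utilization, power consumption, etc. To do that, I run each benchmark and, at the same time, a small program that spawns a pthread to profile GPU execution using the NVML library.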

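The problem is that the execution time of a benchmark is much higher (about 3 times) when I do not invoke the profiler than when I run the benchmark with the profiler. The CPU frequency scaling governor is userspace, so I don't think the CPU frequency is changing. Is it due to fluctuation of the GPU frequency? Below is the code of the profiler.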

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>   /* exit() */
#include <unistd.h>   /* usleep() */
#include "nvml.h"

#define NUM_THREADS 1

/* Polls GPU power, utilization and SM clock through NVML every 500 ms. */
void *PrintHello(void *threadid)
{
    nvmlReturn_t result;
    nvmlDevice_t device;
    nvmlUtilization_t utilization;
    unsigned int device_count, powergpu, clo;
    char version[80];

    (void)threadid;   /* thread id is not used */

    result = nvmlInit();
    if (result != NVML_SUCCESS) {
        printf("nvmlInit failed: %s\n", nvmlErrorString(result));
        pthread_exit(NULL);
    }

    result = nvmlSystemGetDriverVersion(version, 80);
    if (result == NVML_SUCCESS)
        printf("\n Driver version: %s \n\n", version);

    result = nvmlDeviceGetCount(&device_count);
    if (result == NVML_SUCCESS)
        printf("Found %u device%s\n\n", device_count,
               device_count != 1 ? "s" : "");

    printf("Listing devices:\n");
    result = nvmlDeviceGetHandleByIndex(0, &device);
    if (result != NVML_SUCCESS) {
        printf("No handle for device 0: %s\n", nvmlErrorString(result));
        nvmlShutdown();
        pthread_exit(NULL);
    }

    while (1) {
        result = nvmlDeviceGetPowerUsage(device, &powergpu);   /* milliwatts */
        if (result == NVML_SUCCESS)
            printf("\n%u\n", powergpu);

        result = nvmlDeviceGetUtilizationRates(device, &utilization);
        if (result == NVML_SUCCESS) {
            printf("%u\n", utilization.gpu);      /* percent */
            printf("%u\n", utilization.memory);   /* percent */
        }

        result = nvmlDeviceGetClockInfo(device, NVML_CLOCK_SM, &clo);   /* MHz */
        if (result == NVML_SUCCESS)
            printf("%u\n", clo);

        usleep(500000);
    }

    return NULL;   /* not reached */
}

int main(int argc, char *argv[])
{
    pthread_t threads[NUM_THREADS];
    int rc;
    long t;

    for (t = 0; t < NUM_THREADS; t++) {
        printf("In main: creating thread %ld\n", t);
        rc = pthread_create(&threads[t], NULL, PrintHello, (void *)t);
        if (rc) {
            printf("ERROR; return code from pthread_create() is %d\n", rc);
            exit(-1);
        }
    }

    /* Last thing that main() should do */
    pthread_exit(NULL);
}

Source

2013-05-11 Vaibhav Sundriyal

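When the GPU is idle or in a sleep state, it can take a significant amount of time before it starts processing work. When you run your "profiler code", you are pulling the GPU out of its sleep state, so your benchmark may run faster. You've provided so little data in this question that it's hard to speculate about what is going on, or even what your observations are. You could try setting the GPU to persistence mode, which should have a similar effect to running the "profiler code". By the way, you don't seem to have accepted any answers to your previous questions. – 2013-05-12 23:21:14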

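As @RobertCrovella said, try setting the GPU to persistence mode: the NVIDIA driver then stays loaded even when no active clients are connected to the GPU, which avoids significant GPU initialization overhead. On Linux, this can be done by running 'nvidia-smi -pm 1' ('0' turns it off). Your GPU may not support this option. – BenC 2013-05-13 02:02:33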

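Robert Crovella - does setting the GPU to persistence mode with nvidia-smi require root privileges? I have accepted your answers to my previous questions. I didn't know such a thing existed. – 2013-05-13 15:39:07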

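With your profiler running, the GPU(s) are pulled out of their sleep state (because of the access to the NVML API, which queries data from the GPUs). This makes them respond more quickly to a CUDA application, so the application seems to run "faster" if you time the entire application execution (e.g. using the linux time command).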

One solution is to put the GPUs in "persistence mode" using the nvidia-smi command (use nvidia-smi --help to get command line help).

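Another solution is to do the timing from within the application and exclude CUDA start-up time from the timing measurement, perhaps by executing a CUDA command (such as cudaFree(0);) before the start of timing.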
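The second suggestion can be sketched in plain C roughly as follows. This is only an illustration of the pattern (run a one-time warm-up step, then start the clock), not code from the question: the names step_fn, now_seconds, time_after_warm_up, fake_warm_up and fake_work are hypothetical, and the two fake_* functions are CPU-only stand-ins so that the sketch compiles without a GPU.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() under strict -std modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: any step that takes no arguments. */
typedef void (*step_fn)(void);

/* Monotonic wall-clock time in seconds. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/*
 * Runs warm_up untimed (if given), then returns the seconds spent in work.
 * Assumption: in a real CUDA program warm_up would issue something like
 * cudaFree(0) to force start-up, and work would end with
 * cudaDeviceSynchronize() so that kernels finish before the clock stops.
 */
static double time_after_warm_up(step_fn warm_up, step_fn work)
{
    double start;

    assert(work != NULL);
    if (warm_up != NULL)
        warm_up();              /* one-time start-up cost, not measured */

    start = now_seconds();
    work();                     /* the region actually being measured */
    return now_seconds() - start;
}

/* CPU-only stand-ins, for illustration only. */
static int warm_up_calls = 0;

static void fake_warm_up(void)
{
    warm_up_calls++;
}

static void fake_work(void)
{
    volatile double sum = 0.0;
    int i;

    for (i = 0; i < 100000; i++)
        sum += (double)i;
}
```

Calling time_after_warm_up(fake_warm_up, fake_work) returns the elapsed seconds for fake_work alone. Timing the whole process with the linux time command would instead include the start-up cost, which is the difference the answer describes.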

Source

2014-02-10 05:40:05
