CUDA时钟（）导致零时钟周期

我想使用clock（）来比较不同的内核实现。我试图用一个简单的SAXPY例子来实现它，但它导致零时钟周期，这是不太可能的。CUDA时钟（）导致零时钟周期

我已经找到了一些关于如何实现clock（）的例子。 here和here。但不知何故转移到我的代码不起作用。

这里是我使用的代码：

/* SAXPY code example from https://devblogs.nvidia.com/parallelforall/easy-introduction-cuda-c-and-c/ */ 

#include <stdio.h> 

// The declaration specifier __global__ defines a kernel. This code 
// will be copied to the device and will be executed there in parallel 
__global__ 
void saxpy(int n, float a, float *x, float *y, int *kernel_clock) 
{ 
    // The indexing of the single threads is done with the following 
    // code line 
    int i = blockIdx.x*blockDim.x + threadIdx.x; 

    clock_t start = clock(); 

    // Each thread is executing just one position of the arrays 
    if (i < n) y[i] = a*x[i] + y[i]; 

    clock_t stop = clock(); 

    kernel_clock[i] = (int) (stop-start); 
} 

int main(void) 
{ 
    // Clock cycles of threads 
    int *kernel_clock; 
    int *d_kernel_clock; 
    // Creating a huge number 
    int N = 1<<20; 
    float *x, *y, *d_x, *d_y; 
    // Allocate an array on the *host* of the size of N 
    x = (float*)malloc(N*sizeof(float)); 
    y = (float*)malloc(N*sizeof(float)); 
    kernel_clock = (int*)malloc(N*sizeof(int)); 

    // Allocate an array on the *device* of the size of N 
    cudaMalloc(&d_x, N*sizeof(float)); 
    cudaMalloc(&d_y, N*sizeof(float)); 
    cudaMalloc(&d_kernel_clock, N*sizeof(int)); 

    // Filling the array of the host 
    for (int i = 0; i < N; i++) { 
    x[i] = 1.0f; 
    y[i] = 2.0f; 
    } 

    // Copy the host array to the device array 
    cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice); 
    cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice); 
    cudaMemcpy(d_kernel_clock, kernel_clock, N*sizeof(int), cudaMemcpyHostToDevice); 

    // Perform SAXPY on 1M elements. The triple chevrons dedicates how 
    // the threads are grouped on the device 
    saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y, d_kernel_clock); 
    cudaDeviceSynchronize(); 

    // Copy the result from the device to the host 
    cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost); 
    cudaMemcpy(kernel_clock, d_kernel_clock, N*sizeof(int), cudaMemcpyDeviceToHost); 

    // Calculate average clock time 
    float average_clock = 0; 
    for (int i = 0; i < N; i++) { 
     average_clock += (float) (kernel_clock[i]); 
    } 
    average_clock /= N; 

    // Display the time to the screen 
    printf ("Kernel clock cycles: %.4f\n", average_clock); 

    // Free the memory on the host and device 
    free(x); 
    free(y); 
    free(kernel_clock); 
    cudaFree(d_x); 
    cudaFree(d_y); 
    cudaFree(d_kernel_clock); 
}

此代码示例导致：

Kernel clock cycles: 0.0000

我不知道我做错了。所以我的问题是：我怎样才能得到一个合理的结果？

来源

2016-10-03 stebran

我没有看到任何错误检查。如果你用'cuda-memcheck'运行你的代码会发生什么？ –

'cuda-memcheck'提供0错误 '========错误摘要：0错误' – stebran

从你在你的问题联系到答案的一个引用

你也应该知道，编译器和汇编做执行指令重新排序，所以你可能要检查时钟电话在SASS输出（使用cuobjdump来检查）时，不要缠绕在彼此的旁边。

我相信这是你问题的根源。如果我与CUDA 8日发布工具包编译内核，然后拆卸与cuobjdump所产生的机器代码，我得到如下：

code for sm_52 
      Function : _Z5saxpyifPfS_Pi 
    .headerflags @"EF_CUDA_SM52 EF_CUDA_PTX_SM(EF_CUDA_SM52)" 
                          /* 0x001c4400fe0007f6 */ 
    /*0008*/     MOV R1, c[0x0][0x20];          /* 0x4c98078000870001 */ 
    /*0010*/   {   CS2R R7, SR_CLOCKLO;          /* 0x50c8000005070007 */ 
    /*0018*/     S2R R0, SR_CTAID.X;  }        /* 0xf0c8000002570000 */ 
                          /* 0x083fc400e3e007f0 */ 
    /*0028*/   {   CS2R R8, SR_CLOCKLO;          /* 0x50c8000005070008 */ 
    /*0030*/     S2R R2, SR_TID.X;  }         /* 0xf0c8000002170002 */ 
    /*0038*/     XMAD.MRG R3, R0.reuse, c[0x0] [0x8].H1, RZ;     /* 0x4f107f8000270003 */ 
                          /* 0x081fc400fec207f6 */ 
    /*0048*/     XMAD R2, R0.reuse, c[0x0] [0x8], R2;      /* 0x4e00010000270002 */ 
    /*0050*/     XMAD.PSL.CBCC R0, R0.H1, R3.H1, R2;       /* 0x5b30011800370000 */ 
    /*0058*/     ISETP.GE.AND P0, PT, R0.reuse, c[0x0][0x140], PT;   /* 0x4b6d038005070007 */ 
                          /* 0x001fd400fc2007ec */ 
    /*0068*/     SHR R9, R0, 0x1f;           /* 0x3829000001f70009 */ 
    /*0070*/    @!P0 SHF.L.U64 R2, RZ, 0x2, R0;         /* 0x36f800400028ff02 */ 
    /*0078*/    @!P0 SHF.L.U64 R3, R0, 0x2, R9;         /* 0x36f804c000280003 */ 
                          /* 0x001fc040fe4207f6 */ 
    /*0088*/    @!P0 IADD R4.CC, R2.reuse, c[0x0][0x148];      /* 0x4c10800005280204 */ 
    /*0090*/    @!P0 IADD.X R5, R3.reuse, c[0x0][0x14c];       /* 0x4c10080005380305 */ 
    /*0098*/   { @!P0 IADD R2.CC, R2, c[0x0][0x150];        /* 0x4c10800005480202 */ 
    /*00a8*/    @!P0 LDG.E R4, [R4];  }         /* 0x0005c400fe400076 */ 
                          /* 0xeed4200000080404 */ 
    /*00b0*/    @!P0 IADD.X R3, R3, c[0x0][0x154];        /* 0x4c10080005580303 */ 
    /*00b8*/    @!P0 LDG.E R6, [R2];            /* 0xeed4200000080206 */ 
                          /* 0x001fd800fea007e1 */ 
    /*00c8*/     LEA R10.CC, R0, c[0x0][0x158], 0x2;       /* 0x4bd781000567000a */ 
    /*00d0*/     IADD R8, -R7, R8;           /* 0x5c12000000870708 */ 
    /*00d8*/     LEA.HI.X R9, R0, c[0x0][0x15c], R9, 0x2;     /* 0x1a17048005770009 */ 
                          /* 0x001fc008fe4007f1 */ 
    /*00e8*/     MOV R7, R9;             /* 0x5c98078000970007 */ 
    /*00f0*/    @!P0 FFMA R0, R4, c[0x0][0x144], R6;        /* 0x4980030005180400 */ 
    /*00f8*/   {   MOV R6, R10;            /* 0x5c98078000a70006 */ 
    /*0108*/    @!P0 STG.E [R2], R0;  }         /* 0x001ffc005e2001f2 */ 
                          /* 0xeedc200000080200 */ 
    /*0110*/     STG.E [R6], R8;            /* 0xeedc200000070608 */ 
    /*0118*/     EXIT;              /* 0xe30000000007000f */ 
                          /* 0x001f8000fc0007ff */ 
    /*0128*/     BRA 0x120;             /* 0xe2400fffff07000f */ 
    /*0130*/     NOP;              /* 0x50b0000000070f00 */ 
    /*0138*/     NOP;              /* 0x50b0000000070f00 */ 
      .................................

你可以看到时钟指令已被重新排序，以便他们被称为无任何他们之间的代码。对于许多（如果不是全部的话）经线运行此代码，这将导致零或非常接近零时钟测量。

来源

2016-10-03 14:32:42 talonmies

谢谢！我现在明白了这个问题，但是我在输出中看到了哪些行？ – stebran

CUDA时钟（）导致零时钟周期

回答

相关问题