__CUDA_ARCH__和内核调用在__host__ __device__功能

我碰到一个奇怪的效果就出来：__CUDA_ARCH__和内核调用在__host__ __device__功能

#define CUDA_ERR_CHECK(call) call 

#include <assert.h> 
#include <iostream> 

using namespace std; 

#if defined(__CUDACC__) 

// Determine the size of type on device. 
template<typename T> 
__global__ void deviceSizeOf(size_t* result) 
{ 
    *result = sizeof(T); 
} 

// Device memory aligned vector. 
template<typename T> 
class VectorDevice 
{ 
    T* data; 
    size_t size; 
    int dim, dim_aligned; 

public : 
    __host__ __device__ 
    VectorDevice() : data(NULL), size(0), dim(0) { } 

    __host__ __device__ 
    VectorDevice(int dim_) : data(NULL), size(0), dim(dim_) 
    { 
     dim_aligned = dim_; 
     if (dim_ % AVX_VECTOR_SIZE) 
      dim_aligned = dim + AVX_VECTOR_SIZE - dim_ % AVX_VECTOR_SIZE; 
#if !defined(__CUDA_ARCH__) 
     // Determine the size of target type. 
     size_t size, *dSize; 
     CUDA_ERR_CHECK(cudaMalloc(&dSize, sizeof(size_t))); 
     deviceSizeOf<T><<<1, 1>>>(dSize); 
     CUDA_ERR_CHECK(cudaGetLastError()); 
     CUDA_ERR_CHECK(cudaDeviceSynchronize()); 
     CUDA_ERR_CHECK(cudaMemcpy(&size, dSize, sizeof(size_t), cudaMemcpyDeviceToHost)); 
     CUDA_ERR_CHECK(cudaFree(dSize)); 

     // Make sure the size of type is the same on host and on device. 
     if (size != sizeof(T)) 
     { 
      std::cerr << "Unexpected unequal sizes of type T in VectorDevice<T> on host and device" << std::endl; 
      exit(2); 
     } 
#endif 
    } 
}; 

#endif // __CUDACC__ 

int main() 
{ 
    VectorDevice<int> v(10); 

    return 0; 
}

这里，内核正在从__host__ __device__构造的主机版本调用。出人意料的是，这段代码运行时，它静静地从内核调用包装代码1退出：

(gdb) make 
nvcc -arch=sm_30 test.cu -o test -DAVX_VECTOR_SIZE=32 
(gdb) b exit 
Breakpoint 1 at 0x7ffff711b1e0: file exit.c, line 104. 
(gdb) r 
Breakpoint 1, __GI_exit (status=1) at exit.c:104 
104 exit.c: No such file or directory. 
(gdb) f 3 
#3 0x0000000000402c36 in VectorDevice<int>::VectorDevice(int)() 
(gdb) f 2 
#2 0x0000000000402cb0 in void deviceSizeOf<int>(unsigned long*)() 
(gdb) f 1 
#1 0x0000000000402ad2 in void __wrapper__device_stub_deviceSizeOf<int>(unsigned long*&)() 
(gdb) disass 
Dump of assembler code for function _Z35__wrapper__device_stub_deviceSizeOfIiEvRPm: 
    0x0000000000402abc <+0>: push %rbp 
    0x0000000000402abd <+1>: mov %rsp,%rbp 
    0x0000000000402ac0 <+4>: sub $0x10,%rsp 
    0x0000000000402ac4 <+8>: mov %rdi,-0x8(%rbp) 
    0x0000000000402ac8 <+12>: mov $0x1,%edi 
    0x0000000000402acd <+17>: callq 0x402270 <[email protected]> 
End of assembler dump.

进一步的调查表明，内核代码中没有出现的cubin，而__CUDA_ARCH__以某种方式参与到这种行为。

所以，2个问题：

1）为什么会发生这种情况？

2）如何使用__CUDA_ARCH__进行条件编译__host__ __device__与主机端内核调用相结合的代码？

谢谢！

UPDATE：同样的例子中示出了C程序设计指南的部分E.2.2.1项目2。但是，目前还不清楚处理这个问题的正确方法。

来源

2016-12-06 Dmitry Mikushin

1）为什么会发生这种情况？

它发生，因为你是踩着the specific restriction你指着编程指南：中deviceSizeOf为<int>的模板实例化必须发生时都被定义__CUDA_ARCH__当没有定义它。如果您使用限制表单，则行为未定义。

2）如何使用__CUDA_ARCH__进行条件编译__host__ __device__代码与主机端内核调用结合使用？

一个可能的办法是强制为<int>类型的内核函数的实例不管__CUDA_ARCH__宏。

你可以通过你的内核模板定义后立即添加以下行做到这一点：

template __global__ void deviceSizeOf<int>(size_t *);

当我添加该行内核定义之后，并为AVX_VECTOR_SIZE一个合适的定义（这似乎是未定义在你的AFAICT例子中），你的代码为我编译和运行正确。

来源

2016-12-07 02:27:19

谢谢，@Robert！请在下面回顾我的答案，即不使用显式模板实例化。 –

我没有尝试过，但它看起来像你的方法应该工作。 –

我发现有可能解决这个问题没有明确的模板实例：

class VectorDevice 
{ 
    T* data; 
    size_t size; 
    int dim, dim_aligned; 

    struct A 
    { 
     __host__ 
     A() 
     { 
      bool neverCalled = true; 
      if (!neverCalled) 
      { 
       deviceSizeOf<T><<<1, 1>>>(NULL); 
       CUDA_ERR_CHECK(cudaGetLastError()); 
       CUDA_ERR_CHECK(cudaDeviceSynchronize()); 
      } 
     } 
    } a; 

public : 

    __host__ __device__ 
    VectorDevice() : data(NULL), size(0), dim(0) { } 

    #pragma hd_warning_disable \ 
    #pragma nv_exec_check_disable 
    __host__ __device__ 
    VectorDevice(int dim_) : data(NULL), size(0), dim(dim_) 
    { 
     ... 
    } 

    ... 
}; 

...

来源

2016-12-07 18:20:06

__CUDA_ARCH和内核调用在host device__功能

回答

__CUDA_ARCH__和内核调用在__host__ __device__功能

回答

相关问题

__CUDA_ARCH和内核调用在host device__功能