Speeding up a __device__ function in a Thrust comparison operator

I'm running a Thrust parallel binary-search-type routine on an array:

// array and array2 are raw pointers to device memory 
thrust::device_ptr<int> array_ptr(array); 

// Search for first position where 0 could be inserted in array 
// without violating the ordering 
thrust::device_vector<int>::iterator iter; 
iter = thrust::lower_bound(array_ptr, array_ptr+length, 0, cmp(array2));

A custom function object cmp defines the custom comparison operator:

struct cmp 
{ 
    cmp(int *array2){ this->array2 = array2; } 

    __device__ bool operator()(const int& x, const int& y) 
    { 
     return device_function(array2,x) <= device_function(array2,y); 
    } 

    int *array2; 
};

The comparison relies on a call to a function compiled for the device:

__device__ int device_function(const int* array2, const int value){ 
    int quantity = 0; 

    for (int i = 0; i < 50000; ++i){ 
     if (array2[i] > value){ quantity += array2[i]; } 
    } 

    return quantity; 
}

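My question is: what (if any) parallel execution is done on the device for the sum reduction in device_function? If the function executes serially as such, how can I introduce parallelism to speed up the function evaluation?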

Asked by lodhb, 2015-02-09

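My question is: what (if any) parallel execution is done on the device for the sum reduction in device_function?

None. Ordinary C/C++ code in a __device__ function (whether in CUDA or Thrust) executes sequentially, in the context of a single CUDA thread.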
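
To make the "one call, one thread" point concrete, here is a minimal sketch, assuming a device with a working Thrust installation. The names sum_greater, sum_greater_op and evaluate_all are hypothetical, and the length parameter n is an assumption (the question hard-codes 50000). Each call to the functor runs its loop serially inside one thread; the only parallelism comes from Thrust evaluating many calls side by side.

```cuda
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>

// Same serial loop as device_function in the question, with the length
// passed in (assumption: the original hard-codes 50000).
__host__ __device__ inline int sum_greater(const int* data, int n, int value)
{
    int quantity = 0;
    for (int i = 0; i < n; ++i) {
        if (data[i] > value) { quantity += data[i]; }
    }
    return quantity;
}

// Hypothetical functor: every invocation runs the whole loop in one thread.
struct sum_greater_op
{
    const int* data;
    int n;
    sum_greater_op(const int* data_, int n_) : data(data_), n(n_) {}

    __host__ __device__ int operator()(int value) const
    {
        return sum_greater(data, n, value);
    }
};

// Evaluates the loop once per entry of `values`. thrust::transform hands
// the entries to different threads, so the calls run concurrently even
// though each individual call is sequential.
thrust::host_vector<int> evaluate_all(const thrust::host_vector<int>& data,
                                      const thrust::host_vector<int>& values)
{
    thrust::device_vector<int> d_data = data;
    thrust::device_vector<int> d_values = values;
    thrust::device_vector<int> d_out(values.size());

    sum_greater_op op(thrust::raw_pointer_cast(d_data.data()),
                      static_cast<int>(d_data.size()));
    thrust::transform(d_values.begin(), d_values.end(), d_out.begin(), op);

    thrust::host_vector<int> out = d_out;  // copy results back to the host
    return out;
}
```

A binary search only probes a handful of elements, so this kind of across-call parallelism does not help much there; it is shown only to illustrate where the serial boundary lies.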

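If the function executes serially as such, how can I introduce parallelism to speed up the function evaluation?

One possible approach is to use Thrust v1.8 (available from GitHub or in the CUDA 7 RC) and place an appropriate Thrust function inside the functor (cmp) that you pass to thrust::lower_bound.

A worked example (linked in the original answer) uses thrust::sort from within a custom functor passed to another Thrust function.

Parallelization by this method requires compiling for, and running on, a device that supports CUDA Dynamic Parallelism. As with any CUDA Dynamic Parallelism code, there is no guarantee of an overall speed-up. Whether this extra level of parallelism brings any benefit depends on many factors, such as whether the existing level of parallelism already makes maximal use of the device.

For example purposes, the functionality in your device_function appears to be replaceable by a single call to thrust::transform_reduce. Your cmp functor could then be rewritten as something like this (coded in the browser, not tested):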

struct cmp 
{ 
    cmp(int *array2){ this->array2 = array2; } 

    __device__ bool operator()(const int& x, const int& y) 
    { 
     return thrust::transform_reduce(thrust::device, array2, array2+50000,
                                     my_greater_op(x), 0, thrust::plus<int>())
         <= thrust::transform_reduce(thrust::device, array2, array2+50000,
                                     my_greater_op(y), 0, thrust::plus<int>());
    } 

    int *array2; 

};

You would have to provide a suitable my_greater_op functor:

struct my_greater_op 
{ 
    int val; 
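    // Constructed inside cmp's __device__ operator(), so it must be callable on the device.
    __host__ __device__ my_greater_op(int _val) {val = _val;} 
    __host__ __device__ int operator()(const int& x) const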
    { 
    return (x>val)?x:0; 
    } 
};
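
The claim that thrust::transform_reduce can stand in for the loop in device_function can be checked in isolation. The sketch below is launched from the host, so it does not need Dynamic Parallelism, and it says nothing about how the call behaves when nested inside cmp. The helper names sum_greater_parallel and sum_greater_serial are hypothetical, and the array length is taken from the vector rather than hard-coded.

```cuda
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/host_vector.h>
#include <thrust/transform_reduce.h>

// Maps x to itself when it exceeds val, otherwise to 0.
struct my_greater_op
{
    int val;
    __host__ __device__ explicit my_greater_op(int val_) : val(val_) {}

    __host__ __device__ int operator()(const int& x) const
    {
        return (x > val) ? x : 0;
    }
};

// Hypothetical helper: parallel sum of all elements greater than `value`,
// launched from the host over a device vector.
int sum_greater_parallel(const thrust::device_vector<int>& d_array2, int value)
{
    return thrust::transform_reduce(d_array2.begin(), d_array2.end(),
                                    my_greater_op(value), 0,
                                    thrust::plus<int>());
}

// Serial reference with the same loop body as device_function.
int sum_greater_serial(const thrust::host_vector<int>& array2, int value)
{
    int quantity = 0;
    for (std::size_t i = 0; i < array2.size(); ++i) {
        if (array2[i] > value) { quantity += array2[i]; }
    }
    return quantity;
}
```

Both helpers should return the same value for any input, provided the int sum does not overflow.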

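Answered 2015-02-09 21:13:31

I just tried this, and it is actually slower than my original device_function! For some reason, the algorithm generates a grid with a single block and a single thread. Any idea why it behaves this way? – lodhb 2015-04-25 14:09:58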
