STL Thrust multiple vector transform?

I was wondering whether there is a more efficient way to write a = a + b + c?

thrust::transform(b.begin(), b.end(), c.begin(), b.begin(), thrust::plus<int>()); 
thrust::transform(a.begin(), a.end(), b.begin(), a.begin(), thrust::plus<int>());

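This works, but is there a way to get the same effect with just one line of code? I looked at the saxpy implementation in the examples, but that uses two vectors and a constant value.

Would this be more efficient?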

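#include <iostream>
#include <thrust/for_each.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

struct arbitrary_functor
{
    template <typename Tuple>
    __host__ __device__
    void operator()(Tuple t)
    {
        // D[i] = A[i] + B[i] + C[i], where D is the fourth tuple element
        thrust::get<3>(t) = thrust::get<0>(t) + thrust::get<1>(t) + thrust::get<2>(t);
    }
};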


int main(){ 

    // allocate storage 
    thrust::host_vector<int> A; 
    thrust::host_vector<int> B; 
    thrust::host_vector<int> C; 

    // initialize input vectors 
    A.push_back(10); 
    B.push_back(10); 
    C.push_back(10); 

    // apply the transformation 
    thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple(A.begin(), B.begin(), C.begin(), A.begin())), 
        thrust::make_zip_iterator(thrust::make_tuple(A.end(), B.end(), C.end(), A.end())), 
        arbitrary_functor()); 

    // print the output 
    std::cout << A[0] << std::endl;

    return 0; 
}


2011-09-22 Sharpie

This looks good to me. –

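a = a + b + c has low arithmetic intensity (only two arithmetic operations for every four memory operations), so the computation will be bound by memory bandwidth. To compare the efficiency of your proposed solutions, we need to measure their bandwidth demands.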

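In the first solution, each call to transform requires two loads and one store for each invocation of plus. We can therefore model the cost of each transform call as 3N, where N is the size of the vectors a, b, and c. Since there are two calls to transform, the cost of this solution is 6N.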

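We can model the cost of the second solution in the same way. Each invocation of arbitrary_functor requires three loads and one store, so the cost model for this solution is 4N. This implies that the for_each solution should be more efficient than calling transform twice. When N is large, the second solution should run about 6N/4N = 1.5x faster than the first.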
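The 6N versus 4N count can be checked with a small host-side sketch. This is plain C++ rather than Thrust, and the names traffic, two_pass, and fused are hypothetical helpers introduced only for illustration. It counts element loads and stores under the same model as above and does not measure real bandwidth.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical counter for element-sized memory operations.
struct traffic
{
    std::size_t loads = 0;
    std::size_t stores = 0;
    std::size_t total() const { return loads + stores; }
};

// Mirrors the two transform calls: b = b + c, then a = a + b.
traffic two_pass(std::vector<int>& a, std::vector<int>& b, const std::vector<int>& c)
{
    assert(a.size() == b.size() && b.size() == c.size());
    traffic t;
    for (std::size_t i = 0; i < b.size(); ++i) {
        b[i] = b[i] + c[i];          // 2 loads, 1 store
        t.loads += 2;
        t.stores += 1;
    }
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] = a[i] + b[i];          // 2 loads, 1 store
        t.loads += 2;
        t.stores += 1;
    }
    return t;
}

// Mirrors the fused functor: a = a + b + c in a single pass.
traffic fused(std::vector<int>& a, const std::vector<int>& b, const std::vector<int>& c)
{
    assert(a.size() == b.size() && b.size() == c.size());
    traffic t;
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] = a[i] + b[i] + c[i];   // 3 loads, 1 store
        t.loads += 3;
        t.stores += 1;
    }
    return t;
}
```

Both functions leave the same values in a, but the two-pass version also overwrites b as a side effect, just as the first transform call in the question does.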

Of course, you could always combine zip_iterator with transform in a similar way to avoid two separate calls to transform.


2011-09-25 03:59:45

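This is a very elegant analysis, but I can't help wondering how expensive the zip iterator is (I use it a lot, but I have no feel for how it works or how it performs). Does that have any impact here? – talonmies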

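zip_iterator can indeed increase a kernel's footprint, because each zipped iterator requires register resources. In this example, A is included in the zip twice: once as a source and once as a destination. A slightly leaner solution could pass it only once, but given that arbitrary_functor is so simple, it is unlikely to make a difference. –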
