使用可变范围时，循环未被矢量化

如何使版本A向量化并保持变量范围（不使用文字范围）？

嵌套循环是用于乘法与广播，如在Python和MATLAB的numpy库。 numpy图书馆的广播描述是here。

版A码（没有的std ::向量。无向量化。）

此只使用imull (%rsi), %edx在.L169，这不是一个SIMD指令。

#include <iostream> 
#include <stdint.h> 
typedef int32_t DATA_TYPE; 
template <size_t N> 
size_t cal_size(size_t (&Ashape)[N]){ 
    size_t size = 1; 
    for(size_t i = 0; i < N; ++i) size *= Ashape[i]; 
    return size; 
} 
template <size_t N> 
size_t * cal_stride(size_t (&Ashape)[N]) { 
    size_t size = cal_size(Ashape); 

    size_t * Astride = new size_t[N]; 
    Astride[0] = size/Ashape[0]; 
    for(size_t i = 1; i < N; ++i){ 
     Astride[i] = Astride[i-1]/Ashape[i]; 
    } 
    return Astride; 
} 
template <size_t N> 
DATA_TYPE * init_data(size_t (&Ashape)[N]){ 
    size_t size = cal_size(Ashape); 
    DATA_TYPE * data = new DATA_TYPE[size]; 
    for(size_t i = 0; i < size; ++i){ 
     data[i] = i + 1; 
    } 
    return data; 
} 

template <size_t N> 
void print_data(DATA_TYPE * Adata, size_t (&Ashape)[N]){ 
    size_t size = cal_size(Ashape); 
    for(size_t i = 0; i < size; ++i){ 
     std::cout << Adata[i] << ", "; 
    } 
    std::cout << std::endl; 
} 

int main(void){ 
    constexpr size_t nd = 3; 
    size_t Ashape[] = {20,3,4}; 
    size_t Bshape[] = {1,3,1}; 
    auto Astride = cal_stride(Ashape); 
    auto Bstride = cal_stride(Bshape); 

    auto Adata = init_data(Ashape); 
    auto Bdata = init_data(Bshape); 

    size_t c[nd] = {0,0,0}; 
     ///counter 
    size_t hint[nd] = {0,2,0}; 
     //hint tells which are the broadcasting axes. 
    size_t A_i, B_i; 
    for(c[0] = 0; c[0] < Ashape[0]; ++c[0]){ // Ashape as hint[0] = 0 
     for(c[1] = 0; c[1] < Bshape[1]; ++c[1]){ //Bshape as hint[1] = 2 
      for(c[2] = 0; c[2] < Ashape[2];++c[2]){ //Asape as hint[2] = 0 
       A_i = c[0]*Astride[0] + c[1]*Astride[1] + c[2]*Astride[2]; 
       B_i = c[1]*Bstride[1]; 
       Adata[A_i] *= Bdata[B_i]; 
      } 
     } 
    } 
    print_data(Adata, Ashape); 
}

版本B代码（没有的std ::向量。字面盘区，并且这向量化）

这.L20使用pmulld %xmm3, %xmm2，这是SIMD指令。

gcc godbolt

#include <iostream> 
#include <stdint.h> 
typedef int32_t DATA_TYPE; 


void print_data(DATA_TYPE * Adata, size_t size){ 
    for(size_t i = 0; i < size; ++i){ 
     std::cout << Adata[i] << ", "; 
    } 
    std::cout << std::endl; 
} 

int main(void){ 
    int32_t Adata[240]; 
    int32_t Bdata[3]; 
    size_t A_i, B_i, i,j,k; 
    for(i = 0; i < 20; ++i){ 
     for(j = 0; j < 3; ++j){ 
      for(k = 0; k < 4;++k){ 
       A_i = i*12 + j*4 + k*1; 
       B_i = j*1; 
       Adata[A_i] *= Bdata[B_i]; 
      } 
     } 
    } 
    print_data(Adata, 240); 
}

提升多阵列矢量化，但为什么呢？不确定它是否使用SIMD对齐存储器。

gcc godbolt

#include "boost/multi_array.hpp" 
#include <iostream> 

int 
main() { 
    // Create a 3D array that is 3 x 4 x 2 
    int d1,d2,d3; 
    typedef boost::multi_array<int, 3> array_type; 
    typedef array_type::index index; 
    array_type A(boost::extents[d1][d2][d3]); 
    array_type B(boost::extents[1][d2][1]); 
    // Assign values to the elements 

    for(index i = 0; i != d1; ++i) 
    for(index j = 0; j != d2; ++j) 
     for(index k = 0; k != d3; ++k) 
     A[i][j][k] *= B[0][j][0]; 

    for(index i = 0; i != d1; ++i) 
    for(index j = 0; j != d2; ++j) 
     for(index k = 0; k != d3; ++k) 
     std::cout << A[i][j][k]; 
    return 0; 
}

2004 pdf在描述GCC的一些循环优化gcc.gnu.org。我希望“符号Chrecs”（对应于未分析的变量）意味着gcc可以将嵌套循环与可变范围融合。

最后的手段是与元编程实现循环融合。

来源

2016-09-28 hamster on wheels

他编译器可以向量化可变范围的循环，如Richard Hodges对我之前问题的回答所述http://stackoverflow.com/questions/39724268/can-compiler-optimize-loop-with-variable-length –

可能是我下次应该尝试Fortran。或者在make文件中指定数组的维数，并在每次更改数组的维度时重新编译程序......这意味着所有循环都将使用文本范围。 –

为了测试依赖关系之间内和跨循环迭代存储器访问，阵列索引（在编译的多面体模型），应具有的形式i*c0 + j*c1 + k*c2 + ...，其中i, j, k ...是循环计数器和c0, c1, c2 ...是积分常数。

在你的例子中，显然问题是c[2] * Astride[2]，因为Astride[2]不是一个常量。

实际上，离开表达

A_i = c[0]*Astride[0] + c[1]*Astride[1] + c[2];

允许环被矢量化。

https://godbolt.org/g/gOUVfE

（编辑：注意X = Adata + c[0]*Astride[0] + c[1]*Astride[1]是循环不变，所以内部循环之内，我们刚才阵列X和索引c[0]）。

现在，问题是如何在AStride[2]处滴一个常量。幸运的是，从你在cal_stride的计算看来，最后一个总是1.所以我们应该完成。

来源

2016-09-30 12:57:43 chill

确实只有最内层的循环是矢量化的吗？ –

使用可变范围时，循环未被矢量化

回答

相关问题