2011-02-16 78 views
5

我有以下的C程序(我的实际使用情况的简化表现出相同的行为)GCC为什么不自动矢量化这个循环?

#include <stdlib.h> 
#include <math.h> 
int main(int argc, char ** argv) { 
    const float * __restrict__ const input = malloc(20000*sizeof(float)); 
    float * __restrict__ const output = malloc(20000*sizeof(float)); 

    unsigned int pos=0; 
    while(1) { 
      unsigned int rest=100; 
      for(unsigned int i=pos;i<pos+rest; i++) { 
        output[i] = input[i] * 0.1; 
      } 

      pos+=rest;    
      if(pos>10000) { 
        break; 
      } 
    } 
} 

当我与

-O3 -g -Wall -ftree-vectorizer-verbose=5 -msse -msse2 -msse3 -march=native -mtune=native --std=c99 -fPIC -ffast-math 

编译我得到的输出

main.c:10: note: not vectorized: unhandled data-ref 

其中10是内循环的行。当我查询它为什么会这样说时,它似乎是说指针可能是别名,但它们不能在我的代码中,因为我有__restrict关键字。他们还建议包括-msse标志,但他们似乎也没有做任何事情。任何帮助?

+0

什么版本的gcc?一个可行的例子也可能是有用的,因为当我尝试使用4.4.5 – ergosys 2011-02-16 23:15:16

+0

进行向量化时,你可以发布编译的代码示例吗?当我填充了一些虚拟值时,循环被矢量化了...... – Christoph 2011-02-16 23:15:39

+0

@ergosys:他说的;) – Christoph 2011-02-16 23:16:00

回答

3

它肯定看起来像一个错误。在下文中,相同的功能,是foo()但矢量化是bar()不是,编译时对于x86-64的目标:

void foo(const float * restrict input, float * restrict output) 
{ 
    unsigned int pos; 
    for (pos = 0; pos < 10100; pos++) 
     output[pos] = input[pos] * 0.1; 
} 

void bar(const float * restrict input, float * restrict output) 
{ 
    unsigned int pos; 
    unsigned int i; 
    for (pos = 0; pos <= 10000; pos += 100) 
     for (i = 0; i < 100; i++) 
      output[pos + i] = input[pos + i] * 0.1; 
} 

添加-m32标志,编译为一个x86目标代替,导致要矢量化两种功能。

1

尝试:

const float * __restrict__ input = ...; 
float * __restrict__ output = ...; 

实验了一下周围改变事物:

#include <stdlib.h> 
#include <math.h> 

int main(int argc, char ** argv) { 

    const float * __restrict__ input = new float[20000]; 
    float * __restrict__ output = new float[20000]; 

    unsigned int pos=0; 
    while(1) { 
     unsigned int rest=100; 
     output += pos; 
     input += pos; 
     for(unsigned int i=0;i<rest; ++i) { 
      output[i] = input[i] * 0.1; 
     } 

     pos+=rest; 
     if(pos>10000) { 
      break; 
     } 
    } 
} 

g++ -O3 -g -Wall -ftree-vectorizer-verbose=7 -msse -msse2 -msse3 -c test.cpp 

test.cpp:14: note: versioning for alias required: can't determine dependence between *D.4096_24 and *D.4095_21 
test.cpp:14: note: mark for run-time aliasing test between *D.4096_24 and *D.4095_21 
test.cpp:14: note: Alignment of access forced using versioning. 
test.cpp:14: note: Vectorizing an unaligned access. 
test.cpp:14: note: vect_model_load_cost: unaligned supported by hardware. 
test.cpp:14: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 . 
test.cpp:14: note: vect_model_simple_cost: inside_cost = 2, outside_cost = 0 . 
test.cpp:14: note: vect_model_simple_cost: inside_cost = 2, outside_cost = 1 . 
test.cpp:14: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 . 
test.cpp:14: note: vect_model_store_cost: inside_cost = 1, outside_cost = 0 . 
test.cpp:14: note: cost model: Adding cost of checks for loop versioning to treat misalignment. 

test.cpp:14: note: cost model: Adding cost of checks for loop versioning aliasing. 

test.cpp:14: note: Cost model analysis: 
    Vector inside of loop cost: 8 
    Vector outside of loop cost: 6 
    Scalar iteration cost: 5 
    Scalar outside cost: 1 
    prologue iterations: 0 
    epilogue iterations: 0 
    Calculated minimum iters for profitability: 2 

test.cpp:14: note: Profitability threshold = 3 

test.cpp:14: note: Vectorization may not be profitable. 
test.cpp:14: note: create runtime check for data references *D.4096_24 and *D.4095_21 
test.cpp:14: note: created 1 versioning for alias checks. 

test.cpp:14: note: LOOP VECTORIZED. 
test.cpp:4: note: vectorized 1 loops in function. 

Compilation finished at Wed Feb 16 19:17:59 
2

它不喜欢它防止它理解内环外环格式。我可以让它向量化,如果我只是把它折叠成一个单一的循环:

#include <stdlib.h> 
#include <math.h> 
int main(int argc, char ** argv) { 
    const float * __restrict__ input = malloc(20000*sizeof(float)); 
    float * __restrict__ output = malloc(20000*sizeof(float)); 

    for(unsigned int i=0; i<=10100; i++) { 
      output[i] = input[i] * 0.1f; 
    } 
} 

(请注意,我并没有想太多难以有关如何在POS +限休息正确转换成一个单一的循环条件,它可能是错误的)

你可能可以利用这个优势,把一个简化的内部循环放入一个你用指针和计数调用的函数中。即使再次内联,它也可以正常工作。假设你删除了我刚刚简化过的while()循环的部分内容,但需要保留。