优化并重写以下C代码

这是一个教科书问题，涉及重写某些C代码以使其在给定处理器体系结构上表现最佳。优化并重写以下C代码

鉴于：针对具有4个加法器和2个乘法器单元的单个超标量处理器。

输入结构（在别处初始化）：

struct s { 
    short a; 
    unsigned v; 
    short b; 
} input[100];

下面是对这个数据进行操作的例程。显然必须确保正确性，但目标是优化它的垃圾。

int compute(int x, int *r, int *q, int *p) { 

    int i; 
    for(i = 0; i < 100; i++) { 

     *r *= input[i].v + x; 
     *p = input[i].v; 
     *q += input[i].a + input[i].v + input[i].b; 
    } 

    return i; 
}

所以这个方法有3个算术指令来更新整数r，q，p。

这是我尝试用注释解释我在想什么：

//Use temp variables so we don't keep using loads and stores for mem accesses; 
//hopefully the temps will just be kept in the register file 
int r_temp = *r; 
int q_temp = *q; 

for (i=0;i<99;i = i+2) { 
    int data1 = input[i]; 
    int data2 = input[i+1]; //going to try partially unrolling loop 
    int a1 = data1.a; 
    int a2 = data2.a; 
    int b1 = data1.b; 
    int b2 = data2.b; 
    int v1 = data1.v; 
    int v2 = data2.v; 

    //I will use brackets to make my intention clear the order of operations I was planning 
    //with respect to the functional (adder, mul) units available 

    //This is calculating the next iteration's new q value 
    //from q += v1 + a1 + b1, or q(new)=q(old)+v1+a1+b1 

    q_temp = ((v1+q1)+(a1+b1)) + ((a2+b2)+v2); 
    //For the first step I am trying to use a max of 3 adders in parallel, 
    //saving one to start the next computation 

    //This is calculating next iter's new r value 
    //from r *= v1 + x, or r(new) = r(old)*(v1+x) 

    r_temp = ((r_temp*v1) + (r_temp*x)) + (v2+x); 
} 
//Because i will end on i=98 and I only unrolled by 2, I don't need to 
//worry about final few values because there will be none 

*p = input[99].v; //Why it's in the loop I don't understand, this should be correct 
*r = r_temp; 
*q = q_temp;

好了，所以没有什么我的解决方案给我？查看旧代码，我的每个循环迭代将采用max（（1A + 1M），（3A））的最小延迟，其中前一个值用于计算新的r，而3个添加的延迟是q。

在我的解决方案中，我展开2并尝试计算r和q的第二个新值。假设加法器/乘法器的等待时间是M = c * A（c是大于1的整数），r的乘法运算绝对是限速步骤，所以我把重点放在这一点上。我尽可能地并行使用乘法器。

在我的代码中，首先并行使用2个乘法器来帮助计算中间步骤，然后一个加法必须组合这些乘法器，然后使用最终乘法获得最后一个结果。因此，对于2个新的r值（尽管我只保留/关心最后一个），它需要我（1M // 1M // 1A）+ 1A + 1M，总共延迟为2M + 1M。除以2，我的每回路延迟值为1M + 0.5A。我计算原始延迟/值（对于r）为1A + 1M。所以如果我的代码是正确的（我手工完成了这个，还没有测试过！），那么我的性能提升很小。另外，希望不要直接在循环中直接访问和更新指针（主要归功于临时变量r_temp和q_temp），我可以节省一些加载/存储延迟。

这是我的刺伤。绝对有兴趣看到别人提出的更好！

来源

2012-09-11 JDS

是否允许将数组布局从数组结构更改为数组结构？ – Mysticial

据我所知，输入结构保持原样。我不知道的是，是否有可能利用两个字段类型为“short”的事实来帮助优化。 – JDS

我不知道，但切换到结构的数组可能会使其可矢量化。 – Mysticial

是的，可以利用这两个短裤。重新排列你的结构是这样

struct s { 
    unsigned v; 
    short a; 
    short b; 
} input[100];

，你可能能够得到您的体系结构的存储领域，这可能让更多的这些结构的趴在同一个内存页，这可能让你遇到更好的一致性更少的内存页面故障。

这都是推测，这就是为什么它是如此重要的配置文件。

如果你有正确的体系结构，重新排列会给你更好的data structure alignment,，这会导致内存中更高的数据密度，因为为了确保类型与通用存储器体系结构强加的数据边界对齐，需要填充的位会减少。

来源

2012-09-11 18:42:57

'short'保证小于或等于'int'，所以重新排列绝对是一个好主意 - 它永远不会受到伤害。 –

一般的经验法则可能不完美，但优于没有经验法则的是从最长到最短列出结构类型。它倾向于确保比随机排序更好的包装。 –

这是比较微妙的，我没有考虑，谢谢。 – JDS

优化并重写以下C代码

回答

相关问题