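How sure are you that you're not getting a speedup?
I tried it both ways, array of structs and struct of arrays, compiled with gcc -O3 (gcc 4.6) on a dual quad-core Nehalem. With psize-n_dead = 200000, and running 100 iterations for better timer accuracy, I get: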
Struct of arrays (reported times are in milliseconds):
$ for t in 1 2 4 8; do export OMP_NUM_THREADS=$t; time ./foo; done
Took time 90.984000
Took time 45.992000
Took time 22.996000
Took time 11.998000
Array of structs:
$ for t in 1 2 4 8; do export OMP_NUM_THREADS=$t; time ./foo; done
Took time 58.989000
Took time 28.995000
Took time 14.997000
Took time 8.999000
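However, because the operation is so short (sub-millisecond), I didn't see any speedup until I ran 100 iterations; a single call is lost in the timer's accuracy. Also, you need a machine with good memory bandwidth to get this kind of behaviour; you're only doing ~3 FMAs and another multiplication for every two pieces of data you read in.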
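To check whether timer accuracy is the limiting factor on a given machine, one option is to measure how coarse gettimeofday actually is there. The following is a minimal sketch and was not part of the benchmark; `timer_step_us` is a hypothetical helper name. For scale, the single-thread array-of-structs total above works out to about 0.59 ms per call of update (58.989 ms / 100), and the 8-thread total to about 0.09 ms per call.

```c
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical helper: returns the smallest positive step, in microseconds,
   seen between successive gettimeofday() readings over `samples` trials.
   Returns -1 if samples <= 0. */
long timer_step_us(int samples) {
    long best = -1;
    for (int s = 0; s < samples; ++s) {
        struct timeval a, b;
        long d;
        gettimeofday(&a, NULL);
        do {
            /* spin until the reported time visibly advances */
            gettimeofday(&b, NULL);
            d = (long)(b.tv_sec - a.tv_sec) * 1000000L
              + (long)(b.tv_usec - a.tv_usec);
        } while (d <= 0);
        if (best < 0 || d < best)
            best = d;
    }
    return best;
}
```

If the step reported by something like `timer_step_us(100)` is not much smaller than the duration of one update call, timing a single call tells you very little, and repeating the call many times, as done here, is the simplest workaround.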
The code for the array-of-structs version follows.
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

/* array-of-structs layout: each particle keeps its position and velocity together */
typedef struct particle_struct {
    double pos;
    double vel;
} particle;

typedef struct simulation_struct {
    particle *particles;
    double force;
} simulation;

/* records the current time in t */
void tick(struct timeval *t) {
    gettimeofday(t, NULL);
}

/* returns the time in seconds elapsed since the time stored in t */
double tock(struct timeval *t) {
    struct timeval now;
    gettimeofday(&now, NULL);
    return (double)(now.tv_sec - t->tv_sec) + ((double)(now.tv_usec - t->tv_usec)/1000000.);
}

/* advances every particle by one step dt; the iterations are independent */
void update(simulation *s, unsigned psize, double dt) {
    #pragma omp parallel for
    for (unsigned i = 0; i < psize; ++i)
    {
        s->particles[i].pos = s->particles[i].pos + dt * s->particles[i].vel;
        s->particles[i].vel = (1 - dt*.1) * s->particles[i].vel + dt*s->force;
    }
}

void init(simulation *s, unsigned np) {
    s->force = 1.;
    s->particles = malloc(np*sizeof(particle));
    if (s->particles == NULL) {
        perror("malloc");
        exit(EXIT_FAILURE);
    }
    for (unsigned i=0; i<np; i++) {
        s->particles[i].pos = 1.;
        s->particles[i].vel = 1.;
    }
}

int main(void)
{
    const unsigned np=200000;
    simulation s;
    struct timeval clock;

    init(&s, np);
    tick(&clock);
    /* repeat the update 100 times so the total is well above the timer accuracy */
    for (int iter=0; iter<100; iter++)
        update(&s, np, 0.75);
    double elapsed=tock(&clock)*1000.;   /* seconds -> milliseconds */
    printf("Took time %lf\n", elapsed);
    free(s.particles);
    return 0;
}
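The struct-of-arrays variant that was timed is not listed here. As a rough idea of what such a layout looks like, here is a minimal sketch using the same update rule; the names `simulation_soa`, `init_soa`, `update_soa` and `free_soa` are hypothetical, and the code actually timed may have differed in detail.

```c
#include <stdlib.h>

/* Struct-of-arrays layout (illustrative): positions and velocities
   live in two separate contiguous arrays. */
typedef struct simulation_soa_struct {
    double *pos;
    double *vel;
    double force;
} simulation_soa;

/* Allocates and initialises np particles; returns 0 on success, -1 on failure. */
int init_soa(simulation_soa *s, unsigned np) {
    s->force = 1.;
    s->pos = malloc(np * sizeof *s->pos);
    s->vel = malloc(np * sizeof *s->vel);
    if (s->pos == NULL || s->vel == NULL) {
        free(s->pos);
        free(s->vel);
        s->pos = s->vel = NULL;
        return -1;
    }
    for (unsigned i = 0; i < np; i++) {
        s->pos[i] = 1.;
        s->vel[i] = 1.;
    }
    return 0;
}

/* Same arithmetic as the array-of-structs update, applied to the two arrays. */
void update_soa(simulation_soa *s, unsigned psize, double dt) {
    #pragma omp parallel for
    for (unsigned i = 0; i < psize; ++i) {
        s->pos[i] = s->pos[i] + dt * s->vel[i];
        s->vel[i] = (1 - dt*.1) * s->vel[i] + dt * s->force;
    }
}

void free_soa(simulation_soa *s) {
    free(s->pos);
    free(s->vel);
    s->pos = s->vel = NULL;
}
```

To time it, swap these in for init, update and free in the main above. In either version OpenMP has to be enabled at compile time (with gcc, the -fopenmp flag); otherwise the pragma is ignored and the loop runs on one thread regardless of OMP_NUM_THREADS.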
How large is 'psize-n_dead'? – Mysticial 2012-02-10 20:03:01
It grows over time, but it is in the thousands. Say 4000 in the simplest state, and it may reach 200,000 at most. – user1202831 2012-02-10 20:07:54