2013-01-18 44 views

OpenMP parallel for construct performance

First approach (parallelizing the inner loop):

for (j = 0; j < LATTICE_VW; ++j) {
    x = j*DX + LATTICE_W;
    #pragma omp parallel for ordered private(y, prob)
    for (i = 0; i < LATTICE_VH; ++i) {
        y = i*DY + LATTICE_S;
        prob = psi[i][j].norm();

        #pragma omp ordered
        out << x << " " << y << " " << prob << endl;
    }
}

Second approach (parallelizing the outer loop):

#pragma omp parallel for ordered private(x, y, prob)
for (j = 0; j < LATTICE_VW; ++j) {
    x = j*DX + LATTICE_W;
    for (i = 0; i < LATTICE_VH; ++i) {
        y = i*DY + LATTICE_S;
        prob = psi[i][j].norm();

        #pragma omp ordered
        out << x << " " << y << " " << prob << endl;
    }
}

Third approach (parallelizing the collapsed loops):

#pragma omp parallel for collapse(2) ordered private(x, y, prob)
for (j = 0; j < LATTICE_VW; ++j) {
    for (i = 0; i < LATTICE_VH; ++i) {
        x = j*DX + LATTICE_W;
        y = i*DY + LATTICE_S;
        prob = psi[i][j].norm();

        #pragma omp ordered
        out << x << " " << y << " " << prob << endl;
    }
}

If I had to guess, I would say that approach 3 should be the fastest.

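However, approach 1 is the fastest, while the second and third both take about the same amount of time as if there were no parallelization at all. Why does this happen?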

Are you getting correct output from approach 2? The variables 'y' and 'prob' should be private as well. – Novelocrat

Sorry, they are private there. I just edited it. – lexsintra

What are the trip counts of the inner and outer loops? –

Answer

Look at this:

for (int x = 0; x < 4; ++x)
    #pragma omp parallel for ordered
    for (int y = 0; y < 4; ++y)
        #pragma omp ordered
        cout << x << ',' << y << " (by thread " << omp_get_thread_num() << ')' << endl;

You get:

0,0 (by thread 0) 
0,1 (by thread 1) 
0,2 (by thread 2) 
0,3 (by thread 3) 
1,0 (by thread 0) 
1,1 (by thread 1) 
1,2 (by thread 2) 
1,3 (by thread 3) 

Each thread only has to wait for the preceding cout; all the work before that point can be done in parallel. But with:

#pragma omp parallel for ordered
for (int x = 0; x < 4; ++x)
    for (int y = 0; y < 4; ++y)
        #pragma omp ordered
        cout << x << ',' << y << " (by thread " << omp_get_thread_num() << ')' << endl;

and

#pragma omp parallel for collapse(2) ordered
for (int x = 0; x < 4; ++x)
    for (int y = 0; y < 4; ++y)
        #pragma omp ordered
        cout << x << ',' << y << " (by thread " << omp_get_thread_num() << ')' << endl;

you get:

0,0 (by thread 0) 
0,1 (by thread 0) 
0,2 (by thread 0) 
0,3 (by thread 0) 
1,0 (by thread 1) 
1,1 (by thread 1) 
1,2 (by thread 1) 
1,3 (by thread 1) 
2,0 (by thread 2) 
2,1 (by thread 2) 
2,2 (by thread 2) 
2,3 (by thread 2) 
3,0 (by thread 3) 
3,1 (by thread 3) 
3,2 (by thread 3) 
3,3 (by thread 3) 

So thread 1 has to wait for thread 0 to finish all of its work before it can cout for the first time, and almost nothing can be done concurrently.

Try adding schedule(static,1) to the collapse version; it should perform at least as well as the first version.