C++ OpenMP: Writing to a matrix inside a for loop significantly slows down the loop

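I have the following code. The bitCount function simply counts the number of set bits in a 64-bit integer. The test function is a simplified version of something I am doing in more complex code; in it I tried to reproduce how writing to a matrix significantly slows down a for loop. I am trying to figure out why this happens and whether there is a solution.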

#include <cmath>
#include <cstdint>
#include <vector>
#include <omp.h>

// Count the number of set bits
inline int bitCount(uint64_t n){

    int count = 0;

    while(n){
        n &= (n-1); // clear the lowest set bit
        count++;
    }

    return count;

}


void test(){

    int nthreads = omp_get_max_threads();
    omp_set_dynamic(0);
    omp_set_num_threads(nthreads);

    // I need a priority queue per thread
    std::vector<std::vector<double> > mat(nthreads, std::vector<double>(1000,-INFINITY));
    std::vector<uint64_t> vals(100,1);

    #pragma omp parallel for shared(mat,vals)
    for(int i = 0; i < 100000000; i++){
        std::vector<double> &tid_vec = mat[omp_get_thread_num()];
        int total_count = 0;
        for(unsigned int j = 0; j < vals.size(); j++){
            total_count += bitCount(vals[j]);
            tid_vec[j] = total_count; // if I comment out this line, performance increases drastically
        }
    }

}

This code runs in about 11 seconds. If I comment out the following line:

tid_vec[j] = total_count;

the code runs in about 2 seconds. Why is writing to the matrix so expensive in my case?

Source

2017-02-23 Cauchy

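Depending on your compiler and options, the inner-loop reduction may get SIMD-vectorized once the serializing store is removed. – tim18

Without the store, the for loop doesn't do anything at all. Maybe it is being optimized away? –

If you want a specific answer rather than just guesses, you have to provide details about the compiler version, options, hardware, and a [mcve]. Also note that "bitcount" is widely known as "popcnt" and has been optimized to oblivion. – Zulan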

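Since you didn't mention your compiler or system specs, I am assuming that you are compiling with GCC and the flags -O2 -fopenmp.

If you comment out the line: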

tid_vec[j] = total_count;

the compiler will optimize away every computation whose result is not used. Therefore:

total_count += bitCount(vals[j]);

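is optimized away as well. Since the main kernel of your application is no longer executed, it makes sense that the program runs much faster.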
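One way to check this explanation is to time a variant that drops the store but still consumes total_count, so the work cannot be thrown away as dead code. The following is only a sketch under assumptions: count_without_store and bit_count are hypothetical names that do not appear in the question, and the reduction clause is just one way to keep the result live. An optimizer may still hoist the inner loop, because vals does not depend on i, so the timings stay compiler-dependent.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Kernighan-style bit count, same idea as bitCount in the question.
inline int bit_count(std::uint64_t n) {
    int count = 0;
    while (n) {
        n &= (n - 1);  // clear the lowest set bit
        ++count;
    }
    return count;
}

// Hypothetical helper: the kernel without the store to the matrix.
// The accumulated value is returned, so the computation has an
// observable result and cannot be removed as unused.
long long count_without_store(const std::vector<std::uint64_t>& vals,
                              int iterations) {
    assert(iterations >= 0);  // precondition of this sketch
    long long sink = 0;
    #pragma omp parallel for reduction(+:sink)
    for (int i = 0; i < iterations; ++i) {
        int total_count = 0;
        for (std::size_t j = 0; j < vals.size(); ++j) {
            total_count += bit_count(vals[j]);
        }
        sink += total_count;  // each thread adds to its private copy
    }
    return sink;
}
```

If this variant takes roughly as long as the version with the store, the store itself was probably not the expensive part.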

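On another note, I would not implement a bit-count function myself but rather rely on functionality that is already provided to you. For example, the GCC builtin functions include __builtin_popcount, which does exactly what you are trying to do. Note that __builtin_popcount takes an unsigned int; for 64-bit values such as uint64_t, the matching variant is __builtin_popcountll.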
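As a rough illustration of that suggestion, the sketch below puts the hand-written loop next to the builtin. It assumes GCC or Clang (the __builtin_ functions are compiler extensions), and popcount_loop, popcount_builtin, and check_popcount are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <cstdint>

// Hand-written bit count, equivalent to bitCount in the question.
inline int popcount_loop(std::uint64_t n) {
    int count = 0;
    while (n) {
        n &= (n - 1);  // clear the lowest set bit
        ++count;
    }
    return count;
}

// GCC/Clang builtin for unsigned long long, which is at least 64 bits wide.
// Plain __builtin_popcount takes unsigned int and would silently drop
// the upper bits of a 64-bit argument.
inline int popcount_builtin(std::uint64_t n) {
    return __builtin_popcountll(n);
}

// Sanity check: both implementations must agree on the given value.
inline void check_popcount(std::uint64_t n) {
    assert(popcount_loop(n) == popcount_builtin(n));
}
```

With a suitable target flag (for example -mpopcnt or -march=native on x86), the builtin can typically compile down to a single popcnt instruction. In C++20, std::popcount from the header bit is a portable alternative.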

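As a bonus: working on private data is better than working on a shared array in which each thread uses different elements. It improves locality (which is especially important when memory access is non-uniform, i.e. NUMA) and may reduce access contention.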

#pragma omp parallel shared(mat,vals)
{
    // Each thread works on its own private vector
    std::vector<double> local_vec(1000,-INFINITY);
    #pragma omp for
    for(int i = 0; i < 100000000; i++) {
        int total_count = 0;
        for(unsigned int j = 0; j < vals.size(); j++){
            total_count += bitCount(vals[j]);
            local_vec[j] = total_count;
        }
    }
    // Copy local_vec to mat[omp_get_thread_num()]
    mat[omp_get_thread_num()] = local_vec;
}
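For anyone who wants to try the private-buffer idea in isolation, here is a self-contained sketch of it. It rests on several assumptions: fill_per_thread, max_threads, and thread_id are hypothetical names, the wrappers exist only so the code also builds without -fopenmp, schedule(static) is my choice rather than anything from the question, and __builtin_popcountll requires GCC or Clang.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical wrappers: fall back to a single thread without OpenMP.
inline int max_threads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

inline int thread_id() {
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

// Each thread fills a private vector and copies it into its own row of
// the shared matrix once, after the work-sharing loop has finished.
std::vector<std::vector<double>> fill_per_thread(
        const std::vector<std::uint64_t>& vals, int iterations) {
    const std::size_t width = 1000;
    assert(vals.size() <= width);  // the private buffer must hold every prefix sum
    std::vector<std::vector<double>> mat(
        max_threads(), std::vector<double>(width, -INFINITY));

    #pragma omp parallel shared(mat, vals)
    {
        std::vector<double> local_vec(width, -INFINITY);
        #pragma omp for schedule(static)
        for (int i = 0; i < iterations; ++i) {
            int total_count = 0;
            for (std::size_t j = 0; j < vals.size(); ++j) {
                total_count += __builtin_popcountll(vals[j]);
                local_vec[j] = total_count;  // running sum of set bits
            }
        }
        // Every thread writes a different row, so no synchronization is needed.
        mat[thread_id()] = local_vec;
    }
    return mat;
}
```

Threads that receive no iterations leave their row at the initial -INFINITY values, which may or may not be what the real application wants.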

Source

2017-02-24 10:58:03
