犰狳C++：如何高效地分配到子域

我想将矩阵的值分配给另一个矩阵的子域作为A.submat(ni1, ni2, nk1, nk2) = B;看起来非常慢。我想知道为什么它如此缓慢，有什么方法可以改善它吗？犰狳C++：如何高效地分配到子域

这里是我的测试代码（因为函数“XForwarDifference”需要被调用数以百万计的时间在我的项目，我需要更好地描述文件时）

#include <armadillo> 
#include <chrono> 
#include <iostream> 
using ms = std::chrono::milliseconds; 
using ns = std::chrono::nanoseconds; 
using get_time = std::chrono::steady_clock; 

namespace { 
    const arma::ivec::fixed<5> iforward = {-1, 0, 1, 2, 3}; 
    const double MCA_1 = -0.30874; 
    const double MCA0 = -0.6326; 
    const double MCA1 = 1.2330; 
    const double MCA2 = -0.3334; 
    const double MCA3 = 0.04168; 
} 

arma::mat XForwardDifference(arma::mat& mat, 
          const int& ni1, 
          const int& ni2, 
          const int& nk1, 
          const int& nk2, 
          const double& dx) 
{ 
    arma::mat ret(size(mat)); 
    double sign_dx = 1./dx; 
    double mca_1= sign_dx * MCA_1; 
    double mca0 = sign_dx * MCA0; 
    double mca1 = sign_dx * MCA1; 
    double mca2 = sign_dx * MCA2; 
    double mca3 = sign_dx * MCA3; 
    auto t1 = get_time::now(); 
    auto m1 = mat.submat(ni1+iforward(0), nk1, ni2+iforward(0), nk2); 
    auto t2 = get_time::now(); 
    auto m2 = mca_1 * m1; 
    auto t3 = get_time::now(); 
    auto m3 = m2 + m2; 
    auto t4 = get_time::now(); 
    mat.submat(ni1, nk1, ni2, nk2) = m3; 
    auto t5 = get_time::now(); 
    std::cout << std::chrono::duration_cast<ns>(t2-t1).count() << std::endl; 
    std::cout << std::chrono::duration_cast<ns>(t3-t2).count() << std::endl; 
    std::cout << std::chrono::duration_cast<ns>(t4-t3).count() << std::endl; 
    std::cout << std::chrono::duration_cast<ns>(t5-t4).count() << std::endl; 


    // ret.submat(ni1, nk1, ni2, nk2) = 
    // mca_1* mat.submat(ni1+iforward(0), nk1, ni2+iforward(0), nk2) + 
    // mca0 * mat.submat(ni1+iforward(1), nk1, ni2+iforward(1), nk2) + 
    // mca1 * mat.submat(ni1+iforward(2), nk1, ni2+iforward(2), nk2) + 
    // mca2 * mat.submat(ni1+iforward(3), nk1, ni2+iforward(3), nk2) + 
    // mca3 * mat.submat(ni1+iforward(4), nk1, ni2+iforward(4), nk2); 
    return ret; 
} 


int main(int argc, char *argv[]) 
{ 
    const int len = 3; 
    int ni1, ni2, nk1, nk2, ni, nk; 
    ni = 200; 
    nk = 200; 
    ni1 = len; 
    ni2 = ni1 + ni - 1; 
    nk1 = len; 
    nk2 = nk1 + nk - 1; 
    const double dx = 1.; 
    auto start_time = get_time::now(); 
    arma::mat mat(ni + 2*len, nk + 2*len); 
    mat = XForwardDifference(mat, ni1, ni2, nk1, nk2, dx); 
    auto end_time = get_time::now(); 
    auto diff = end_time - start_time; 
    std::cout << "Elapsed time is : " 
      << std::chrono::duration_cast<ns>(diff).count() 
      << " ns " 
      << std::endl; 
    return 0; 
}

输出是：

180 
116 
110 
851123 
Elapsed time is : 961975 ns

您可以看到mat.submat(ni1, nk1, ni2, nk2) = m3;涵盖了大部分已用时间。

hbrerkere给出的理由：

，直到结果被分配到矩阵或子矩阵犰狳排队的所有操作。这就是为什么分配给submat似乎需要更长时间。它实际上是在分配时间进行乘法和加法，而不是之前。另外，不要在Armadillo矩阵和表达式中使用auto关键字，因为这可能会导致问题。 - hbrerkere

auto t1 = get_time::now(); 
    arma::mat m1 = mat.submat(ni1+iforward(0), nk1, ni2+iforward(0), nk2); 
    auto t2 = get_time::now(); 
    arma::mat m2 = mca_1 * m1; 
    auto t3 = get_time::now(); 
    arma::mat m3 = m2 + m2; 
    auto t4 = get_time::now(); 
    ret = m3; 
    auto t5 = get_time::now();

如果我修改代码如他所说，那么现在输出低于：

391880 
356480 
373072 
113051 
Elapsed time is : 1352013 ns

我也遇到这种情况该auto会带来问题的犰狳。

auto m1 = mat.submat(ni1, nk1, ni2, nk2) * 2; 
cout << size(m1) << endl;

它会打印非常大的尺寸，这是不正确的。

来源

2017-08-30 Aristotle0

Armadillo将所有操作排队，直到结果分配给矩阵或子矩阵。这就是为什么分配给submat似乎需要更长时间。它实际上是在分配时间进行乘法和加法，而不是之前。另外，请勿在Armadillo矩阵和表达式中使用_auto_关键字，因为这可能会导致问题。 – hbrerkere

@hbrerkere你的回答对我很有帮助。如果我使用'arma :: mat'而不是'auto'，那么成本时间就像你说的那样改变了。 – Aristotle0

您的代码中似乎没有具体的一点，应该让它变慢。你在做什么看起来很好。我建议的唯一的事情是，你应该尽量减少submat调用的数量，因为它可能会创建一个临时矩阵，因为它被认为相对昂贵。考虑在没有submat的情况下做数学计算。只需计算索引并自己分配它们。

鉴于这种情况，我还有其他两点建议：

如果硬要有可能是坏了你的代码什么的改进，研究替代的方式来写它（如果可能），并使用function profiler ，比如Valgrind，看看哪些功能成本最高，以及是否可以优化。
如果你放弃这一点，考虑学习如何使用C++进行多线程。现在，这很容易。要么使用OpenMP，这是非常容易的...但是你必须做到这一点，因为它容易犯错误（例如high contention）;或使用std::thread，这是一个相对较新的C++构造，它只是简单地在一个线程中运行一个函数。对于你的情况，由于你的应用程序是反复的，你可以使用它们中的任何一个。您可以使用my simple thread pool实现，该实现一旦完成就会安排新实例的调用。

祝你好运。

来源

2017-08-30 12:33:04

犰狳C++：如何高效地分配到子域

回答

相关问题