为什么numpy.dot无法在多于2维的场景上并行化

我注意到了numpy.dot()函数的一个有趣行为。我的企业RedHat 6.7有两个Xeon CPU，每个CPU有12个内核。我运行下面的代码片段，然后检查CPU利用率htop为什么numpy.dot无法在多于2维的场景上并行化

下面的代码使用所有的内核我的服务器上：

import numpy as np 
a = np.random.rand(1000, 1000) 
b = np.random.rand(1000, 5) 
z = a.dot(b) #or use %timeit a.dot(b) if you use ipython

编辑： 下面是HTOP的屏幕截图，同时运行上面的代码

但是，只要我像以下那样再添加一个尺寸到b，就只用了一个内核。

import numpy as np 
a = np.random.rand(1000, 1000) 
b = np.random.rand(1, 1000, 5) #or np.random.rand(n, 1000, 5) where n>=1 
z = a.dot(b) #or use %timeit a.dot(b) if you use ipython

编辑： 下面是HTOP的屏幕截图，同时运行以上

下面的代码是我的Python环境从import sys; sys.version配置

'2.7.11 |Continuum Analytics, Inc.| (default, Dec 6 2015, 18:08:32) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]'

下面是配置来自numpy.show_config()的信息

lapack_opt_info: 
libraries = ['mkl_lapack95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread'] 
library_dirs = ['/opt/anaconda2/envs/portopt/lib'] 
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)] 
include_dirs = ['/opt/anaconda2/envs/portopt/include'] 
blas_opt_info: 
libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread'] 
library_dirs = ['/opt/anaconda2/envs/portopt/lib'] 
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)] 
include_dirs = ['/opt/anaconda2/envs/portopt/include'] 
openblas_lapack_info: NOT AVAILABLE 
lapack_mkl_info: 
libraries = ['mkl_lapack95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread'] 
library_dirs = ['/opt/anaconda2/envs/portopt/lib'] 
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)] 
include_dirs = ['/opt/anaconda2/envs/portopt/include'] 
blas_mkl_info: 
libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread'] 
library_dirs = ['/opt/anaconda2/envs/portopt/lib'] 
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)] 
include_dirs = ['/opt/anaconda2/envs/portopt/include'] 
mkl_info: 
libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread'] 
library_dirs = ['/opt/anaconda2/envs/portopt/lib'] 
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)] 
include_dirs = ['/opt/anaconda2/envs/portopt/include']

有人看过这个吗？我倾向于认为这是一个错误，而不是设计，因为显然还有更多的工作要做。另外，有没有办法强制numpy.dot的腭化？在此先感谢！

更新： 我找到了一种解决方法来加速计算。请参阅下面的代码片段。

import numpy as np 
a = np.random.rand(1000, 1000) #in my program a variable 
b = np.random.rand(100, 1000, 5) #b is a constant 
z1 = a.dot(b) 
c=b.swapaxes(0, 1).reshape(1000, 5*100) #the trick is to turn the 3d array into a 2d matrix 
z2 = a.dot(c).reshape(z1.shape) #then reshape the result to the desired shape. 
np.allclose(z1, z2) #the results are identical but the computation of z2 is more than 10 times faster than that of z1 on my server.

不过，我同意从长远来看，我们应该研究numpy的代码@hpaulj曾建议并修复问题（事件是一个bug）一劳永逸。

来源

2016-03-09 Fei Meng

分享您的HTOP的检查，请！ –

我认为你必须学习C源代码，如

https://github.com/numpy/numpy/blob/2f7827702ef6b6ac4b318103d5c0dfe2ff6e7eb3/numpy/core/src/multiarray/cblasfuncs.c

cblas_matrixproduct有很多的代码，检查2输入阵列的尺寸。到最后，有一部分处理矩阵乘法。

(PyArray_NDIM(ap1) == 2 && PyArray_NDIM(ap2) == 2)

它看起来像计算核心是用括号和NPY_BEGIN_ALLOW_THREADSNPY_END_ALLOW_THREADS

你MKL的代码可能工作为BLAS的替代品。

现在的技巧是找到一个3D数组正在处理的位置。不知何故它在切片上运行，所以BLAS代码仍然可以看到一个二维数组。

我的猜测是在BLAS/MKL代码中使用了多个核心，而不是在numpy代码中。换句话说，numpy代码表示（编译器）“这是确定在这里使用线程和/或核”，而不是“这里是如何分裂它基于阵列尺寸内核之间”。

https://github.com/numpy/numpy/blob/386639363233165bcba1f1ba7b10aff3c40d46b3/numpy/core/src/multiarray/multiarraymodule.c

PyArray_MatrixProduct2似乎是决定如何调用BLAS点的功能我刚才发现的功能。

的2 2D矩阵的情况下似乎与处理：

#if defined(HAVE_CBLAS) 
if (PyArray_NDIM(ap1) <= 2 && PyArray_NDIM(ap2) <= 2 && 
     (NPY_DOUBLE == typenum || NPY_CDOUBLE == typenum || 
     NPY_FLOAT == typenum || NPY_CFLOAT == typenum)) { 
    return cblas_matrixproduct(typenum, ap1, ap2, out); 
}

否则它必须使用这样的代码（确保之后的校正尺寸兼容）：

NPY_BEGIN_THREADS_DESCR(PyArray_DESCR(ap2)); 
while (it1->index < it1->size) { 
    while (it2->index < it2->size) { 
     dot(it1->dataptr, is1, it2->dataptr, is2, op, l, ret); 
     op += os; 
     PyArray_ITER_NEXT(it2); 
    } 
    PyArray_ITER_NEXT(it1); 
    PyArray_ITER_RESET(it2); 
} 
NPY_END_THREADS_DESCR(PyArray_DESCR(ap2));

其中dot = PyArray_DESCR(ret)->f->dotfunc;具有根据dtype定义。

我不知道我已经回答了你的问题，但很显然，代码是复杂的，以及如何你我会分裂的任务不适用于简单的推理。

来源

2016-03-10 00:37:40 hpaulj

感谢您提供及时，深刻的答案。我想它会带我一段时间来阅读代码，并查明确切的问题，即使您已大大缩小了搜索范围。我发现我的即时需求的方法，但理解的C代码，肯定会从长远来看有很大价值的。 –

为什么numpy.dot无法在多于2维的场景上并行化

回答

相关问题