Cython prange is not improving performance

I am trying to use Cython's prange to improve the performance of some metric computations. Here is my code:

def shausdorff(float64_t[:,::1] XA not None, float64_t[:,:,::1] XB not None):
    cdef:
        Py_ssize_t i
        Py_ssize_t n = XB.shape[2]
        float64_t[::1] hdist = np.zeros(n)

    #arrangement to fix contiguity
    XB = np.asanyarray([np.ascontiguousarray(XB[:,:,i]) for i in range(n)])

    for i in range(n):
        hdist[i] = _hausdorff(XA, XB[i])
    return hdist

def phausdorff(float64_t[:,::1] XA not None, float64_t[:,:,::1] XB not None):
    cdef:
        Py_ssize_t i
        Py_ssize_t n = XB.shape[2]
        float64_t[::1] hdist = np.zeros(n)

    #arrangement to fix contiguity (EDITED)
    cdef float64_t[:,:,::1] XC = np.asanyarray([np.ascontiguousarray(XB[:,:,i]) for i in range(n)])

    with nogil, parallel(num_threads=4):
        for i in prange(n, schedule='static', chunksize=1):
            hdist[i] = _hausdorff(XA, XC[i])
    return hdist

Basically, in each iteration the Hausdorff metric is computed between XA and each XB[i]. Here is the signature of the _hausdorff function:

cdef inline float64_t _hausdorff(float64_t[:,::1] XA, float64_t[:,::1] XB) nogil: 
    ...

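My problem is that the sequential shausdorff and the parallel phausdorff have the same timings. Furthermore, it looks like phausdorff is not creating any threads at all.

So my question is: what is wrong with my code, and how can I fix it to get threading to work?

Here is my setup.py: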

from distutils.core import setup 
from distutils.extension import Extension 
from Cython.Build import cythonize 
from Cython.Distutils import build_ext 

ext_modules = [
    Extension("custom_metric",
              ["custom_metric.pyx"],
              libraries=["m"],
              extra_compile_args=["-O3", "-ffast-math", "-march=native", "-fopenmp"],
              extra_link_args=["-fopenmp"]
              )
]

setup( 
    name = "custom_metric", 
    cmdclass = {"build_ext": build_ext}, 
    ext_modules = ext_modules 
)

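EDIT 1: Here is a link to the HTML generated by cython -a: custom_metric.html

EDIT 2: Here is an example of how to call the corresponding functions (you need to compile the Cython file first):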

import custom_metric as cm 
import numpy as np 

XA = np.random.random((9000, 210)) 
XB = np.random.random((1000, 210, 9)) 

#timing 'parallel' version 
%timeit cm.phausdorff(XA, XB) 

#timing sequential version 
%timeit cm.shausdorff(XA, XB)
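
Outside IPython, where the %timeit magic is not available, the same comparison can be sketched with the standard-library timeit module. The helper name best_time and the stand-in workload are hypothetical; in practice you would pass cm.phausdorff or cm.shausdorff together with XA and XB. timeit measures wall-clock time by default, which is the relevant quantity when comparing a threaded version against a serial one.

```python
import timeit


def best_time(func, *args, repeat=3, number=1):
    """Return the best wall-clock time in seconds for one call of func(*args)."""
    timer = timeit.Timer(lambda: func(*args))
    # Timer.repeat returns one total per run; divide by `number` for a per-call time.
    return min(timer.repeat(repeat=repeat, number=number)) / number


def stand_in(n):
    # Hypothetical placeholder workload, used here instead of cm.phausdorff.
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    print(f"{best_time(stand_in, 100_000):.6f} s per call")
```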

Source

2016-08-19 mavillan

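Did you try printing the equivalent of 'omp_get_thread_num()' inside the body of the 'prange' loop? See http://cython.readthedocs.io/en/latest/src/userguide/parallelism.html – Harald

Could 'XB' be a Python object? Run 'cython -a custom_metric.pyx' to get the annotated output. – cgohlke

If you decorate 'phausdorff' with '@cython.boundscheck(False)' and '@cython.wraparound(False)', does anything change? –

I think the parallelization is working, but the extra overhead of the parallelization is eating up the time it could save. If I try arrays of different sizes, I do start to see a speedup in the parallel version: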

XA = np.random.random((900, 2100)) 
XB = np.random.random((100, 2100, 90))

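Here the parallel version takes ~2/3 of the time of the serial version for me, which certainly isn't the 1/4 you'd expect, but it does at least show some benefit.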
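
One rough way to read the ~2/3 figure is through Amdahl's law. This is an assumption added for illustration, not something measured in the answer: it treats all overhead, including the serial contiguity fix, as non-parallel work. The function names below are hypothetical.

```python
def parallel_fraction(speedup, n_threads):
    """Solve Amdahl's law, speedup = 1 / ((1 - p) + p / n), for the fraction p."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n_threads)


def amdahl_speedup(p, n_threads):
    """Speedup predicted by Amdahl's law for parallel fraction p on n threads."""
    return 1.0 / ((1.0 - p) + p / n_threads)


# The parallel version takes ~2/3 of the serial time, i.e. a speedup of ~1.5.
observed_speedup = 1.0 / (2.0 / 3.0)
p = parallel_fraction(observed_speedup, 4)
print(round(p, 3))  # 0.444
```

Under that assumption, only about 4/9 of the run time is benefiting from the four threads, which is consistent with a large serial portion outside the prange loop.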

One improvement I can offer is to replace the code that fixes contiguity:

XB = np.asanyarray([np.ascontiguousarray(XB[:,:,i]) for i in range(n)])

with

XB = np.ascontiguousarray(np.transpose(XB,[2,0,1]))

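This speeds up both the parallel and non-parallel functions quite significantly (a factor of 2 for the arrays you originally gave). It does make it slightly more obvious that you are being slowed down by overhead in the prange: for the arrays in your example, the serial version is actually faster.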
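
As a quick sanity check that the two contiguity fixes produce the same array, here is a minimal NumPy sketch. The small array shape and the variable names via_list and via_transpose are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
XB = rng.random((10, 7, 3))  # small stand-in for the (1000, 210, 9) array
n = XB.shape[2]

# Original approach: copy each slice, then stack the copies into a new array.
via_list = np.asanyarray([np.ascontiguousarray(XB[:, :, i]) for i in range(n)])

# Suggested approach: move the last axis to the front, then make one contiguous copy.
via_transpose = np.ascontiguousarray(np.transpose(XB, [2, 0, 1]))

print(via_list.shape, via_transpose.shape)
print(np.array_equal(via_list, via_transpose))
print(via_transpose.flags["C_CONTIGUOUS"])
```

Both results have shape (n, rows, cols) and are C-contiguous, so either can be assigned to a float64_t[:,:,::1] memoryview; the transpose version does a single copy instead of n + 1.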

Source

2016-08-27 08:19:20 DavidW

(Posted as community wiki because this doesn't provide a solution, so I want to exclude it from the bounty) – DavidW
