2017-02-20 111 views
1

我想加快我的代码使用cython。将代码翻译成Python的cython后,我看到我没有获得任何加速。我认为问题的根源在于我将numpy数组转换为cython时表现不佳。Cython:缓慢的numpy阵列

我已经想出了一个非常简单的程序,以显示这一点:

############### test.pyx ################# 
import numpy as np 
cimport numpy as np 
cimport cython 

def func1(long N): 

    cdef double sum1,sum2,sum3 
    cdef long i 

    sum1 = 0.0 
    sum2 = 0.0 
    sum3 = 0.0 

    for i in range(N): 
     sum1 += i 
     sum2 += 2.0*i 
     sum3 += 3.0*i 

    return sum1,sum2,sum3 

def func2(long N): 

    cdef np.ndarray[np.float64_t,ndim=1] sum_arr 
    cdef long i 

    sum_arr = np.zeros(3,dtype=np.float64) 

    for i in range(N): 
     sum_arr[0] += i 
     sum_arr[1] += 2.0*i 
     sum_arr[2] += 3.0*i 

    return sum_arr 

def func3(long N): 

    cdef double sum_arr[3] 
    cdef long i 

    sum_arr[0] = 0.0 
    sum_arr[1] = 0.0 
    sum_arr[2] = 0.0 

    for i in range(N): 
     sum_arr[0] += i 
     sum_arr[1] += 2.0*i 
     sum_arr[2] += 3.0*i 

    return sum_arr 
########################################## 

################## test.py ############### 
import time 
import test as test 

N = 1000000000 

for i in xrange(10): 

    start = time.time() 
    sum1,sum2,sum3 = test.func1(N) 
    print 'Time taken = %.3f'%(time.time()-start) 

print '\n' 
for i in xrange(10): 
    start = time.time() 
    sum_arr = test.func2(N) 
    print 'Time taken = %.3f'%(time.time()-start) 

print '\n' 
for i in xrange(10): 
    start = time.time() 
    sum_arr = test.func3(N) 
    print 'Time taken = %.3f'%(time.time()-start) 
############################################ 

而且从蟒蛇test.py我得到:

Time taken = 1.445 
Time taken = 1.433 
Time taken = 1.434 
Time taken = 1.428 
Time taken = 1.449 
Time taken = 1.425 
Time taken = 1.421 
Time taken = 1.451 
Time taken = 1.483 
Time taken = 1.418 

Time taken = 2.623 
Time taken = 2.603 
Time taken = 2.977 
Time taken = 3.237 
Time taken = 2.748 
Time taken = 2.798 
Time taken = 2.811 
Time taken = 2.783 
Time taken = 2.585 
Time taken = 2.595 

Time taken = 1.503 
Time taken = 1.529 
Time taken = 1.509 
Time taken = 1.543 
Time taken = 1.427 
Time taken = 1.425 
Time taken = 1.423 
Time taken = 1.415 
Time taken = 1.414 
Time taken = 1.418 

我的问题是:为什么FUNC2几乎是2倍速度较慢比func1和func3?

有没有办法改善这一点?

谢谢!

######## UPDATE

我真正的问题如下。我正在调用接受3D数组的函数(比如P [i,j,k])。函数将遍历每个元素并计算几个量:一个数量取决于该位置数组的值(比如A = f(P [i,j,k])),另一个量只取决于位置(B = g(i,j,k))。示意图如下:

for i in xrange(N): 
    corr1 = h(i,val) 

    for j in xrange(N): 
     corr2 = h(j,val) 

     for k in xrange(N): 
      corr3 = h(k,val) 

      A = f(P[i,j,k]) 
      B = g(i,j,k) 
      Arr[B] += A*corr1*corr2*corr3 

其中val是由数字表示的3D数组的属性。这个数字对于不同的领域可能是不同的。

由于我必须对许多3D数组进行这种操作,我认为如果我创建一个接受许多不同输入3D数组的新例程会更好,从而使数组的数量未知。这个想法是因为B在所有数组中都是完全相同的,所以我可以避免为每个数组计算它,只计算一次。问题是,CORR1,CORR2,corr3上面会成为数组:

如果我有一些3D阵列等于num_3D_arrays我做的事情为:

for i in xrange(N): 
    for p in xrange(num_3D_arrays): 
     corr1[p] = h(i,val[p]) 

    for j in xrange(N): 
     for p in xrange(num_3D_arrays): 
      corr2[p] = h(j,val[p]) 

     for k in xrange(N): 
      for p in xrange(num_3D_arrays): 
       corr3[p] = h(k,val[p]) 

      B = g(i,j,k) 
      for p in xrange(num_3D_arrays): 
       A[p] = f(P[i,j,k]) 
       Arr[p,B] += A[p]*corr1[p]*corr2[p]*corr3[p] 

所以VAL,我改变从标量到数组的变量corr1,corr2,corr3和A正在消除我期望避免执行大循环的性能。

+0

代码范围(N): sum_arr [0] + = i sum_arr [1] + = 2.0 * i sum_arr [2] + = 3.0 * i'忽略numpy擅长的所有内容。 Numpy不是很快,因为你可以快速访问索引,但是因为它可以快速进行数字操作。但不是那样。我建议读入numpy –

+0

我想这很难让它更快。因为假如你固执地使用'numpy',你必须在该循环中创建numpy数组,并执行np.sum(),但创建numpy数组可能是该代码片段中最慢的事情。我还建议分别检查每条线,而不是这个简单的时间。 ** [一些阅读分析](http://stackoverflow.com/questions/582336/how-can-you-profile-a-script)** –

+0

好的谢谢!在我的情况下,问题是我不能像func1那样定义单个变量,但是我需要定义一个我不知道先验的大小的数组。有没有不同的方式来做到这一点比使用numpy数组? – Francisco

回答

0
  • 为什么FUNC2几乎是2倍比FUNC1慢?

    这是因为索引会导致间接性,因此您将基本操作数加倍。计算总和像func1,然后影响与 sum=array([sum1,sum2,sum3])

  • 如何加快python代码?

    1. numpy是第一个好主意,它不费吹灰之力就提高了近C的速度。

    2. 努巴可以不费吹灰之力,而且非常简单。

    3. Cython for critical cases。

这里是一些举例说明的是:

# python way 
def func1(N): 
    sum1 = 0.0 
    sum2 = 0.0 
    sum3 = 0.0 

    for i in range(N): 
     sum1 += i 
     sum2 += 2.0*i 
     sum3 += 3.0*i 

    return sum1,sum2,sum3 

# numpy way 
def func2(N): 
    aran=arange(float(N)) 
    sum1=aran.sum() 
    sum2=(2.0*aran).sum() 
    sum3=(3.0*aran).sum() 
    return sum1,sum2,sum3 

#numba way 
import numba  
func3 =numba.njit(func1) 

""" 
In [609]: %timeit func1(10**6) 
1 loop, best of 3: 710 ms per loop 

In [610]: %timeit func2(1e6) 
100 loops, best of 3: 22.2 ms per loop 

In [611]: %timeit func3(10e6) 
100 loops, best of 3: 2.87 ms per loop 
""" 
+0

感谢您的回复!我真正的问题比func1和func2更复杂,我只是用它来展示问题。我不完全理解的是为什么索引与numpy非常缓慢。如果我定义例程: – Francisco

+0

DEF FUNC3(长N): CDEF双sum_arr [3] CDEF长我 sum_arr [0] = 0.0; sum_arr [01] = 0.0; sum_arr [2] =在范围0.0 对于i(N): sum_arr [0] + = I sum_arr [1] + = 2.0 * I sum_arr [2] + = 3.0 * I 返回sum_arr – Francisco

+0

SUM1当sum_arr [i] + = 23是sum_arr [ref + i] + = 23时,+ 23 = 1ns成本2ns。它是一个麻烦的问题,这是一个组装级别的问题。 –

0

看由cython -a ...pyx产生的html

对于func1,所述sum1 += i线扩展为:

+15:   sum1 += i 
    __pyx_v_sum1 = (__pyx_v_sum1 + __pyx_v_i); 

func3为,具有C阵列

+45:   sum_arr[0] += i 
    __pyx_t_3 = 0; 
    (__pyx_v_sum_arr[__pyx_t_3]) = ((__pyx_v_sum_arr[__pyx_t_3]) + __pyx_v_i); 

稍微复杂一些,但直向前c

但对于func2

+29:   sum_arr[0] += i 
    __pyx_t_12 = 0; 
    __pyx_t_6 = -1; 
    if (__pyx_t_12 < 0) { 
     __pyx_t_12 += __pyx_pybuffernd_sum_arr.diminfo[0].shape; 
     if (unlikely(__pyx_t_12 < 0)) __pyx_t_6 = 0; 
    } else if (unlikely(__pyx_t_12 >= __pyx_pybuffernd_sum_arr.diminfo[0].shape)) __pyx_t_6 = 0; 
    if (unlikely(__pyx_t_6 != -1)) { 
     __Pyx_RaiseBufferIndexError(__pyx_t_6); 
     __PYX_ERR(0, 29, __pyx_L1_error) 
    } 
    *__Pyx_BufPtrStrided1d(__pyx_t_5numpy_float64_t *, __pyx_pybuffernd_sum_arr.rcbuffer->pybuffer.buf, __pyx_t_12, __pyx_pybuffernd_sum_arr.diminfo[0].strides) += __pyx_v_i; 

复杂得多一起numpy功能(例如Pyx_BUfPtrStrided1d)的引用。偶数初始化数组是复杂的:

+26:  sum_arr = np.zeros(3,dtype=np.float64) 
    __pyx_t_1 = __Pyx_GetModuleGlobalName(__pyx_n_s_np); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 26, __pyx_L1_error) 
    __Pyx_GOTREF(__pyx_t_1); 
    .... 

我希望移动sum_arr创建到调用Python和传递它作为参数传递给func2会节省一些时间。

有阅读本指南使用memoryviews

http://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html

如果你专注于写作的低级操作,使他们转化为简单的c你会得到最好的性能cython。在

for k in xrange(N): 
     corr3 = h(k,val) 

     A = f(P[i,j,k]) 
     B = g(i,j,k) 
     Arr[B] += A*corr1*corr2*corr3 

它不是i,j,k环,将你慢下来。它每次评估h,fg以及Arr[B] +=...。这些函数应严格编码cython,而不是一般的Python函数。请参阅memoryview指南中sum3d函数的编译简单性。

+0

谢谢!我现在明白它是怎么回事。由@ user7138814提出的解决方案很好用 – Francisco

2

有几个事情可以做,在用Cython加快数组索引:

所以对于你的函数:

@cython.boundscheck(False) 
@cython.wraparound(False) 
def func2(long N): 

    cdef np.float64_t[::1] sum_arr 
    cdef long i 

    sum_arr = np.zeros(3,dtype=np.float64) 

    for i in range(N): 
     sum_arr[0] += i 
     sum_arr[1] += 2.0*i 
     sum_arr[2] += 3.0*i 

    return sum_arr 

对于原始代码用Cython产生用于线sum_arr[0] += i以下C代码:

__pyx_t_12 = 0; 
__pyx_t_6 = -1; 
if (__pyx_t_12 < 0) { 
    __pyx_t_12 += __pyx_pybuffernd_sum_arr.diminfo[0].shape; 
    if (unlikely(__pyx_t_12 < 0)) __pyx_t_6 = 0; 
} else if (unlikely(__pyx_t_12 >= __pyx_pybuffernd_sum_arr.diminfo[0].shape)) __pyx_t_6 = 0; 
if (unlikely(__pyx_t_6 != -1)) { 
    __Pyx_RaiseBufferIndexError(__pyx_t_6); 
    {__pyx_filename = __pyx_f[0]; __pyx_lineno = 13; __pyx_clineno = __LINE__; goto __pyx_L1_error;} 
} 
*__Pyx_BufPtrStrided1d(__pyx_t_5numpy_float64_t *, __pyx_pybuffernd_sum_arr.rcbuffer->pybuffer.buf, __pyx_t_12, __pyx_pybuffernd_sum_arr.diminfo[0].strides) += __pyx_v_i; 

通过上面的改进:

__pyx_t_8 = 0; 
*((double *) (/* dim=0 */ ((char *) (((double *) __pyx_v_sum_arr.data) + __pyx_t_8)))) += __pyx_v_i; 
+0

非常感谢!这确实有效,我获得与func1和func2相同的速度。真棒! – Francisco