Optimizing a simple CPU-bound loop with Cython and replacing a list

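I am evaluating a few approaches and have hit a performance stumbling block.

Why is my Cython code so slow? I expected it to run much faster (perhaps nanoseconds for a 2D loop with only 256 ** 2 entries), rather than taking milliseconds.

Here are my test results: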

$ python setup.py build_ext --inplace; python test.py 
running build_ext 
     counter: 0.00236220359802 sec 
     pycounter: 0.00323309898376 sec 
     percentage: 73.1 %

My original code looked like this:

#!/usr/bin/env python
# encoding: utf-8
# filename: loop_testing.py

def generate_coords(dim, length):
    """Generates a list of coordinates from dimensions and size
    provided.

    Parameters:
        dim -- dimension
        length -- size of each dimension

    Returns:
        A list of coordinates based on dim and length
    """
    values = []
    if dim == 2:
        for x in xrange(length):
            for y in xrange(length):
                values.append((x, y))

    if dim == 3:
        for x in xrange(length):
            for y in xrange(length):
                for z in xrange(length):
                    values.append((x, y, z))

    return values

This does what I need, but it is slow. For dim, length = (2, 256), I see a time of about 2.3 ms in IPython.
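For comparison, here is a minimal pure-Python sketch of the same coordinate generation using `itertools.product` from the standard library. The name `generate_coords_product` is hypothetical and not part of the original code. It is written for Python 3 (`range` instead of `xrange`), it has not been benchmarked here, and unlike the original it handles any `dim` rather than only 2 and 3.

```python
from itertools import product


def generate_coords_product(dim, length):
    """Return all coordinate tuples of a dim-dimensional grid.

    Hypothetical helper: the ordering matches the nested loops in
    generate_coords (the last axis varies fastest).
    """
    return list(product(range(length), repeat=dim))


coords = generate_coords_product(2, 256)
print(len(coords))  # one tuple per grid cell
```

Because `product` iterates in C, this form avoids the explicit Python-level `append` calls, although the result is still a list of Python tuples.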

To speed this up, I developed a Cython equivalent (at least, I believe it is equivalent).

#!/usr/bin/env python
# encoding: utf-8
# filename: loop_testing.pyx
# cython: boundscheck=False
# cython: wraparound=False

cimport cython
from cython.parallel cimport prange

import numpy as np
cimport numpy as np


ctypedef int DTYPE

# 2D point updater
cpdef inline void _counter_2d(DTYPE[:, :] narr, int val) nogil:
    cdef:
        DTYPE count = 0
        DTYPE index = 0
        DTYPE x, y

    for x in range(val):
        for y in range(val):
            narr[index][0] = x
            narr[index][1] = y
            index += 1

cpdef DTYPE[:, :] counter(dim=2, val=256):
    narr = np.zeros((val**dim, dim), dtype=np.dtype('i4'))
    _counter_2d(narr, val)
    return narr

def pycounter(dim=2, val=256):
    vals = []
    for x in xrange(val):
        for y in xrange(val):
            vals.append((x, y))
    return vals

And the script that times the calls:

#!/usr/bin/env python
# filename: test.py
"""
Usage:
    test.py [options]
    test.py [options] <val>
    test.py [options] <dim> <val>

Options:
    -h --help  This Message
    -n    Number of loops [default: 10]
"""

if __name__ == "__main__":
    from docopt import docopt
    from timeit import Timer

    args = docopt(__doc__)
    dim = args.get("<dim>") or 2
    val = args.get("<val>") or 256
    n = args.get("-n") or 10
    dim = int(dim)
    val = int(val)
    n = int(n)

    tests = ['counter', 'pycounter']
    timing = {}
    for test in tests:
        code = "{}(dim=dim, val=val)".format(test)
        variables = "dim, val = ({}, {})".format(dim, val)
        setup = "from loop_testing import {}; {}".format(test, variables)
        t = Timer(code, setup=setup)
        timing[test] = t.timeit(n)/n

    for test, val in timing.iteritems():
        print "{:>20}: {} sec".format(test, val)
    print "{:>20}: {:>.3} %".format("percentage", timing['counter']/timing['pycounter'] * 100)
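The script above is Python 2 only (`print` statement, `iteritems`). As a rough Python 3 sketch of the same measurement, the snippet below times a callable with `timeit.timeit` and computes the same percentage. The helpers `mean_seconds` and `relative_percent` are hypothetical names introduced for illustration, and `pycounter` is redefined locally so the snippet does not depend on the compiled module.

```python
from timeit import timeit


def pycounter(dim=2, val=256):
    """Local pure-Python baseline, same loops as in loop_testing.pyx."""
    vals = []
    for x in range(val):
        for y in range(val):
            vals.append((x, y))
    return vals


def mean_seconds(func, n=10, **kwargs):
    """Average wall-clock seconds per call over n calls (hypothetical helper)."""
    return timeit(lambda: func(**kwargs), number=n) / n


def relative_percent(candidate, baseline):
    """Candidate time as a percentage of the baseline time."""
    return candidate / baseline * 100.0


baseline = mean_seconds(pycounter, n=10, dim=2, val=256)
print("{:>20}: {} sec".format("pycounter", baseline))
```

Applied to the two timings reported in the question, `relative_percent` gives roughly 73 %, which matches the `percentage` line in the output.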

And for reference, the setup.py used to build the Cython code:

from distutils.core import setup 
from Cython.Build import cythonize 
import numpy 

include_path = [numpy.get_include()] 

setup(
    name="looping", 
    ext_modules=cythonize('loop_testing.pyx'), # accepts a glob pattern 
    include_dirs=include_path, 
)

Edit: link to a working version: https://github.com/brianbruggeman/cython_experimentation

Source

2015-06-03 Brian Bruggeman

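Your Cython code is fine overall. The one problem is 'narr[index][0] = x': it does not compile to a direct assignment and makes slow Python C-API calls instead. Use 'narr[index, 0] = x' (the same applies to pure numpy). Also, try setting 'extra_compile_args=['-O3', '-march=native']' and 'extra_link_args=['-O3', '-march=native']' in your 'setup.py'; that should speed things up. – rth

Thanks! I will try this.

@rth 'narr[index, 0]' definitely fixed the problem. I am now seeing roughly a 100x speedup. I did not see any change from the extra compile/link options, but I don't mind leaving them in at this point. Many thanks! Would you like to add an answer?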

The Cython code is slow because of the narr[index][0] = x assignment, which relies heavily on the Python C-API. Using narr[index, 0] = x instead translates to pure C and fixes the problem.
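The difference between the two spellings can be illustrated in plain Python. The sketch below is only an analogy, not the code Cython generates: `RecordingGrid` and `RowProxy` are hypothetical classes that log every indexing call, showing that the chained form first builds an intermediate row object and then assigns into it, while the tuple form is a single indexing operation.

```python
class RowProxy:
    """Hypothetical intermediate object returned by grid[row]."""

    def __init__(self, grid, row):
        self.grid = grid
        self.row = row

    def __setitem__(self, col, value):
        self.grid.calls.append(("row.__setitem__", col))
        self.grid.data[self.row][col] = value


class RecordingGrid:
    """Hypothetical 2-D container that records how it is indexed."""

    def __init__(self, rows, cols):
        self.data = [[0] * cols for _ in range(rows)]
        self.calls = []

    def __getitem__(self, row):
        self.calls.append(("grid.__getitem__", row))
        return RowProxy(self, row)  # temporary object, created per access

    def __setitem__(self, key, value):
        self.calls.append(("grid.__setitem__", key))
        row, col = key
        self.data[row][col] = value


grid = RecordingGrid(4, 2)

grid[1][0] = 7                    # chained: lookup, then assignment
chained_calls = list(grid.calls)
grid.calls.clear()

grid[1, 0] = 7                    # tuple index: one assignment
tuple_calls = list(grid.calls)

print(chained_calls)
print(tuple_calls)
```

In the Cython case the single tuple-indexed assignment is the form that can be lowered to a direct memory write, which is consistent with the speedup reported in the comments.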

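As @perimosocordiae pointed out, running Cython with annotations enabled is definitely the way to debug this kind of problem.

In some cases it can also be worthwhile to specify the GCC compile flags explicitly in setup.py: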

setup(
    [...] 
    extra_compile_args=['-O2', '-march=native'], 
    extra_link_args=['-O2', '-march=native'])

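This should not be necessary, assuming reasonable default compile options. However, on my Linux system, for example, the default build appears to apply no optimization at all, and adding the flags above improved performance significantly.

Source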
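One detail worth checking when applying such flags: as far as I know, `extra_compile_args` and `extra_link_args` are keyword arguments of `Extension`, not of `setup()`, and `cythonize` accepts `Extension` objects. The sketch below shows one way the flags might be attached. The helper `make_extension` and the constant `OPT_FLAGS` are hypothetical names, and the `setup()` call is left as a comment so the snippet can be imported without triggering a build.

```python
from setuptools import Extension

# Assumed flags, taken from the answer above. Note that -march=native
# ties the compiled binary to the CPU of the build machine.
OPT_FLAGS = ["-O2", "-march=native"]


def make_extension(include_dirs=()):
    """Describe loop_testing.pyx with explicit compiler and linker flags."""
    return Extension(
        "loop_testing",
        sources=["loop_testing.pyx"],
        include_dirs=list(include_dirs),
        extra_compile_args=list(OPT_FLAGS),
        extra_link_args=list(OPT_FLAGS),
    )


# In setup.py this would be wired up roughly as follows (not run here):
#   import numpy
#   from setuptools import setup
#   from Cython.Build import cythonize
#   setup(name="looping",
#         ext_modules=cythonize([make_extension([numpy.get_include()])]))
```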

2015-06-05 07:33:27 rth

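Thanks for your help!

It looks like your Cython code is doing something odd with the numpy arrays and is not really taking advantage of C compilation. To inspect the generated code, run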

cython -a loop_testing.pyx

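What happens if you leave out the numpy part and do a straightforward Cython translation of the Python function?

Edit: it looks like you can avoid Cython entirely and still get a decent speedup (~30x on my machine).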

import numpy as np

def npcounter(dim=2, val=256):
    return np.indices((val,)*dim).reshape((dim,-1)).T
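As a quick sanity check, the following sketch compares the output of `npcounter` against the nested-loop ordering, using `itertools.product` as the reference. It assumes NumPy is installed; the small grid size `val=4` is chosen only to keep the comparison cheap.

```python
import itertools

import numpy as np


def npcounter(dim=2, val=256):
    # np.indices yields one index grid per axis; flattening each grid and
    # transposing gives one (x, y, ...) row per cell.
    return np.indices((val,) * dim).reshape((dim, -1)).T


coords = npcounter(dim=2, val=4)
expected = list(itertools.product(range(4), repeat=2))

# Convert rows to tuples so they compare directly with the reference list.
as_tuples = [tuple(row) for row in coords.tolist()]
print(as_tuples == expected)
```

Note that the result is a 2-D integer array of shape `(val**dim, dim)` rather than a list of tuples, so callers that need tuples would have to convert it.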

Source

2015-06-03 21:50:17 perimosocordiae

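That is my next step. I would really like to avoid malloc if I can; I am using numpy to allocate the memory.

You can make lists in Cython and append to them. Start there, before you get into malloc. – perimosocordiae

I thought I was trying to avoid lists... Appending in Python implies that I am using the Python interpreter to add to a Python object, which, from what I have been reading and seeing, I do not want to do. https://gist.github.com/brianbruggeman/625e488777722e852e6c No noticeable difference.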
