Generator function performance

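I'm trying to understand the performance of a generator function. I've used the cProfile and pstats modules to collect and inspect profiling data. The function in question is this: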

def __iter__(self):
    delimiter = None
    inData = self.inData
    lenData = len(inData)
    cursor = 0
    while cursor < lenData:
        if delimiter:
            mo = self.stringEnd[delimiter].search(inData[cursor:])
        else:
            mo = self.patt.match(inData[cursor:])
        if mo:
            mo_lastgroup = mo.lastgroup
            mstart = cursor
            mend = mo.end()
            cursor += mend
            delimiter = (yield (mo_lastgroup, mo.group(mo_lastgroup), mstart, mend))
        else:
            raise SyntaxError("Unable to tokenize text starting with: \"%s\"" % inData[cursor:cursor+200])

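self.inData is a unicode string, self.stringEnd is a dict of 4 simple regular expressions, and self.patt is one big regular expression. The whole thing splits the big string into smaller strings, one by one.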

Profiling a program that uses it, I found that the biggest part of the program's run time is spent in this function:

In [800]: st.print_stats("Scanner.py:124") 

     463263 function calls (448688 primitive calls) in 13.091 CPU seconds 

    Ordered by: cumulative time 
    List reduced from 231 to 1 due to restriction <'Scanner.py:124'> 

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     10835   11.465    0.001   11.534    0.001 Scanner.py:124(__iter__)
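For readers who want to reproduce this kind of report, here is a minimal sketch of collecting and filtering profile data with cProfile and pstats. The function `busy` is a hypothetical stand-in for the scanner run, not part of the original program; the string passed to `print_stats` and `print_callees` is a regular-expression restriction, as in the session above.

```python
import cProfile
import io
import pstats


def busy(n):
    """Hypothetical workload standing in for the real scanner run."""
    return sum(i * i for i in range(n))


# Collect profiling data around the code of interest.
profiler = cProfile.Profile()
profiler.enable()
result = busy(100_000)
profiler.disable()

# Write the report to a string buffer so it can be inspected or saved.
buffer = io.StringIO()
st = pstats.Stats(profiler, stream=buffer)
st.sort_stats("cumulative")
st.print_stats("busy")      # only entries whose name matches this regex
st.print_callees("busy")    # what the matching functions called
report = buffer.getvalue()
print(report)
```

Note that tottime counts only time spent in the function body itself, excluding profiled sub-calls, so work done by plain expressions inside the function shows up there rather than in the callee list.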

But looking at the profile of the function itself, not much time is spent in the sub-functions it calls:

In [799]: st.print_callees("Scanner.py:124") 
    Ordered by: cumulative time 
    List reduced from 231 to 1 due to restriction <'Scanner.py:124'> 

Function                      called...
                                  ncalls  tottime  cumtime
Scanner.py:124(__iter__)  ->       10834    0.006    0.006  {built-in method end}
                                   10834    0.009    0.009  {built-in method group}
                                    8028    0.030    0.030  {built-in method match}
                                    2806    0.025    0.025  {built-in method search}
                                       1    0.000    0.000  {len}

The rest of the function contains little besides the while loop, assignments and if-else. Even the send method on the generator, which I use, is fast:

ncalls tottime percall cumtime percall filename:lineno(function) 
13643/10835 0.007 0.000 11.552 0.001 {method 'send' of 'generator' objects}

Is it possible that the yield, passing a value back to the consumer, takes most of the time? Is there anything else I'm not aware of?

EDIT:

I should probably have mentioned that the generator function __iter__ is a method of a small class, so self refers to an instance of that class.

Source

2011-06-08 ThomasH

How big is inData? Repeated slicing may be inefficient. Maybe try using islice from itertools and see if that makes a difference. – Dunes 2011-06-08 20:45:20

@Dunes Thanks, will try. The performance data was taken with a string of roughly 1MB. - If you put this into an answer, I could upvote it. – ThomasH 2011-06-08 21:01:59

Have you tried [this](http://stackoverflow.com/questions/4295799/how-to-improve-performance-of-this-code/4299378#4299378)? – 2011-06-09 04:13:22

This is actually Dunes' answer; unfortunately he only gave it as a comment and doesn't seem inclined to post it as a proper answer.

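The main performance culprit was the string slicing. Some timing measurements showed that slicing performance degrades with big slices (meaning taking a big slice out of an already big string). To work around this, I now use the pos parameter of the regex object methods: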

if delimiter:
    mo = self.stringEnd[delimiter].search(inData, pos=cursor)
else:
    mo = self.patt.match(inData, pos=cursor)
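As a rough illustration of the difference, the sketch below tokenizes the same text once with slicing and once with pos. The pattern `PATT` and the sample text are made-up stand-ins for self.patt and the real input, and the delimiter switching is left out. One detail to watch when changing the original loop: with pos, mo.end() is an absolute index into the whole string, so the cursor is assigned rather than incremented, and the token length has to be computed as mo.end() - cursor if the yielded tuple is to stay the same.

```python
import re
import time

# Hypothetical stand-in for the large self.patt expression.
PATT = re.compile(r"(?P<word>[A-Za-z]+)|(?P<num>\d+)|(?P<space>\s+)")


def tokens_by_slice(data):
    """Approach from the question: copy the remaining text on every step."""
    cursor = 0
    while cursor < len(data):
        mo = PATT.match(data[cursor:])   # copies len(data) - cursor characters
        if mo is None:
            raise SyntaxError("Unable to tokenize text starting with: %r"
                              % data[cursor:cursor + 200])
        mend = mo.end()                  # relative to the slice
        yield (mo.lastgroup, mo.group(mo.lastgroup), cursor, mend)
        cursor += mend


def tokens_by_pos(data):
    """Same tokens, but the regex starts at `cursor` without copying."""
    cursor = 0
    while cursor < len(data):
        mo = PATT.match(data, cursor)    # second argument is pos
        if mo is None:
            raise SyntaxError("Unable to tokenize text starting with: %r"
                              % data[cursor:cursor + 200])
        mend = mo.end() - cursor         # end() is now an absolute index
        yield (mo.lastgroup, mo.group(mo.lastgroup), cursor, mend)
        cursor = mo.end()


def timed(tokenizer, data):
    """Return (token count, elapsed seconds) for one full pass."""
    start = time.perf_counter()
    count = sum(1 for _ in tokenizer(data))
    return count, time.perf_counter() - start


sample = "word 123 " * 5000              # made-up input, 45,000 characters
for fn in (tokens_by_slice, tokens_by_pos):
    count, seconds = timed(fn, sample)
    print(f"{fn.__name__}: {count} tokens in {seconds:.3f}s")
```

The slicing itself is not a function call the profiler records separately, which would explain why the time showed up as tottime of __iter__ rather than under any callee. Also note that with pos the '^' anchor still matches only at the real beginning of the string, not at pos, so patterns relying on '^' may behave differently from the sliced version.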

Thanks to all who helped.

Source

2011-06-09 11:25:16 ThomasH

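Ah, sorry. I've been busy with work these past few days. I only knew where the problem was, as my suggested solution was inadequate. So please take the credit for finding the solution. – Dunes 2011-06-11 13:24:57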

@Dunes I was asking about the problem, so your comment was quite sufficient. Next time :-). – ThomasH 2011-06-11 19:53:47

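If I'm reading your example correctly, you are taking a generator object, putting it into delimiter, and using it for an array lookup. This may not be your speed problem, but I'm pretty sure it's a bug.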

Source

2011-06-08 18:17:09

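If you're referring to the 'delimiter = (yield ...)' part, no. This function is a **coroutine**, which allows the user to do 'co.send(x)', which resumes execution (like 'next(generator)') and makes the (yield) ... evaluate to 'x' (if you just use it as an iterator, it will evaluate to 'None', IIRC). – delnan 2011-06-08 18:28:08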

Yes, as delnan wrote, I sometimes pass short strings into the generator (using .send from the outside) to make it switch to a different regex for the next chunk. – ThomasH 2011-06-08 18:54:14
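A small sketch of the send mechanism described in these comments. The generator `tokens` and its `log` argument are made up for illustration; the point is only that the yield expression evaluates to whatever the consumer passes to send, and to None when the consumer uses next.

```python
def tokens(chunks, log):
    """Yield each chunk and record what the yield expression evaluated to."""
    for chunk in chunks:
        delimiter = (yield chunk)   # None under next(), x under send(x)
        log.append(delimiter)


log = []
gen = tokens(["a", "b", "c"], log)

first = next(gen)         # runs up to the first yield
second = gen.send('"')    # first yield evaluates to '"', runs to the next yield
third = next(gen)         # second yield evaluates to None

print(first, second, third, log)
```

In the scanner from the question, the value received this way selects which regex from self.stringEnd is used for the next chunk.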
