BeautifulSoup 4 with lxml vs BeautifulSoup 3

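I am migrating some parsers from BeautifulSoup 3 to BeautifulSoup 4, and I thought it would be a good idea to profile how much faster they would get, considering that lxml is super fast and it is the parser I am using with BS4. Here are the profile results: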

For BS3:

43208 function calls (42654 primitive calls) in 0.103 seconds 

Ordered by: standard name 

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <string>:2(<module>)
       18    0.000    0.000    0.000    0.000 <string>:8(__new__)
        1    0.000    0.000    0.072    0.072 <string>:9(parser)
       32    0.000    0.000    0.000    0.000 BeautifulSoup.py:1012(__init__)
        1    0.000    0.000    0.000    0.000 BeautifulSoup.py:1018(buildTagMap)
...

For BS4 using lxml:

164440 function calls (163947 primitive calls) in 0.244 seconds 

Ordered by: standard name 

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.040    0.040    0.069    0.069 <string>:2(<module>)
       18    0.000    0.000    0.000    0.000 <string>:8(__new__)
        1    0.000    0.000    0.158    0.158 <string>:9(parser)
        1    0.000    0.000    0.008    0.008 HTMLParser.py:1(<module>)
        1    0.000    0.000    0.000    0.000 HTMLParser.py:54(HTMLParseError)
...

Why is BS4 making about 4 times more function calls? And why is it using HTMLParser if I set it to use lxml?

The most noticeable things I changed from BS3 to BS4 are these:

BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES) ---> 
BeautifulSoup(html, 'lxml') 

[x.getText('**SEP**') for x in i.findChildren('font')[:2]] ---> 
[x.getText('**SEP**', strip=True) for x in i.findChildren('font')[:2]]

Everything else is just some name changes (like findParent -> find_parent).

Edit 2:

Here is a small code sample to try it out:

from cProfile import Profile 

from BeautifulSoup import BeautifulSoup 
from bs4 import BeautifulSoup as BS4 
import urllib2 


def parse(html): 

    soup = BS4(html, 'lxml') 
    hl = soup.find_all('span', {'class': 'mw-headline'}) 
    return [x.get_text(strip=True) for x in hl] 


def parse3(html): 

    soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES) 
    hl = soup.findAll('span', {'class': 'mw-headline'}) 
    return [x.getText() for x in hl] 


if __name__ == "__main__": 
    opener = urllib2.build_opener() 
    opener.addheaders = [('User-agent', 'Mozilla/5.0')] 
    html = ''.join(opener.open('http://en.wikipedia.org/wiki/Price').readlines()) 

    profiler = Profile() 
    print profiler.runcall(parse, html) 
    profiler.print_stats() 

    profiler2 = Profile() 
    print profiler2.runcall(parse3, html) 
    profiler2.print_stats()
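
The listings above are "Ordered by: standard name", which sorts entries by file and function name and makes the expensive calls hard to spot. As a minimal sketch, assuming Python 3 and using only the standard library, the stats can instead be sorted by cumulative time and trimmed to the top entries. The helper `summarize` and the workload `count_headlines` are hypothetical names introduced for illustration; they are not part of the original sample or of Beautiful Soup.

```python
import cProfile
import io
import pstats


def summarize(func, *args, top=10):
    """Run func under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)

    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # strip_dirs() shortens file paths; it must come before sort_stats()
    # because it resets any previous ordering.
    stats.strip_dirs().sort_stats("cumulative").print_stats(top)
    return result, buf.getvalue()


def count_headlines(html):
    # Hypothetical stand-in workload; swap in parse() or parse3() from the sample.
    return html.count('class="mw-headline"')


if __name__ == "__main__":
    sample = '<span class="mw-headline">A</span>' * 1000
    result, report = summarize(count_headlines, sample)
    print(result)
    print(report)
```

Sorting by cumulative time puts the functions that dominate the run at the top of the report, which makes two profiles easier to compare than an alphabetical listing.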

2012-07-02 Hassek

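We can't reproduce your results unless you give us a sample URL that demonstrates the problem. (Also, have you determined whether lxml.html exhibits this problem, or only BS4?)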

Only BS4; I haven't tried this with lxml alone. Let me create a simple example real quick so you can reproduce it. – Hassek

OK, I just added a small example so everyone can try it out. – Hassek

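I think the main problem is a bug in Beautiful Soup 4. I've filed it, and a fix will be released in the next version. Thanks for finding this.

That said, I have no idea why your profile mentions the HTMLParser class at all, given that you're using lxml.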
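
One possible reading, which this thread does not confirm: entries such as `HTMLParser.py:1(<module>)` and `HTMLParser.py:54(HTMLParseError)` are what cProfile records when a module's top-level code and a class body run, that is, when the module is imported while the profiler is active, not when it is used for parsing. The sketch below, assuming Python 3 and the standard library only, shows that effect with a throwaway module. The module name `demo_parser_mod` and the function `profile_first_import` are hypothetical and exist only for this demonstration.

```python
import cProfile
import importlib
import io
import os
import pstats
import shutil
import sys
import tempfile


def profile_first_import():
    """Profile a call whose first action is importing a not-yet-loaded module."""
    tmpdir = tempfile.mkdtemp()
    # Hypothetical throwaway module, written to disk just for this demo.
    with open(os.path.join(tmpdir, "demo_parser_mod.py"), "w") as fh:
        fh.write("class DemoError(Exception):\n    pass\n")

    sys.path.insert(0, tmpdir)
    sys.modules.pop("demo_parser_mod", None)  # force a fresh import
    importlib.invalidate_caches()

    def work():
        # The import runs inside the profiled call, so the module's
        # top-level code is executed under the profiler and recorded.
        import demo_parser_mod
        return demo_parser_mod.DemoError.__name__

    try:
        profiler = cProfile.Profile()
        result = profiler.runcall(work)
        buf = io.StringIO()
        pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats()
    finally:
        sys.path.remove(tmpdir)
        sys.modules.pop("demo_parser_mod", None)
        shutil.rmtree(tmpdir, ignore_errors=True)
    return result, buf.getvalue()


if __name__ == "__main__":
    name, report = profile_first_import()
    print(name)
    print(report)
```

If that reading applies here, importing everything before starting the profiler would keep import-time work out of the numbers being compared.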

2012-07-02 20:05:48

Yes, and in the Wikipedia test none of that shows up. Thanks for pointing it out as a bug; I hope it gets fixed soon! – Hassek
