BeautifulSoup and searching by class

Possible duplicate:
Beautiful Soup cannot find a CSS class if the object has other classes, too

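I'm using BeautifulSoup to find tables in HTML. The problem I'm running into is spaces in the class attribute. If my HTML reads <html><table class="wikitable sortable">blah</table></html>, I can't seem to extract it with the following (which I expected to find tables whose class is either wikitable or wikitable sortable):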

BeautifulSoup(html).findAll(attrs={'class':re.compile("wikitable(sortable)?")})

It does find the table if my HTML is just <html><table class="wikitable">blah</table></html>. I've also tried using "wikitable sortable" in my regular expression, and that doesn't match either. Any ideas?

2011-05-04 cryptic_star

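The pattern match will also fail if wikitable appears after another CSS class, as in class="something wikitable other". So if you want every table whose class attribute contains the class wikitable, you need a pattern that accepts more possibilities: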

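import re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; with bs4 use: from bs4 import BeautifulSoup

html = '''<html><table class="sortable wikitable other">blah</table>
<table class="wikitable sortable">blah</table>
<table class="wikitable"><blah></table></html>'''

tree = BeautifulSoup(html)
# Match any class attribute that contains "wikitable" as a whole word.
for node in tree.findAll(attrs={'class': re.compile(r".*\bwikitable\b.*")}):
    print(node)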

Result:

<table class="sortable wikitable other">blah</table> 
<table class="wikitable sortable">blah</table> 
<table class="wikitable"><blah></blah></table>
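
As a rough illustration that needs nothing beyond the standard library, the sketch below applies the pattern above to the three class strings with plain `re`, and compares it with splitting the attribute on whitespace. The helper name `has_class` is made up for this example, and how BeautifulSoup applies a compiled pattern internally is not covered here.

```python
import re

# The pattern from the answer above.
pattern = re.compile(r".*\bwikitable\b.*")

class_values = ["sortable wikitable other", "wikitable sortable", "wikitable"]

# match() anchors at the start, so the leading ".*" lets the class appear anywhere.
regex_hits = [value for value in class_values if pattern.match(value)]


def has_class(class_value, name):
    """Hypothetical helper: treat the attribute as a whitespace-separated token list."""
    return name in class_value.split()


token_hits = [value for value in class_values if has_class(value, "wikitable")]

# Both approaches accept all three sample values.
print(regex_hits == class_values)
print(token_hits == class_values)

# They differ on hyphenated names: \b sees a word boundary before "-".
print(bool(pattern.match("wikitable-wide")))
print(has_class("wikitable-wide", "wikitable"))
```

The token-based check is closer to how CSS itself treats the class attribute, so it may be the safer choice when class names can contain hyphens.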

Just for the record, I don't use BeautifulSoup and prefer lxml, as others have mentioned.

2011-05-04 22:49:51 samplebias

Just as an update, the latest version of BeautifulSoup (bs4) handles this much more elegantly: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class – Eli 2013-07-22 20:50:28
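
For reference, a minimal sketch of what that bs4 approach might look like, assuming the `beautifulsoup4` package is installed and using the built-in `html.parser` backend. The sample HTML is adapted from the answer above, and the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

html = """<html><table class="sortable wikitable other">blah</table>
<table class="wikitable sortable">blah</table>
<table class="wikitable">blah</table></html>"""

soup = BeautifulSoup(html, "html.parser")

# class_ matches any element that has "wikitable" among its classes,
# regardless of order or of which other classes are present.
tables = soup.find_all("table", class_="wikitable")

for table in tables:
    # bs4 exposes the class attribute as a list of individual class names.
    print(table.get("class"))

# The same query written as a CSS selector.
selected = soup.select("table.wikitable")
```

Because bs4 treats `class` as a multi-valued attribute, no regular expression is needed for the common "has this class" case.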

One of the things that makes lxml better than BeautifulSoup is its proper support for selecting by CSS class (it even supports full CSS selectors, if you want to use them):

import lxml.html 

html = """<html> 
<body> 
<div class="bread butter"></div> 
<div class="bread"></div> 
</body> 
</html>""" 

tree = lxml.html.fromstring(html) 

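# find_class() returns elements that have "bread" as one of their classes.
elements = tree.find_class("bread")

for element in elements:
    # encoding="unicode" returns text rather than bytes.
    print(lxml.html.tostring(element, encoding="unicode"))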

Gives:

<div class="bread butter"></div> 
<div class="bread"></div>
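
For comparison, here is a sketch of the same whole-class match written as a plain XPath query, which needs only lxml itself. This swaps in the common "pad with spaces" XPath idiom rather than a CSS selector (lxml's `cssselect()` method depends on the separate `cssselect` package). The extra `breadcrumbs` div is added here only to show that substrings are not matched.

```python
import lxml.html

html = """<html>
<body>
<div class="bread butter"></div>
<div class="bread"></div>
<div class="breadcrumbs"></div>
</body>
</html>"""

tree = lxml.html.fromstring(html)

# Pad the normalized class attribute with spaces so that " bread " only
# matches a whole class token, not a substring such as "breadcrumbs".
xpath = '//div[contains(concat(" ", normalize-space(@class), " "), " bread ")]'
matches = tree.xpath(xpath)

for element in matches:
    print(element.get("class"))
```

In this sample, `find_class("bread")` should select the same two elements, so the XPath form is mainly useful when the class test has to be combined with other conditions.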

2011-05-04 22:58:45 Acorn

+1 Even though this doesn't help @allie write BeautifulSoup code, lxml is far superior. – Henry 2011-05-04 23:00:49

While I appreciate the elegance, BeautifulSoup is already in place here, and for now it's what I need to use. :) – 2011-05-04 23:21:18
