Python Beautiful Soup: how to get a deeply nested element

I have a web page with the following structure:

<div id ="a"> 
    <table> 
     <td> 
      <!-- many tables and divs here --> 
     </td> 
     <td> 
      <table></table> 
      <table></table> 
      <div class="tabber"> 
       <table></table> 
       <table></table> <!-- TARGET TABLE --> 
      </div> 
     </td> 
    </table> 
</div>

That's right: unfortunately there is no id or class on the target or anywhere near it, except for "tabber".

I tried to get that div element:

import urllib2
from bs4 import BeautifulSoup

content = urllib2.urlopen(url).read() 
soup = BeautifulSoup(content) 

stats_div = soup.findAll('div', class_ = "tabber")[1] # 1 because there are 4 elements on page with that class and number 2 is the target one

But it did not work; it always outputs nothing.

I also tried traversing the whole tree from the top to reach the target table:

stats_table = soup.find(id='a').findChildren('table')[0].findChildren('td')[1].findChildren('div')[0].findChildren('table')[1]

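But that does not work either. Apparently findChildren('td') does not return the direct children of the first table; it returns all descendants instead, more than 100 td elements.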

How can I get only the direct children of an element?

Is there a simpler way to traverse such an ugly nested tree? Why can't I select the div by class? That would simplify everything.


2015-03-19 Euphe

What do you mean by *it did not work*? If there were no such div in the page, you would get an *error*. – 2015-03-19 12:05:31

@MartijnPieters I get an empty list. I can get other classes just fine, but this one does not work. In the page the class is "tabberlive", and when I try to get it, I get: http://i.gyazo.com/ab3ceaf1f9250795456d625c7c388960.png – Euphe 2015-03-19 12:14:17

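Then there is no such element in the resulting soup tree. There can be several reasons for this: the HTML served to you may simply not contain that class (the server may vary the response based on request headers, or the page may be altered by scripts in the browser), or the HTML is broken and your parser does not repair it the way your browser does (in which case, use a different parser). – 2015-03-19 12:20:29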

None of the code you show seems to reflect what is actually on the page:

There is no div tag with id='a'. In fact, there is not a single tag with that attribute. That is why your last command, stats_table = ..., fails.

There are exactly 3 div tags whose class attribute equals tabber, not 4:

>>> len(soup.find_all('div', class_="tabber")) 
3

And they are not empty either:

>>> len(soup.find_all('div', class_="tabber")[1]) 
7

There is not a single div tag with class tabber that has only 2 table children, but I assume that is because you greatly reduced your own example.

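If you want to scrape a site like this one, where you cannot easily select a tag by a unique id, you have no choice but to rely on other attributes, such as the tag name. Sometimes the position of tags relative to one another in the DOM is also a useful handle.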
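To illustrate selecting by tag name and position, here is a minimal sketch of restricting a search to direct children with recursive=False. The markup is a hypothetical, cleaned-up version of the reduced structure in the question (a tr is added, and the id "target" exists only so the result can be checked); it is not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question's reduced example.
# The id "target" is an assumption added only to verify the result.
html = """
<div id="a">
  <table>
    <tr>
      <td><table><tr><td>nested</td></tr></table></td>
      <td>
        <table></table>
        <table></table>
        <div class="tabber">
          <table></table>
          <table id="target"></table>
        </div>
      </td>
    </tr>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find(id="a").table.tr

# The default search is recursive, so td tags of nested tables are included.
all_cells = row.find_all("td")
# recursive=False restricts the search to direct children of the row.
direct_cells = row.find_all("td", recursive=False)

# Walk down one level at a time, again looking only at direct children.
tabber = direct_cells[1].find("div", class_="tabber", recursive=False)
target = tabber.find_all("table", recursive=False)[1]

print(len(all_cells), len(direct_cells), target.get("id"))
```

On this sample the recursive search finds three td tags while the non-recursive one finds only the two direct cells, which is the behaviour the question was asking for.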

For your specific problem, you can use the title attribute to great effect:

>>> from bs4 import BeautifulSoup 
>>> import urllib2 
>>> url = 'http://www.soccerstats.com/team.asp?league=england&teamid=24' 
>>> soup = BeautifulSoup(urllib2.urlopen(url).read(), 'lxml') 
>>> all_stats = soup.find('div', id='team-matches-and stats') 
>>> left_column, right_column = [x for x in all_stats.table.tr.children if x.name == 'td'] 
>>> table1, table2 = [x for x in right_column.children if x.name == 'table'] # the two tables at the top right 
>>> [x['title'] for x in right_column.find_all('div', class_='tabbertab')] 
['Stats', 'Scores', 'Goal times', 'Overall', 'Home', 'Away']

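The last part is the most interesting one: all the tables at the bottom right have a title attribute, which lets you select them more easily. Moreover, these attributes make the tags unique in the soup, so you can essentially select them directly: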

>>> stats_div = soup.find('div', class_="tabbertab", title="Stats") 
>>> len(stats_div.find_all('table', class_="stat")) 
3

These 3 items correspond to the "Current streaks", "Scoring" and "Home/Away advantage" sub-items.
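The same attribute-based selection can be tried offline. The following is a rough sketch on hypothetical markup: the class and title values mirror the ones above, but the number and content of the tables are invented for illustration. It also shows a CSS selector that should be equivalent, assuming a Beautiful Soup version that ships select() with attribute selector support.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: class and title values follow the answer,
# the tables themselves are made up for this example.
html = """
<div class="tabber">
  <div class="tabbertab" title="Stats">
    <table class="stat"></table>
    <table class="stat"></table>
  </div>
  <div class="tabbertab" title="Scores">
    <table class="stat"></table>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Extra keyword arguments such as title= act as attribute filters.
stats_div = soup.find("div", class_="tabbertab", title="Stats")
stat_tables = stats_div.find_all("table", class_="stat")

# The same selection written as a CSS selector.
same_tables = soup.select('div.tabbertab[title="Stats"] table.stat')

print(len(stat_tables), len(same_tables))
```

Both approaches return the two tables inside the "Stats" tab and skip the one under "Scores".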


2015-03-20 17:30:42
