2016-02-12 35 views

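Find the text of a sibling element when the original element matches a specific string

I want to extract some price data from a bunch of HTML tables. The tables contain a variety of prices, and of course the table data tags don't contain anything useful.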

<div id="item-price-data"> 
    <table> 
    <tbody> 
     <tr> 
     <td class="some-class">Normal Price:</td> 
     <td class="another-class">$100.00</td> 
     </tr> 
     <tr> 
     <td class="some-class">Member Price:</td> 
     <td class="another-class">$90.00</td> 
     </tr> 
     <tr> 
     <td class="some-class">Sale Price:</td> 
     <td class="another-class">$80.00</td> 
     </tr> 
     <tr> 
     <td class="some-class">You save:</td> 
     <td class="another-class">$20.00</td> 
     </tr> 
    </tbody> 
    </table> 
</div> 

The only price I care about is the one paired with the element that has "Normal Price" as its text.

What I'd like to do is scan the table's descendants, find the <td> tag containing that text, and then pull the text from its sibling.

The problem I'm running into is that the BeautifulSoup descendants attribute returns a list of NavigableString, not Tag.

So, if I do this:

from bs4 import BeautifulSoup 
from urllib import request 

html = request.urlopen(url) 
soup = BeautifulSoup(html, 'lxml') 

div = soup.find('div', {'id': 'item-price-data'}) 
table_data = div.find_all('td') 

for element in table_data:
    if element.get_text() == 'Normal Price:':
        price = element.next_sibling

print(price) 

I get nothing. Is there a simple way to get the string value?


I just ran this and got '$100.00'; am I missing something?


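Yes. There's something I'm not getting either. I found that the 'Tag' is there, but it isn't the next sibling. The next sibling is a carriage return.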

Answers


You can use the find_next() method; you may also need a little regular expression:

Demo:

>>> import re 
>>> from bs4 import BeautifulSoup 
>>> html = """<div id="item-price-data"> 
... <table> 
...  <tbody> 
...  <tr> 
...   <td class="some-class">Normal Price:</td> 
...   <td class="another-class">$100.00</td> 
...  </tr> 
...  <tr> 
...   <td class="some-class">Member Price:</td> 
...   <td class="another-class">$90.00</td> 
...  </tr> 
...  <tr> 
...   <td class="some-class">Sale Price:</td> 
...   <td class="another-class">$80.00</td> 
...  </tr> 
...  <tr> 
...   <td class="some-class">You save:</td> 
...   <td class="another-class">$20.00</td> 
...  </tr> 
...  </tbody> 
... </table> 
... </div>""" 
>>> soup = BeautifulSoup(html, 'lxml') 
>>> div = soup.find('div', {'id': 'item-price-data'}) 
>>> for element in div.find_all('td', text=re.compile('Normal Price')): 
...  price = element.find_next('td') 
...  print(price) 
... 
<td class="another-class">$100.00</td> 

If you'd rather not use a regular expression, the following will work for you.

>>> table_data = div.find_all('td') 
>>> for element in table_data: 
...  if 'Normal Price' in element.get_text(): 
...   price = element.find_next('td') 
...   print(price) 
... 
<td class="another-class">$100.00</td> 
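
As the comments point out, the node immediately after the matching <td> is the whitespace between the tags, which is why next_sibling does not give the price. The following is a minimal sketch of one way to skip that text node and get the plain string, assuming a trimmed copy of the table above and the built-in 'html.parser' (swapped in here for 'lxml' only so the example has no extra dependency). The names label, raw_sibling and price_cell are illustrative and do not come from the original posts.

```python
from bs4 import BeautifulSoup, NavigableString

# Trimmed copy of the table from the question (assumption: same layout,
# with newlines and indentation between the <td> tags).
html = """<table>
  <tr>
    <td class="some-class">Normal Price:</td>
    <td class="another-class">$100.00</td>
  </tr>
  <tr>
    <td class="some-class">Member Price:</td>
    <td class="another-class">$90.00</td>
  </tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

# Locate the label cell by its visible text.
label = next(
    td for td in soup.find_all("td")
    if td.get_text(strip=True) == "Normal Price:"
)

# next_sibling returns whatever node comes next, here the whitespace
# between the two <td> tags, as a NavigableString.
raw_sibling = label.next_sibling
print(repr(raw_sibling), isinstance(raw_sibling, NavigableString))

# find_next_sibling() only considers tags, so it skips the text node.
price_cell = label.find_next_sibling("td")

# get_text(strip=True) returns the cell's text without surrounding whitespace.
price = price_cell.get_text(strip=True)
print(price)  # $100.00
```

Compared with find_next('td'), find_next_sibling('td') stays within the same row, so it should not pick up a cell from a later row if the label happens to be the last cell in its <tr>.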