Cleaning up and removing tags with BeautifulSoup

I have the following script so far:

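# Python 2 script using mechanize and BeautifulSoup 3
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import re
import urllib2

# Fetch the page
br = Browser()
br.open("http://www.foo.com")

html = br.response().read()

# Parse it and collect every element with id="info"
soup = BeautifulSoup(html)
items = soup.findAll(id="info")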

It runs perfectly and leaves the following in "items":

<div id="info"> 
<span class="customer"><b>John Doe</b></span><br> 
123 Main Street<br> 
Phone:5551234<br> 
<b><span class="paid">YES</span></b> 
</div>

However, I would like to take items and clean it up to get:

John Doe 
123 Main Street 
5551234

How can I remove the tags like this with BeautifulSoup and Python?

As always, thank you!

2010-06-30 Parker

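This will do it for this EXACT HTML. Obviously it does not tolerate any deviation, so you will want to add quite a bit of bounds checking and null checking, but here are the specifics for getting the data out as plain text.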

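items = soup.findAll(id="info")
# Customer name: the text inside <span class="customer"><b>...</b></span>
print(items[0].span.b.contents[0])
# Address: the text node after the first <br>
print(items[0].contents[3].strip())
# Phone: the text node after the second <br>, minus the "Phone:" label
print(items[0].contents[5].strip().split(":", 1)[1])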
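
For a version that tolerates some deviation, here is a minimal sketch of the same idea without hard-coded positions. Note that it does not use BeautifulSoup: it swaps in Python 3's standard-library html.parser so that it runs with no extra installs. The names InfoTextExtractor and extract_info are made up for this illustration. It also assumes that text inside class="paid" should be dropped, that any "Label:" prefix should be stripped, and that the non-void tags are properly closed.

```python
from html.parser import HTMLParser


class InfoTextExtractor(HTMLParser):
    """Collect the text lines found inside the element with id="info"."""

    # Tags that never get a closing tag, so they must not affect nesting depth.
    VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}

    def __init__(self, skip_classes=("paid",)):
        super().__init__()
        self.skip_classes = set(skip_classes)  # assumption: classes to ignore
        self.depth = 0       # nesting depth inside the id="info" element
        self.skip_depth = 0  # nesting depth inside an ignored element
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID_TAGS:
            return
        attrs = dict(attrs)
        if self.depth:
            self.depth += 1
        elif attrs.get("id") == "info":
            self.depth = 1
        classes = set((attrs.get("class") or "").split())
        if self.skip_depth:
            self.skip_depth += 1
        elif self.depth and self.skip_classes & classes:
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in self.VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth and not self.skip_depth and text:
            # Drop a leading "Label:" such as "Phone:" when one is present.
            label, sep, value = text.partition(":")
            self.lines.append(value.strip() if sep else text)


def extract_info(html):
    """Return the cleaned text lines, or an empty list if id="info" is absent."""
    parser = InfoTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.lines


sample = """<div id="info">
<span class="customer"><b>John Doe</b></span><br>
123 Main Street<br>
Phone:5551234<br>
<b><span class="paid">YES</span></b>
</div>"""

print("\n".join(extract_info(sample)))
```

Because it walks text nodes instead of indexing into contents, an extra line or a missing element does not raise an IndexError. The label stripping is crude, though: any line that contains a colon loses everything up to the first colon, so adjust that rule for real data.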

2010-07-01 00:42:23

Thanks, Peter, this is exactly what I needed! – Parker 2010-07-01 11:37:03
