Extracting data - VoidCC

I want to download some HTML pages and extract information from them. Each HTML page contains a table tag like this:

<table class="sobi2Details" style='background-image: url(http://www.imd.ir/components/com_sobi2/images/backgrounds/grey.gif);border-style: solid; border-color: #808080' > 
    <tr> 
     <td><h1>Dr Jhon Doe</h1></td> 
    </tr> 
    <tr> 
     <td></td> 
    </tr> 
    <tr> 
     <td></td> 
    </tr> 
    <tr> 
     <td> 
      <div id="sobi2outer"> 
      <br/> 
      <span id="sobi2Details_field_name" ><span id="sobi2Listing_field_name_label">name:</span>Jhon</span><br/> 
      <span id="sobi2Details_field_family" ><span id="sobi2Listing_field_family_label">family:</span> Doe</span><br/> 
      <span id="sobi2Details_field_tel1" ><span id="sobi2Listing_field_tel1_label">tel:</span> 33727464</span><br/> 
      </div> 
     </td> 
    </tr> 
</table>

I want to access the name (Jhon), family (Doe), and tel (33727464). I used Beautiful Soup to access these span tags by id:

name=soup.find(id="sobi2Details_field_name").__str__() 
family=soup.find(id="sobi2Details_field_family").__str__() 
tel=soup.find(id="sobi2Details_field_tel1").__str__()

But I don't know how to extract the data inside these tags. I tried using the children and contents attributes, but when I use them on the tag, it returns None:

name=soup.find(id="sobi2Details_field_name") 
for child in name.children:
    pass  # process content inside

But I get this error:

'NoneType' object has no attribute 'children'

Yet when I call str() on it, the result is not None! Any ideas?

Edit: my final solution

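import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(page, from_encoding="utf-8")
name_span = soup.find(id="sobi2Details_field_name").__str__()
name = name_span.split(':')[-1]  # keep everything after the label's colon
result = re.sub('</span>', '', name)  # strip the trailing closing tag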

Source

2012-07-28 Asma Gheisari

Which version of Beautiful Soup are you using? What does 'type(name)' return? For me it returns [...]. I just installed BS4 with easy_install on Python 2.7.2 on OS X 10.8. – 2012-07-28 13:56:54

I have installed BS4 on Python 2.6. I don't know what type(name) is; I haven't used it! – 2012-07-28 14:13:32

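type(value) returns the type of value, so you can use it to help troubleshoot. If you put 'print type(name)' right after the 'name = soup.find(...)' line, you can see what type of result BS's 'find' method returned. – 2012-07-28 14:21:12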
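The debugging suggestion in the comment above can be sketched roughly as follows. This is only an illustration under assumptions: the HTML string is a trimmed copy of the table from the question, "no_such_id" is a made-up id, the "html.parser" backend is my choice, and the syntax is Python 3 (the thread itself uses Python 2). It shows that find() gives back None when nothing matches, which is what produces the 'NoneType' error.

```python
from bs4 import BeautifulSoup

# Assumption: a trimmed-down copy of the markup from the question.
html = """
<div id="sobi2outer">
<span id="sobi2Details_field_name" ><span id="sobi2Listing_field_name_label">name:</span>Jhon</span><br/>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

found = soup.find(id="sobi2Details_field_name")
missing = soup.find(id="no_such_id")  # hypothetical id that is not in the page

print(type(found))    # a bs4 Tag when the id exists
print(type(missing))  # NoneType when nothing matched

# Guard before touching attributes such as .children
if found is not None:
    children = list(found.children)
    print(children)
```

If type(name) turns out to be NoneType for the real page, the id was probably not present in the document that was actually parsed, so printing the parsed soup is a reasonable next step.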

I found a couple of ways to do it.

from bs4 import BeautifulSoup 
soup = BeautifulSoup(open(path_to_html_file)) 

name_span = soup.find(id="sobi2Details_field_name") 

# First way: split text over ':' 
# This only works because there's always a ':' before the target field 
name = name_span.text.split(':')[1] 

# Second way: iterate over the span strings 
# The element you look for is always the last one 
name = list(name_span.strings)[-1] 

# Third way: iterate over 'next' elements 
name = name_span.next.next.next # you can create a function to do that, it looks ugly :)

Let me know if that helps.
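For what it's worth, the second way above could be wrapped in a small helper so all three fields come out in one go. This is a minimal sketch, not a definitive implementation: field_value is a name I made up, the HTML is copied from the question, and the "html.parser" backend and Python 3 syntax are assumptions.

```python
from bs4 import BeautifulSoup

# Assumption: the three spans copied from the question's table.
html = """
<div id="sobi2outer">
<span id="sobi2Details_field_name" ><span id="sobi2Listing_field_name_label">name:</span>Jhon</span><br/>
<span id="sobi2Details_field_family" ><span id="sobi2Listing_field_family_label">family:</span> Doe</span><br/>
<span id="sobi2Details_field_tel1" ><span id="sobi2Listing_field_tel1_label">tel:</span> 33727464</span><br/>
</div>
"""


def field_value(soup, field_id):
    """Hypothetical helper: return the text after the label, or None if the id is absent."""
    span = soup.find(id=field_id)
    if span is None:
        return None
    # The label string comes first; the value is the last string in the span.
    return list(span.strings)[-1].strip()


soup = BeautifulSoup(html, "html.parser")
record = {
    key: field_value(soup, "sobi2Details_field_" + key)
    for key in ("name", "family", "tel1")
}
print(record)
```

The strip() call is there because the family and tel values in the sample markup start with a space.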

Source

2012-07-28 15:13:42

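Thank you. Your first method sounds really good, and it works. But my HTML contains Unicode, and I got an error when I tested the code. Do you have any suggestions? – 2012-07-28 17:33:00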

Can you provide the traceback with the error? – 2012-07-28 23:36:11

If you are familiar with XPath, you can use lxml with etree instead:

import urllib2 
from lxml import etree 

opener = urllib2.build_opener() 
root = etree.HTML(opener.open("myUrl").read()) 

print root.xpath("//span[@id='sobi2Details_field_name']/text()")[0]
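The same XPath idea can be tried without a network call by parsing an inline string. This is a sketch under assumptions: it uses Python 3 syntax rather than the Python 2 of the snippet above, the HTML is copied from the question, and xpath_field is a hypothetical helper name. The text() step picks up only the outer span's own text nodes, so the nested label span is skipped.

```python
from lxml import etree

# Assumption: the three spans copied from the question's table.
html = """
<div id="sobi2outer">
<span id="sobi2Details_field_name" ><span id="sobi2Listing_field_name_label">name:</span>Jhon</span><br/>
<span id="sobi2Details_field_family" ><span id="sobi2Listing_field_family_label">family:</span> Doe</span><br/>
<span id="sobi2Details_field_tel1" ><span id="sobi2Listing_field_tel1_label">tel:</span> 33727464</span><br/>
</div>
"""

root = etree.HTML(html)


def xpath_field(root, field_id):
    """Hypothetical helper: first direct text node of the span with this id, or None."""
    # $fid is an XPath variable that lxml fills from the keyword argument.
    texts = root.xpath("//span[@id=$fid]/text()", fid=field_id)
    return texts[0].strip() if texts else None


for key in ("name", "family", "tel1"):
    print(key, xpath_field(root, "sobi2Details_field_" + key))
```

Checking that the result list is non-empty before indexing avoids an IndexError on pages where a field is missing.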

Source

2012-07-28 21:22:01 Joey
