Extracting data from an HTML file with BeautifulSoup and Python

I need to extract data from HTML files. The files in question are most likely automatically generated. I have uploaded the code of one of the files to Pastebin: http://pastebin.com/9Nj2Edfv. This is the link to the actual page: http://eur-lex.europa.eu/Notice.do?checktexts=checkbox&val=60504%3Acs&pos=1&page=1&lang=en&pgs=10&nbl=1&list=60504%3Acs%2C&hwords=&action=GO&visu=%23texte

The data I need to extract is found under different headings.

This is what I have so far:

from BeautifulSoup import BeautifulSoup

# Raw string so the backslash in the Windows path is never read as an escape
ecj_data = open(r"data\ecj_1.html", 'r').read()

soup = BeautifulSoup(ecj_data)

celex = soup.find('h1')
auth_lang = soup('ul', limit=14)[13].li
procedure = soup('ul', limit=20)[17].li

print "Celex number:", celex.renderContents(),
print "Authentic language:", auth_lang
print "Type of procedure:", procedure

I store all the data locally, which is why the script opens the file ecj_1.html.

The Celex number and the authentic language work reasonably well.

celex returns

"Celex number: 
61977J0059"

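auth_lang returns "Authentic language: <li>French</li>"

I need just the contents of the h1 tag, without the line break.

[In addition, I need auth_lang to return just "French", without the <li> tags.] This is no longer a problem. I realized I can add ".text" to the end of "auth_lang".

procedure, on the other hand, returns this: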

Type of procedure: <li> 
    <strong>Type of procedure:</strong> 
    <br /> 
    Reference for a preliminary ruling 
    </li>

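This is quite wrong, because I only need it to return "Reference for a preliminary ruling".

Is there any way to achieve this?

Second edit: I replaced celex = soup.find('h1') with celex = soup('h1', limit=2)[0] and added .text when printing celex.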

Asked by A2D2, 2012-03-20

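The contents of each of the elements you found are lists, and only the first two have length 1. procedure, however, is 5 elements long, and the entry you are after (in this case) is the one at index 4. I use splitlines() to get rid of the newlines.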

print "Celex number:", celex.contents[0].splitlines()[1] 
print "Authentic language:", auth_lang.contents[0].splitlines()[0] 
print "Type of procedure:", procedure.contents[4].splitlines()[1]

Output:

Celex number: 61977J0059 
Authentic language: French 
Type of procedure: Reference for a preliminary ruling
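
For readers on Python 3, the same idea can be sketched with the bs4 package (Beautiful Soup 4). This is an assumption on my part: the thread itself uses the older BeautifulSoup 3 import and Python 2 print statements. SAMPLE_HTML is a hypothetical, trimmed stand-in for ecj_1.html, so the list indices differ from the 13 and 17 used in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the relevant parts of ecj_1.html. The real page
# has many more <ul> elements, so the indices from the question do not apply.
SAMPLE_HTML = """
<h1>
61977J0059
</h1>
<ul><li>French</li></ul>
<ul><li>
    <strong>Type of procedure:</strong>
    <br />
    Reference for a preliminary ruling
    </li></ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# get_text(strip=True) returns the text without the surrounding newlines.
celex = soup.find("h1").get_text(strip=True)

lists = soup.find_all("ul")
auth_lang = lists[0].li.get_text(strip=True)

procedure_li = lists[1].li
# The <li> holds five children: whitespace, <strong>, whitespace, <br/>, text.
# The last child is the value; strip() removes the newlines around it.
procedure = procedure_li.contents[-1].strip()

print("Celex number:", celex)
print("Authentic language:", auth_lang)
print("Type of procedure:", procedure)
```

Taking the last child and calling strip() is one way to avoid depending on exactly how many lines the text node contains. If the label and the value ever share one text node, this approach would need adjusting.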

Answered by fraxel, 2012-03-20 14:42:31

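@fraxel: Thank you very much! It works like a charm. The idea is to somehow transfer the output from this file to a database. I believe that by showing me how to get rid of the newlines you may have solved a future problem as well, since they would probably have caused trouble later on. Again, thanks! – A2D2 2012-03-20 14:45:44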
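
The comment mentions moving the extracted values into a database. The thread does not say which database is meant, so the following is only a rough sketch using Python's built-in sqlite3 module. The table name ecj_cases and its column names are hypothetical, and the record simply reuses the values printed in the answer.

```python
import sqlite3

# Hypothetical record, using the values printed in the answer.
record = {
    "celex": "61977J0059",
    "auth_lang": "French",
    "procedure_type": "Reference for a preliminary ruling",
}

# In-memory database for illustration; pass a file path to persist the data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS ecj_cases ("
    "celex TEXT PRIMARY KEY, auth_lang TEXT, procedure_type TEXT)"
)

# Named placeholders let sqlite3 handle quoting of the values.
conn.execute(
    "INSERT INTO ecj_cases VALUES (:celex, :auth_lang, :procedure_type)",
    record,
)
conn.commit()

rows = conn.execute(
    "SELECT celex, auth_lang, procedure_type FROM ecj_cases"
).fetchall()
print(rows)
conn.close()
```

Stray newlines in a value would be stored as-is, so stripping them before the insert, as the answer does, keeps the stored keys clean.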
