Python BS4 with SDMX

I want to retrieve the data given in an SDMX file (such as https://www.bundesbank.de/cae/servlet/StatisticDownload?tsId=BBK01.ST0304&its_fileFormat=sdmx&mode=its). I tried using BeautifulSoup, but it does not seem to see the tags. With the following code:

import urllib2 
from bs4 import BeautifulSoup 
url = "https://www.bundesbank.de/cae/servlet/StatisticDownload?tsId=BBK01.ST0304&its_fileFormat=sdmx" 
html_source = urllib2.urlopen(url).read() 
soup = BeautifulSoup(html_source, 'lxml') 
ts_series = soup.findAll("bbk:Series")

This gives me an empty object.

Is BS4 the wrong tool, or (more likely) am I doing something wrong? Thanks in advance.

Source

2016-09-16 Daniel

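The URL you provided shows "Your request could not be processed!". Perhaps pasting a snippet of the XML would help. – flyingfoxlee

<?xml version="1.0" encoding="UTF-8"?> I don't know how to format it properly here. Sorry. – Daniel

You are right, but the URL is readable from Python, at least on my system. – Daniel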

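soup.findAll("bbk:series") will return results. In this case, even though you use lxml as the parser, BeautifulSoup still parses the document as HTML. HTML tag names are case-insensitive, so BeautifulSoup lowercases all of them, which is why soup.findAll("bbk:series") works and soup.findAll("bbk:Series") does not. See "Other parser problems" in the official documentation.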
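
The lowercasing can be reproduced on a small inline document. This is only a sketch under assumptions: the SAMPLE markup and its namespace URI are made up for illustration (the real Bundesbank file is not reproduced here), and it uses Python's built-in html.parser rather than lxml so that it runs without extra dependencies. html.parser lowercases tag names as well.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the SDMX file: the element names and the
# namespace URI are invented for illustration only.
SAMPLE = """<root xmlns:bbk="http://example.com/bbk">
  <bbk:Series ID="A"><bbk:Obs VALUE="1.0"/></bbk:Series>
  <bbk:Series ID="B"><bbk:Obs VALUE="2.0"/></bbk:Series>
</root>"""

# An HTML parser treats tag names as case-insensitive and stores them lowercased.
html_soup = BeautifulSoup(SAMPLE, "html.parser")

mixed_case = html_soup.find_all("bbk:Series")  # no stored tag has this exact name
lower_case = html_soup.find_all("bbk:series")  # matches the lowercased names

print(len(mixed_case), len(lower_case))
print([tag.name for tag in lower_case])
```

find_all is the current spelling of findAll; either name should behave the same way here.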

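If you want to parse it as XML, use soup = BeautifulSoup(html_source, 'xml') instead. This also uses lxml, because lxml is the only XML parser BeautifulSoup supports. You can then get the results with ts_series = soup.findAll("Series"), because BeautifulSoup strips the namespace prefix bbk.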
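
A minimal sketch of the XML route, again on made-up data: SAMPLE, its namespace URI, and the ID attribute are assumptions for illustration, not the structure of the actual Bundesbank file. It needs lxml installed, and how namespace prefixes are treated in searches may depend on the installed Beautiful Soup version.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded file, given as bytes the way
# urlopen(...).read() would return it. Names and the URI are invented.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:bbk="http://example.com/bbk">
  <bbk:Series ID="A"><bbk:Obs VALUE="1.0"/></bbk:Series>
  <bbk:Series ID="B"><bbk:Obs VALUE="2.0"/></bbk:Series>
</root>"""

# The "xml" feature selects lxml's XML parser, which keeps the
# original case of tag names.
xml_soup = BeautifulSoup(SAMPLE, "xml")

# Search by the local name, without the bbk prefix.
series = xml_soup.find_all("Series")
series_ids = [tag["ID"] for tag in series]

print(series_ids)
```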

Source

2016-09-16 13:58:47 flyingfoxlee

Oh dayum. Thank you very much. You saved my day after 2 hours of useless testing :) – Daniel
