Scraping data from multiple HTML tables within one website in Python

I want to scrape the time series from this website into Python: http://www.boerse-frankfurt.de/en/etfs/db+x+trackers+msci+world+information+technology+trn+index+ucits+etf+LU0540980496/price+turnover+history/historical+data#page=1

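I have gotten pretty far, but I don't know how to get all of the data rather than just the first 50 rows you can see on the page. To view the rest online, you have to click through the results at the bottom of the table. I would like to be able to specify a start and end date in Python and get all the corresponding dates and prices in a list. This is what I have so far: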

from bs4 import BeautifulSoup
import requests
import re

url = 'http://www.boerse-frankfurt.de/en/etfs/db+x+trackers+msci+world+information+technology+trn+index+ucits+etf+LU0540980496/price+turnover+history/historical+data'
# Name the parser explicitly (lxml must be installed) instead of importing it unused.
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# Each table row keeps its date and price in <td> cells with these classes.
dates = soup.find_all('td', class_='column-date')
dates = [re.sub(r'\s', '', d.string) for d in dates]  # strip all whitespace
prices = soup.find_all('td', class_='column-price')
prices = [re.sub(r'\s', '', p.string) for p in prices]
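
Once the date and price strings have been collected, the start and end dates asked for above could be applied on the client side. The following is only a minimal sketch: the helper name `filter_by_date` is made up for illustration, the `%d/%m/%Y` date format and the dot decimal separator are assumptions about what the site returns, and the sample values are invented rather than real quotes.

```python
from datetime import date, datetime


def filter_by_date(dates, prices, start, end, date_format="%d/%m/%Y"):
    """Return (date, price) pairs with start <= date <= end.

    date_format is an assumption: adjust it to whatever the
    scraped date strings actually look like.
    """
    selected = []
    for raw_date, raw_price in zip(dates, prices):
        day = datetime.strptime(raw_date, date_format).date()
        if start <= day <= end:
            # Assumes prices use a dot as the decimal separator.
            selected.append((day, float(raw_price)))
    return selected


# Made-up sample values, not real quotes.
sample_dates = ["19/09/2014", "18/09/2014", "09/05/2014"]
sample_prices = ["10.50", "10.40", "9.80"]
recent = filter_by_date(sample_dates, sample_prices,
                        date(2014, 9, 1), date(2014, 9, 30))
print(recent)
```

Filtering locally keeps the scraping step unchanged; it does not reduce the number of pages that have to be downloaded.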


2014-09-21 phildeutsch

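You need to iterate over the remaining pages. You can do that with POST requests. The server expects to receive a structure in each POST request; that structure is defined in `values` below. The page number is the structure's 'page' parameter. The structure has several other parameters that I have not tested but that might be interesting to try, such as items_per_page, max_time and min_time. Here is some example code: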

# Python 2 code: urllib.urlencode and urllib.urlopen were moved in Python 3.
from bs4 import BeautifulSoup
import urllib
import re

url = 'http://www.boerse-frankfurt.de/en/parts/boxes/history/_histdata_full.m' 
values = {'COMPONENT_ID':'PREeb7da7a4f4654f818494b6189b755e76', 
    'ag':'103708549', 
    'boerse_id': '12', 
    'include_url': '/parts/boxes/history/_histdata_full.m', 
    'item_count': '96', 
    'items_per_page': '50', 
    'lang': 'en', 
    'link_id': '', 
    'max_time': '2014-09-20', 
    'min_time': '2014-05-09', 
    'page': 1, 
    'page_size': '50', 
    'pages_total': '2', 
    'secu': '103708549', 
    'template': '0', 
    'titel': '', 
    'title': '', 
    'title_link': '', 
    'use_external_secu': '1'} 

dates = []
prices = []
while True:
    # Passing data makes urlopen send a POST request instead of a GET.
    data = urllib.urlencode(values)
    request = urllib.urlopen(url, data)
    soup = BeautifulSoup(request.read())
    temp_dates = soup.findAll('td', class_='column-date')
    temp_dates = [re.sub(r'\s', '', d.string) for d in temp_dates]
    temp_prices = soup.findAll('td', class_='column-price')
    temp_prices = [re.sub(r'\s', '', p.string) for p in temp_prices]
    if not temp_prices:
        break  # an empty page means we are past the last one
    dates += temp_dates
    prices += temp_prices
    values['page'] += 1
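
A `while True` loop like this only stops when a page comes back empty, so it can run forever if the server keeps returning the last page instead. The Python 3 sketch below separates the loop from the network code and adds two guards: a page cap and a check for a repeated page. It is an illustration under assumptions, not the answer's method: `collect_pages`, `fetch_rows` and `max_pages` are hypothetical names, and the fake server stands in for the real one.

```python
def collect_pages(fetch_rows, values, max_pages=1000):
    """Call fetch_rows(params) for page 1, 2, ... and gather all rows.

    fetch_rows is a hypothetical callable that sends the request and
    returns the parsed rows of one page as a list.
    """
    params = dict(values, page=1)  # copy, so the caller's dict is untouched
    rows, previous = [], None
    for _ in range(max_pages):
        page_rows = fetch_rows(params)
        # Stop on an empty page, or when the server repeats the last page.
        if not page_rows or page_rows == previous:
            break
        rows.extend(page_rows)
        previous = page_rows
        params["page"] += 1
    return rows


# Stand-in for the real server: two pages, then the last page repeats.
fake_pages = {1: [("19/09/2014", "10.50")], 2: [("18/09/2014", "10.40")]}


def fake_fetch(params):
    return fake_pages.get(params["page"], fake_pages[2])


all_rows = collect_pages(fake_fetch, {"lang": "en"})
print(all_rows)
```

In real use, `fetch_rows` would wrap the POST request and the BeautifulSoup parsing shown above.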


2014-09-21 15:23:01

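Thanks a lot, this looks like exactly what I was looking for. Two questions though: do you know how to make this work in Python 3? I tried `data = urllib.parse.urlencode(values); request = urllib.request.urlopen(url, data.encode('ascii')); soup = BeautifulSoup(request.read())`, which did not work (I get the same dates and prices over and over, and the loop never terminates). Also, how did you come up with the values dictionary in the first place? – phildeutsch 2014-09-21 15:52:58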

You can find an example of a POST request with Python 3 and urllib here: https://docs.python.org/3.1/howto/urllib2.html. I think you need to create a Request object first: `data = urllib.parse.urlencode(values); request = urllib.request.Request(url, data); response = urllib.request.urlopen(request); soup = BeautifulSoup(response.read())`. I used FireBug to extract the dictionary values; it is a Firefox extension that lets you see the content of HTTP requests in the browser. – 2014-09-21 16:09:42
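
In Python 3 the request body has to be bytes, so the URL-encoded string needs an explicit `.encode()` before it is handed to `Request`. The sketch below shows one way the Request-based approach could look. The helper name `build_post_request` is made up for illustration, the field dictionary is shortened, and the request is only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_post_request(url, values):
    """Build (but do not send) a POST request carrying the form fields."""
    body = urlencode(values).encode("ascii")  # Request data must be bytes
    return Request(url, data=body)


request = build_post_request(
    "http://www.boerse-frankfurt.de/en/parts/boxes/history/_histdata_full.m",
    {"page": 1, "lang": "en"},  # shortened; the full form has more fields
)
print(request.get_method())  # POST, because data is not None
print(request.data)
# To send it: urllib.request.urlopen(request).read()
```

Building the request separately makes it easy to inspect the method and body before any network traffic happens.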
