2016-09-19

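Scraping data from interactive charts

There is a website with some interactive charts that I would like to extract data from. I have written a few web scrapers in Python with selenium webdriver before, but this seems to be a different kind of problem. I have looked at several similar questions on Stack Overflow. From those, it seems the solution might be to download the data directly from a JSON file. I went through the site's source code and identified several JSON files, but on inspection they do not appear to contain this data.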

Does anyone know how to download the data from these charts? In particular, I am interested in this bar chart: .//*[@id='network_download']

Thanks

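Edit: I should add that when I inspect the site with Firebug, I can see that the data might be available in the following format. That is clearly not helpful, though, because it does not contain any labels.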

<circle fill="#8CB1AA" cx="713.4318516666667" cy="5.357142857142858" r="4.5" style="opacity: 0.983087;"> 
<circle fill="#8CB1AA" cx="694.1212663333334" cy="10.714285714285715" r="4.5" style="opacity: 0.983087;"> 
<circle fill="#CEA379" cx="626.4726493333333" cy="16.071428571428573" r="4.5" style="opacity: 0.983087;"> 
<circle fill="#B0B359" cx="613.88416" cy="21.42857142857143" r="4.5" style="opacity: 0.983087;"> 
<circle fill="#D1D49E" cx="602.917665" cy="26.785714285714285" r="4.5" style="opacity: 0.983087;"> 
<circle fill="#A5E0B5" cx="581.5437366666666" cy="32.142857142857146" r="4.5" style="opacity: 0.983087;"> 

Answer

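SVG charts like this one tend to be a bit hard to scrape. The numbers you want are only displayed once you actually hover over the individual elements with the mouse.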

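To get the data you need to:

  1. Find all the dots
  2. For each dot in dots_list, click or hover over it (using action chains)
  3. Scrape the value from the tooltip that pops up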

This works for me:

from __future__ import print_function

from pprint import pprint as pp

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By


def main():
    driver = webdriver.Chrome()
    ac = ActionChains(driver)

    try:
        driver.get("https://opensignal.com/reports/2016/02/state-of-lte-q4-2015/")

        dots_css = "div#network_download g g.dots_container circle"
        dots_list = driver.find_elements(By.CSS_SELECTOR, dots_css)

        print("Found {0} data points".format(len(dots_list)))

        download_speeds = list()
        for index, _ in enumerate(dots_list, 1):
            # Because this is an SVG chart, and because we need to hover it,
            # it is very likely that the elements will go stale as we do this. For
            # that reason we need to re-query each dot element right before we click it
            single_dot_css = dots_css + ":nth-child({0})".format(index)
            dot = driver.find_element(By.CSS_SELECTOR, single_dot_css)
            dot.click()

            # Scrape the text from the popup
            popup_css = "div#network_download div.tooltip"
            popup_text = driver.find_element(By.CSS_SELECTOR, popup_css).text
            pp(popup_text)
            rank, comp_and_country, speed = popup_text.split("\n")
            company, country = comp_and_country.split(" in ")
            speed_dict = {
                "rank": rank.split(" Globally")[0].strip("#"),
                "company": company,
                "country": country,
                "speed": speed.split("Download speed: ")[1]
            }
            download_speeds.append(speed_dict)

            # Hover away from the tooltip so it clears
            hover_elem = driver.find_element(By.ID, "network_download")
            ac.move_to_element(hover_elem).perform()

        pp(download_speeds)

    finally:
        driver.quit()

if __name__ == "__main__":
    main()

Sample output:

(.venv35) ➜ stackoverflow python svg_charts.py 
Found 182 data points 
'#1 Globally\nSingTel in Singapore\nDownload speed: 40 Mbps' 
'#2 Globally\nStarHub in Singapore\nDownload speed: 39 Mbps' 
'#3 Globally\nSaskTel in Canada\nDownload speed: 35 Mbps' 
'#4 Globally\nOrange in Israel\nDownload speed: 35 Mbps' 
'#5 Globally\nolleh in South Korea\nDownload speed: 34 Mbps' 
'#6 Globally\nVodafone in Romania\nDownload speed: 33 Mbps' 
'#7 Globally\nVodafone in New Zealand\nDownload speed: 32 Mbps' 
'#8 Globally\nTDC in Denmark\nDownload speed: 31 Mbps' 
'#9 Globally\nT-Mobile in Hungary\nDownload speed: 30 Mbps' 
'#10 Globally\nT-Mobile in Netherlands\nDownload speed: 30 Mbps' 
'#11 Globally\nM1 in Singapore\nDownload speed: 29 Mbps' 
'#12 Globally\nTelstra in Australia\nDownload speed: 29 Mbps' 
'#13 Globally\nTelenor in Hungary\nDownload speed: 29 Mbps' 
<...> 
[{'company': 'SingTel', 
    'country': 'Singapore', 
    'rank': '1', 
    'speed': '40 Mbps'}, 
{'company': 'StarHub', 
    'country': 'Singapore', 
    'rank': '2', 
    'speed': '39 Mbps'}, 
{'company': 'SaskTel', 'country': 'Canada', 'rank': '3', 'speed': '35 Mbps'} 
... 
] 
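
The parsing step does not need a browser, so it can be pulled out and checked on its own against the tooltip strings shown above. The following is a minimal sketch, assuming every tooltip follows the same three-line layout as in the sample output. The names `TOOLTIP_RE` and `parse_tooltip` are made up for this illustration and are not part of the script above.

```python
import re

# Pattern for the three-line tooltip text seen in the sample output.
# Assumption: every tooltip has the form
#   "#<rank> Globally\n<company> in <country>\nDownload speed: <speed>"
TOOLTIP_RE = re.compile(
    r"#(?P<rank>\d+) Globally\n"
    r"(?P<company>.+) in (?P<country>.+)\n"
    r"Download speed: (?P<speed>.+)"
)


def parse_tooltip(text):
    """Return a dict with rank, company, country and speed, or None if the text does not match."""
    match = TOOLTIP_RE.fullmatch(text.strip())
    if match is None:
        return None
    return match.groupdict()


if __name__ == "__main__":
    sample = "#7 Globally\nVodafone in New Zealand\nDownload speed: 32 Mbps"
    print(parse_tooltip(sample))
```

Returning None for text that does not match, instead of raising as the tuple unpacking does, lets the scraping loop skip or log a tooltip that has not rendered yet.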

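It should be noted that the values you quote in your question, the ones in the circle elements, are not particularly useful, because they only describe how to draw the dots in the SVG chart.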
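
To make that concrete, here is a small sketch that reads two of the circle elements quoted in the question with Python's standard XML parser. The wrapping `<g>` element and the self-closing tags are added here only so that the fragment is well-formed, and `circle_geometry` is a hypothetical helper name. All that comes out is a position, a radius-independent colour, and nothing that identifies an operator or a speed.

```python
import xml.etree.ElementTree as ET

# Two <circle> elements from the question, self-closed and wrapped in a <g>
# root so the fragment parses as XML (the wrapper is an addition for this sketch).
SVG_FRAGMENT = """
<g>
  <circle fill="#8CB1AA" cx="713.4318516666667" cy="5.357142857142858" r="4.5" style="opacity: 0.983087;"/>
  <circle fill="#CEA379" cx="626.4726493333333" cy="16.071428571428573" r="4.5" style="opacity: 0.983087;"/>
</g>
"""


def circle_geometry(svg_text):
    """Return a (cx, cy, fill) tuple for every <circle> in the fragment."""
    root = ET.fromstring(svg_text.strip())
    return [
        (float(circle.get("cx")), float(circle.get("cy")), circle.get("fill"))
        for circle in root.iter("circle")
    ]


if __name__ == "__main__":
    for cx, cy, fill in circle_geometry(SVG_FRAGMENT):
        # Only drawing coordinates and a colour are available here.
        print(cx, cy, fill)
```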