Python: scraping data from NBA.com

I am trying to scrape data from NBA.com with Python, but when I run my code (shown below) I get no response, even after waiting a reasonable amount of time.

import requests
import json

# The query string is split around MeasureType so the measure can be swapped in.
url_front = ('http://stats.nba.com/stats/leaguedashplayerstats?College=&'
             'Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&'
             'DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&'
             'Location=&MeasureType=')
url_back = ('&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&'
            'PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&'
            'PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&'
            'SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&'
            'VsConference=&VsDivision=&Weight=')
#measure_type = ['Base','Advanced','Misc','Scoring','Opponent','Usage','Defense']
measure_type = 'Base'
address = url_front + measure_type + url_back

# Request the URL, then parse the JSON.
response = requests.get(address)
response.raise_for_status()   # Raise exception if invalid response.
data = response.json()    # JSON decoding.

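So far I have tried copying code (Python, R) from blog posts (here) and/or from questions posted on this site. They are all similar in nature, but I end up with the same result every time: the code never actually retrieves anything from the URL.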

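Since I am new to web scraping, I would appreciate help troubleshooting this. Is this common for client-side rendered sites such as NBA.com, or does it indicate a problem with my code or my computer? In either case, are there common workarounds or solutions?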

Source

2017-04-23 chbonfield

Have you tried opening the URL in a browser? It shows a message saying 'MeasureType is required'.

The link should work in a browser. Try this one in case you're still interested: http://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight= – chbonfield

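If you open the link in a browser, you will notice that it works. The reason is that the browser and requests send different User-Agent headers, and the site specifically blocks HTTP requests that do not look like they come from a browser, because it does not want to be scraped. You can get around it like this: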

response = requests.get(address, headers={ 
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:46.0) Gecko/20100101 Firefox/46.0', 
})

Keep that in mind, and don't overload their servers.
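For completeness, here is a slightly fuller sketch of the same idea. It passes the query string as a `params` dict instead of concatenating it, and it adds a `timeout`, since requests waits indefinitely by default, which would explain seeing no response at all. The helper names (`build_params`, `fetch_stats`), the 10-second timeout, and the injectable `get` argument are my own assumptions for illustration; whether the User-Agent header alone still satisfies the site is something you would have to check.

```python
BASE_URL = 'http://stats.nba.com/stats/leaguedashplayerstats'
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:46.0) '
                  'Gecko/20100101 Firefox/46.0',
}
TIMEOUT = 10  # seconds; an arbitrary choice for this sketch

# Parameters the original URL sends with an empty value.
EMPTY_PARAMS = [
    'College', 'Conference', 'Country', 'DateFrom', 'DateTo', 'Division',
    'DraftPick', 'DraftYear', 'GameScope', 'GameSegment', 'Height',
    'Location', 'Outcome', 'PlayerExperience', 'PlayerPosition',
    'SeasonSegment', 'ShotClockRange', 'StarterBench', 'VsConference',
    'VsDivision', 'Weight',
]


def build_params(measure_type='Base'):
    """Return the query parameters from the question as a dict."""
    params = dict.fromkeys(EMPTY_PARAMS, '')
    params.update({
        'LastNGames': '0', 'LeagueID': '00', 'MeasureType': measure_type,
        'Month': '0', 'OpponentTeamID': '0', 'PORound': '0',
        'PaceAdjust': 'N', 'PerMode': 'PerGame', 'Period': '0',
        'PlusMinus': 'N', 'Rank': 'N', 'Season': '2016-17',
        'SeasonType': 'Regular Season', 'TeamID': '0',
    })
    return params


def fetch_stats(measure_type='Base', get=None):
    """Request one measure type and return the decoded JSON.

    `get` defaults to requests.get; it is a parameter only so the helper
    can be tried out with a stand-in function and no network access.
    """
    if get is None:
        import requests
        get = requests.get
    response = get(BASE_URL, params=build_params(measure_type),
                   headers=HEADERS, timeout=TIMEOUT)
    response.raise_for_status()  # Raise an exception on a 4xx/5xx status.
    return response.json()
```

Calling `fetch_stats('Base')` then does what the question's code does, but fails with an exception after the timeout instead of hanging.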

Source

2017-04-23 22:43:43

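Thanks for the code and the discussion, this makes a lot of sense. Is there a way to tell which sites are likely to need additional information in the request (such as 'User-Agent' or other headers), or is it better practice to supply it regardless? – chbonfield

@chbonfield Trial and error. The more resources and motivation a site has to stop people from scraping it, the more checks there will be, and not all of them are as simple as inspecting the information sent in the request. For example, sites will usually suspect a bot if requests arrive too quickly. Eventually a site may require a CAPTCHA.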
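To illustrate the point about request rate: if you loop over several measure types, pausing between requests is one simple way to avoid hammering the server. This is only a sketch. `fetch_politely`, the one-second minimum delay, and the random jitter are assumptions of mine, not anything the site documents, and `fetch` stands for whatever function performs a single request.

```python
import random
import time


def fetch_politely(items, fetch, min_delay=1.0, jitter=1.0, sleep=time.sleep):
    """Call fetch(item) for each item, pausing between consecutive calls.

    The pause is min_delay plus a random amount up to `jitter` seconds.
    `sleep` is a parameter only so the pauses can be skipped when testing.
    """
    results = {}
    for index, item in enumerate(items):
        if index > 0:  # No need to wait before the very first request.
            sleep(min_delay + random.uniform(0, jitter))
        results[item] = fetch(item)
    return results
```

For example, `fetch_politely(['Base', 'Advanced', 'Misc'], fetch)` makes three requests with two pauses in between and returns a dict keyed by measure type.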
