How do I scrape data from JSON/JavaScript embedded in a web page?

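I'm new to Python and have only just started using it.
My environment is Python 3.5 with a few libraries installed, on Windows 10. How do I scrape data from the JSON/JavaScript embedded in a web page?

I want to extract the football player data from the site below as a CSV file.

Problem: I can't turn the data in soup.find_all('script')[17] into the CSV format I expect. How can I extract this data in the form I need?

My code is shown below.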

from bs4 import BeautifulSoup 
import re 
from urllib.request import Request, urlopen 

req = Request('http://www.futhead.com/squad-building-challenges/squads/343', headers={'User-Agent': 'Mozilla/5.0'}) 
webpage = urlopen(req).read() 
soup = BeautifulSoup(webpage,'html.parser') #not sure if i need to use lxml 
soup.find_all('script')[17] # My target data is in the script tag at index 17

I expect the output to look like this:

position,slot_position,slug 
ST,ST,paulo-henrique 
LM,LM,mugdat-celik


2017-10-07 nisahc

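What exactly is your question, and where is the problem? – Thecave3

As @josiah Swain said, it won't be pretty. For this kind of thing JS is the more commonly recommended tool, because it natively understands the data you have.

That said, Python is awesome, and here is your solution!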

#Same imports as before 
from bs4 import BeautifulSoup 
import re 
from urllib.request import Request, urlopen 

#And one more 
import json 

# The code you had 
req = Request('http://www.futhead.com/squad-building-challenges/squads/343', 
       headers={'User-Agent': 'Mozilla/5.0'}) 
webpage = urlopen(req).read() 
soup = BeautifulSoup(webpage,'html.parser') 

# Store the script 
script = soup.find_all('script')[17] 

# Extract the one line that stores all that JSON
uncleanJson = [line for line in script.text.split('\n')
               if line.lstrip().startswith('squad.register_players($.parseJSON')][0]

# The easiest way to strip away all that yucky JS to get to the JSON
# (parentheses instead of backslashes, so stray trailing spaces cannot break the continuation)
cleanJSON = (uncleanJson.lstrip()
             .replace('squad.register_players($.parseJSON(\'', '')
             .replace('\'));', ''))

# Extract out that useful info 
data = [ [p['position'],p['data']['slot_position'],p['data']['slug']] 
     for p in json.loads(cleanJSON) 
     if p['player'] is not None] 


print('position,slot_position,slug') 
for line in data: 
    print(','.join(line))

The result I got after copying and pasting this into Python was:

position,slot_position,slug 
ST,ST,paulo-henrique 
LM,LM,mugdat-celik 
CAM,CAM,soner-aydogdu 
RM,RM,petar-grbic 
GK,GK,fatih-ozturk 
CDM,CDM,eray-ataseven 
LB,LB,kadir-keles 
CB,CB,caner-osmanpasa 
CB,CB,mustafa-yumlu 
RM,RM,ioan-adrian-hora 
GK,GK,bora-kork
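
The hard-coded index [17] stops working as soon as the page adds or removes a script tag. A minimal sketch of an alternative, assuming the page still embeds the data as squad.register_players($.parseJSON('...')): search the raw HTML for that call with a regular expression instead of counting script tags. SAMPLE_HTML and the extract_players helper are made up for illustration and only mimic the structure the code above relies on; the real page may escape characters differently.

```python
import json
import re

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<html><body>
<script>var unrelated = 1;</script>
<script>
    squad.register_players($.parseJSON('[{"position": "ST", "player": 1, "data": {"slot_position": "ST", "slug": "paulo-henrique"}}, {"position": "LM", "player": null, "data": {"slot_position": "LM", "slug": "empty-slot"}}]'));
</script>
</body></html>
"""

# Capture everything between parseJSON(' and the closing '));
PLAYERS_RE = re.compile(
    r"squad\.register_players\(\$\.parseJSON\('(.*?)'\)\);", re.DOTALL
)


def extract_players(html):
    """Return (position, slot_position, slug) rows for the filled slots."""
    match = PLAYERS_RE.search(html)
    if match is None:
        raise ValueError("squad.register_players(...) call not found")
    players = json.loads(match.group(1))
    return [
        (p["position"], p["data"]["slot_position"], p["data"]["slug"])
        for p in players
        if p["player"] is not None  # skip empty slots, as the answer does
    ]


rows = extract_players(SAMPLE_HTML)
print(rows)
```

With the real page, the decoded webpage bytes (for example webpage.decode('utf-8'), assuming the page is UTF-8) would take the place of SAMPLE_HTML.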

Edit: On reflection, this is not the easiest code for a beginner to read. Here is an easier-to-read version:

# ... All that previous code 
script = soup.find_all('script')[17] 

allScriptLines = script.text.split('\n') 

uncleanJson = None 
for line in allScriptLines:
    # Remove left whitespace (makes it easier to parse)
    cleaner_line = line.lstrip()
    if cleaner_line.startswith('squad.register_players($.parseJSON'):
        uncleanJson = cleaner_line

cleanJSON = uncleanJson.replace('squad.register_players($.parseJSON(\'', '').replace('\'));','')

print('position,slot_position,slug')
for player in json.loads(cleanJSON):
    if player['player'] is not None:
        # sep=',' gives comma-separated output instead of print's default spaces
        print(player['position'], player['data']['slot_position'], player['data']['slug'], sep=',')
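
Since the question asks for a CSV file rather than console output, the rows can also be handed to the standard-library csv module, which takes care of quoting if a value ever contains a comma. A small sketch, assuming the records have the same shape as the decoded JSON above; the sample records and the players_to_csv helper are invented for illustration:

```python
import csv
import io

# Hypothetical records shaped like the decoded JSON used in the answer.
players = [
    {"position": "ST", "player": 1,
     "data": {"slot_position": "ST", "slug": "paulo-henrique"}},
    {"position": "LM", "player": 2,
     "data": {"slot_position": "LM", "slug": "mugdat-celik"}},
    {"position": "CB", "player": None,
     "data": {"slot_position": "CB", "slug": "unused"}},
]


def players_to_csv(players, out):
    """Write one CSV row per filled slot to the file-like object `out`."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["position", "slot_position", "slug"])
    for p in players:
        if p["player"] is not None:
            writer.writerow(
                [p["position"], p["data"]["slot_position"], p["data"]["slug"]]
            )


# An in-memory buffer keeps the example self-contained.
buffer = io.StringIO()
players_to_csv(players, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a real file such as open('players.csv', 'w', newline='') in place of the buffer (the file name is just an example).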


2017-10-08 02:50:23 Splatmistro

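This works very well. Thank you so much for taking the time to explain how to solve this problem. Having read your code, I can see it is not easy for a beginner who has only just started learning Python. – nisahc

My understanding is that BeautifulSoup is better suited to parsing HTML, but you are trying to parse JavaScript nested inside the HTML.

So, you have two options:

1. Simply write a function that takes soup.find_all('script')[17], loops over the result, manually searches the string for the data, and extracts it. You could even use ast.literal_eval(string_thats_really_a_dictionary) to make it easier. This may not be the best approach, but if you are new to Python, you may want to do it this way just for practice.
2. Use the json library like in this example, or alternatively like this. This is probably the better approach.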
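
To make the two options concrete, here is a rough sketch on made-up strings (not taken from the page). json.loads understands JSON literals such as null, while ast.literal_eval only accepts Python literals, so it handles a dictionary-like string only when that string also happens to be valid Python:

```python
import ast
import json

# Made-up example string; it is valid JSON and, by coincidence, valid Python.
simple = '{"position": "ST", "slug": "paulo-henrique"}'
from_json = json.loads(simple)
from_ast = ast.literal_eval(simple)
print(from_json == from_ast)

# JSON's null is not a Python literal: json.loads maps it to None ...
with_null = '{"position": "LM", "player": null}'
parsed = json.loads(with_null)
print(parsed["player"])

# ... while ast.literal_eval refuses the same string.
try:
    ast.literal_eval(with_null)
    literal_eval_ok = True
except (ValueError, SyntaxError):
    literal_eval_ok = False
print(literal_eval_ok)
```

The code elsewhere in this thread checks player['player'] against None, which suggests the page's data contains null values, so json.loads is likely the safer choice here.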


2017-10-07 16:51:25

Could you give me some sample code for this problem? – nisahc
