2017-10-07 71 views
1

How do I scrape data from JSON/JavaScript in a web page?

I am new to Python and have just started using it.
My environment is Python 3.5 with a few libraries, on Windows 10.

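I want to extract football player data from the website below as a CSV file.

Problem: I cannot get the data in soup.find_all('script')[17] into the CSV format I expect. How can I extract the data the way I need it?

My code is shown below.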

from bs4 import BeautifulSoup 
import re 
from urllib.request import Request, urlopen 

req = Request('http://www.futhead.com/squad-building-challenges/squads/343', headers={'User-Agent': 'Mozilla/5.0'}) 
webpage = urlopen(req).read() 
soup = BeautifulSoup(webpage,'html.parser') #not sure if i need to use lxml 
soup.find_all('script')[17] # my target data is in the <script> tag at index 17

I expect the output to look like this:

position,slot_position,slug 
ST,ST,paulo-henrique 
LM,LM,mugdat-celik 

What is your problem, and where is the question? – Thecave3

Answers

0

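As @Josiah Swain said, this is not going to be pretty. For this kind of thing JS is usually the recommended tool, because it natively understands the data you have.

That said, Python is awesome, and here is your solution!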

#Same imports as before 
from bs4 import BeautifulSoup 
import re 
from urllib.request import Request, urlopen 

#And one more 
import json 

# The code you had 
req = Request('http://www.futhead.com/squad-building-challenges/squads/343', 
       headers={'User-Agent': 'Mozilla/5.0'}) 
webpage = urlopen(req).read() 
soup = BeautifulSoup(webpage,'html.parser') 

# Store the script 
script = soup.find_all('script')[17] 

# Extract the one line that stores all that JSON
uncleanJson = [line for line in script.text.split('\n')
               if line.lstrip().startswith('squad.register_players($.parseJSON')][0]

# The easiest way to strip away all that yucky JS to get to the JSON
cleanJSON = (uncleanJson.lstrip()
             .replace('squad.register_players($.parseJSON(\'', '')
             .replace('\'));', ''))

# Extract the useful fields, skipping slots that have no player
data = [[p['position'], p['data']['slot_position'], p['data']['slug']]
        for p in json.loads(cleanJSON)
        if p['player'] is not None]


print('position,slot_position,slug') 
for line in data: 
    print(','.join(line)) 

The result I got after copying and pasting this into Python was:

position,slot_position,slug 
ST,ST,paulo-henrique 
LM,LM,mugdat-celik 
CAM,CAM,soner-aydogdu 
RM,RM,petar-grbic 
GK,GK,fatih-ozturk 
CDM,CDM,eray-ataseven 
LB,LB,kadir-keles 
CB,CB,caner-osmanpasa 
CB,CB,mustafa-yumlu 
RM,RM,ioan-adrian-hora 
GK,GK,bora-kork 
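
Since the question asks for a CSV file rather than printed text, the rows can also be handed to the standard csv module. This is only a minimal sketch: the two sample rows are copied from the output above, and the helper name write_players_csv and the file name players.csv are my own assumptions, not anything from the site.

```python
import csv
import io

# Sample rows in the same shape as the `data` list built above
# (copied from the printed result; a real run would use the scraped list).
data = [
    ["ST", "ST", "paulo-henrique"],
    ["LM", "LM", "mugdat-celik"],
]


def write_players_csv(rows, out_file):
    """Write a header line plus one line per player to an open text file."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["position", "slot_position", "slug"])
    writer.writerows(rows)


# Demonstrate with an in-memory buffer so nothing is written to disk.
buffer = io.StringIO()
write_players_csv(data, buffer)
csv_text = buffer.getvalue()
print(csv_text, end="")

# To write a real file instead (players.csv is a hypothetical name):
# with open("players.csv", "w", newline="", encoding="utf-8") as f:
#     write_players_csv(data, f)
```

Using csv.writer instead of ','.join also takes care of quoting if a value ever contains a comma.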

Edit: on reflection, this is not the easiest code for a beginner to read. Here is an easier-to-read version:

# ... All that previous code 
script = soup.find_all('script')[17] 

allScriptLines = script.text.split('\n') 

uncleanJson = None
for line in allScriptLines:
    # Remove left whitespace (makes it easier to parse)
    cleaner_line = line.lstrip()
    if cleaner_line.startswith('squad.register_players($.parseJSON'):
        uncleanJson = cleaner_line
        break  # the first matching line is the one we want

if uncleanJson is None:
    raise ValueError('register_players line not found in the script')

cleanJSON = uncleanJson.replace('squad.register_players($.parseJSON(\'', '').replace('\'));', '')

print('position,slot_position,slug')
for player in json.loads(cleanJSON):
    if player['player'] is not None:
        # sep=',' keeps the rows comma-separated, matching the header
        print(player['position'], player['data']['slot_position'], player['data']['slug'], sep=',')
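
Both versions above depend on the data sitting in the script at index 17, which will break if the page adds or removes a script tag. A possible alternative is to search the raw page text with a regular expression. The sketch below assumes the page still contains a squad.register_players($.parseJSON('...')); call of the same shape; sample_html is a made-up, heavily shortened stand-in for the real page, and extract_players is a hypothetical helper name.

```python
import json
import re

# Hypothetical, shortened stand-in for the downloaded page source.
sample_html = """
<script>
    var squad = new Squad();
    squad.register_players($.parseJSON('[{"position": "ST", "player": 1, "data": {"slot_position": "ST", "slug": "paulo-henrique"}}, {"position": "SUB", "player": null, "data": null}]'));
</script>
"""

# Capture everything between parseJSON(' and the closing '));
PLAYERS_CALL = re.compile(
    r"squad\.register_players\(\$\.parseJSON\('(.*?)'\)\);", re.DOTALL
)


def extract_players(html):
    """Return [position, slot_position, slug] for every filled slot."""
    match = PLAYERS_CALL.search(html)
    if match is None:
        raise ValueError("register_players call not found")
    return [
        [p["position"], p["data"]["slot_position"], p["data"]["slug"]]
        for p in json.loads(match.group(1))
        if p["player"] is not None
    ]


rows = extract_players(sample_html)
for row in rows:
    print(",".join(row))
```

With the real page you would pass the decoded download, for example webpage.decode('utf-8'), instead of sample_html. I have not checked which encoding the site actually uses.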

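This worked very well, thank you so much for taking the time to explain how to solve this. Even after reading your code, it is not easy for someone who has only just started learning Python. – nisahc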

0

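So my understanding is that BeautifulSoup is better suited to parsing HTML, but you are trying to parse JavaScript nested inside the HTML.

So you have two options:

  1. Simply write a function that takes soup.find_all('script')[17], loops over the result, manually searches the string for the data, and extracts it. You can even use ast.literal_eval(string_thats_really_a_dictionary) to make it easier. This may not be the best approach, but if you are new to Python you may want to do it this way just for practice.
  2. Use the json library, as in this example, or alternatively like this. This is probably the better approach.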
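
To illustrate the difference between the two options, here is a small sketch. The two strings are made-up fragments of the kind you might cut out of the script tag, not actual data from the page. ast.literal_eval only accepts text that is also a valid Python literal, so it fails as soon as the data contains JSON-only values such as null, which is one reason the json library is usually the safer choice here.

```python
import ast
import json

# Hypothetical fragments of the kind found inside the script tag.
python_like = "{'position': 'ST', 'slug': 'paulo-henrique'}"
json_text = '{"position": "ST", "player": null}'

# Option 1: ast.literal_eval works when the text is a valid Python literal.
parsed_literal = ast.literal_eval(python_like)
print(parsed_literal["slug"])

# JSON's null/true/false are not Python literals, so literal_eval rejects them.
try:
    ast.literal_eval(json_text)
    literal_eval_ok = True
except (ValueError, SyntaxError):
    literal_eval_ok = False
print("literal_eval accepted the JSON text:", literal_eval_ok)

# Option 2: json.loads understands null and maps it to None.
parsed_json = json.loads(json_text)
print(parsed_json["player"] is None)
```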

Could you give me some sample code for this problem? – nisahc