正如@josiah Swain所说,它不会很漂亮。对于这类事情,更推荐使用JS,因为它可以理解你拥有的东西。
说到这一点,python是真棒,这里是你的解决方案!
#Same imports as before
from bs4 import BeautifulSoup
import re
from urllib.request import Request, urlopen
#And one more
import json
# The code you had
req = Request('http://www.futhead.com/squad-building-challenges/squads/343',
headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage,'html.parser')
# Store the script
script = soup.find_all('script')[17]
# Extract the oneline that stores all that JSON
uncleanJson = [line for line in script.text.split('\n')
if line.lstrip().startswith('squad.register_players($.parseJSON') ][0]
# The easiest way to strip away all that yucky JS to get to the JSON
cleanJSON = uncleanJson.lstrip() \
.replace('squad.register_players($.parseJSON(\'', '') \
.replace('\'));','')
# Extract out that useful info
data = [ [p['position'],p['data']['slot_position'],p['data']['slug']]
for p in json.loads(cleanJSON)
if p['player'] is not None]
print('position,slot_position,slug')
for line in data:
print(','.join(line))
结果我得到了拷贝和粘贴到蟒蛇是这样的:
position,slot_position,slug
ST,ST,paulo-henrique
LM,LM,mugdat-celik
CAM,CAM,soner-aydogdu
RM,RM,petar-grbic
GK,GK,fatih-ozturk
CDM,CDM,eray-ataseven
LB,LB,kadir-keles
CB,CB,caner-osmanpasa
CB,CB,mustafa-yumlu
RM,RM,ioan-adrian-hora
GK,GK,bora-kork
编辑:在反思,这不是一个初学者最容易阅读的代码。这里是一个更容易阅读的版本
# ... All that previous code
script = soup.find_all('script')[17]
allScriptLines = script.text.split('\n')
uncleanJson = None
for line in allScriptLines:
# Remove left whitespace (makes it easier to parse)
cleaner_line = line.lstrip()
if cleaner_line.startswith('squad.register_players($.parseJSON'):
uncleanJson = cleaner_line
cleanJSON = uncleanJson.replace('squad.register_players($.parseJSON(\'', '').replace('\'));','')
print('position,slot_position,slug')
for player in json.loads(cleanJSON):
if player['player'] is not None:
print(player['position'],player['data']['slot_position'],player['data']['slug'])
你的问题和问题在哪里? – Thecave3