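BeautifulSoup only helps with part of the problem: locating the script element that contains the objects you want. After that, you need either a JavaScript parser, such as slimit, or regular expressions. For example, something along these lines: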
import json
import re
from bs4 import BeautifulSoup
data = """
<script type="text/javascript">
var jobmap = {};
jobmap[0]= {jk:'929a2508c8bf2c9c',efccid: '28d4bd688c1e4e86',srcid:'4beb17a7fc4b64e2',cmpid:'be1c2a3db344744f',num:'0',srcname:'City of Oshawa',cmp:'City of Oshawa',cmpesc:'City of Oshawa',cmplnk:'/City-of-Oshawa-jobs-in-Ontario',loc:'Oshawa, ON',country:'CA',zip:'',city:'Oshawa',title:'Systems Analyst',locid:'da5ca33120fa5fe5',rd:'8i0xAbEkuWUhy6dasPEQkceDzWLtCZmZLj2Y-bGYlQI'};
jobmap[1]= {jk:'2d06bbaac441e7d2',efccid: 'beb412fe8b0feacc',srcid:'0a0f0bf6b7639c78',cmpid:'0c05d4e9f9f0206d',num:'1',srcname:'FGL Sports Ltd.',cmp:'FGL Sports Ltd.',cmpesc:'FGL Sports Ltd.',cmplnk:'/FGL-Sports-jobs-in-Ontario',loc:'Ontario',country:'CA',zip:'',city:'',title:'Decision Support Analyst',locid:'8b17acc5f001bdbf',rd:'v7_ZQyGHijdq7ng-cswbFDpj7KoE_Ia4YknbAcijYgE'};
</script>"""
soup = BeautifulSoup(data, "html.parser")
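# Locate the script by its contents; guard against scripts with no text
script = soup.find("script", string=lambda text: text and "var jobmap" in text)
# Non-greedy match of each object literal assigned to jobmap[N]
pattern = re.compile(r"jobmap\[\d+\]\s*=\s*({.*?})")
# findall() on a compiled pattern takes (string, pos, endpos), not flags
for item in pattern.findall(script.string):
    print(item)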
Prints:
{jk:'929a2508c8bf2c9c',efccid: '28d4bd688c1e4e86',srcid:'4beb17a7fc4b64e2',cmpid:'be1c2a3db344744f',num:'0',srcname:'City of Oshawa',cmp:'City of Oshawa',cmpesc:'City of Oshawa',cmplnk:'/City-of-Oshawa-jobs-in-Ontario',loc:'Oshawa, ON',country:'CA',zip:'',city:'Oshawa',title:'Systems Analyst',locid:'da5ca33120fa5fe5',rd:'8i0xAbEkuWUhy6dasPEQkceDzWLtCZmZLj2Y-bGYlQI'}
{jk:'2d06bbaac441e7d2',efccid: 'beb412fe8b0feacc',srcid:'0a0f0bf6b7639c78',cmpid:'0c05d4e9f9f0206d',num:'1',srcname:'FGL Sports Ltd.',cmp:'FGL Sports Ltd.',cmpesc:'FGL Sports Ltd.',cmplnk:'/FGL-Sports-jobs-in-Ontario',loc:'Ontario',country:'CA',zip:'',city:'',title:'Decision Support Analyst',locid:'8b17acc5f001bdbf',rd:'v7_ZQyGHijdq7ng-cswbFDpj7KoE_Ia4YknbAcijYgE'}
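Note that each item value cannot be loaded directly with json.loads(), because the keys are unquoted and the strings use single quotes, so it is not valid JSON. Use demjson.decode(), or look at other ways to load a JavaScript object string into a Python dictionary.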
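As one possible alternative when installing a third-party decoder is not an option, here is a minimal standard-library sketch. The helper name js_object_to_dict is hypothetical and not part of the original answer, and it assumes each item is a flat object literal with bare-identifier keys, single-quoted string values, and no escaped quotes or nested objects. Anything more complex would need a real JavaScript parser.

```python
import re

# Hypothetical helper, not from the original answer. Assumes a flat object
# literal: bare-identifier keys, single-quoted string values, no escaped
# quotes and no nested objects or arrays.
PAIR = re.compile(r"(\w+)\s*:\s*'([^']*)'")


def js_object_to_dict(item):
    """Convert a flat JavaScript object literal string into a dict."""
    return dict(PAIR.findall(item))


# Shortened sample in the same shape as the items printed above
item = "{jk:'929a2508c8bf2c9c',efccid: '28d4bd688c1e4e86',loc:'Oshawa, ON',zip:'',title:'Systems Analyst'}"
job = js_object_to_dict(item)
print(job["title"])
print(job["loc"])
```

Because the value pattern matches everything up to the next single quote, commas inside values such as 'Oshawa, ON' are preserved, which a naive split on commas would break.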
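That's an object (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object), not an array. – wpercy
If you have dynamic content, BeautifulSoup and urlopen are the wrong way to solve the problem –
@cricket_007 I think it depends. Sometimes the JavaScript content is present in the HTML (usually inside script tags), and it makes sense to go with the "simple" urlopen/requests approach to avoid the overhead and slowness of a browser-based or JavaScript-engine-based approach. That said, this approach is usually more fragile. It may not be strictly "wrong", but more like "use with caution and understanding" :) – alecxe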