2017-05-30 66 views
0

BeautifulSoup 4: spans containing "@" return strange results

I am able to get the list of spans I need using the following:

attrs = soup.find_all("span") 

This returns lists of spans holding the keys and values:

[ 
    <span>back camera resolution</span>, 
    <span class="even">12 MP</span> 
] 

[ 
    <span>front camera resolution</span>, 
    <span class="even">16 MP</span> 
] 

[ 
    <span>video resolution</span>, 
    <span class="even"><a class="__cf_email__" data-cfemail="b98b888f89c9f98a89dfc9ca" href="/cdn-cgi/l/email-protection">[email protected]</a><script data-cfhash="f9e31" type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script> - <a class="__cf_email__" data-cfemail="4677767e7636067576203635" href="/cdn-cgi/l/email-protection">[email protected]</a><script data-cfhash="f9e31" type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script> - <a class="__cf_email__" data-cfemail="5067626010616260362023" href="/cdn-cgi/l/email-protection">[email protected]</a><script data-cfhash="f9e31" type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return 
t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script> 
    </span> 
] 

The original HTML looks like this:

[image: screenshot of the original HTML]

Why is "video resolution" transformed like this?

Don't confuse the DOM viewer with the source the server sends to the browser. BeautifulSoup cannot execute the JavaScript code sent by the server.

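It looks like the server uses a JavaScript library to automatically obfuscate *email addresses*; the JavaScript code is executed by the browser to re-insert the text.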

@MartijnPieters Wow! If it is this complicated, I don't think it is that important and I'll skip it. Thanks.

Answers

3

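The site is using the CloudFlare email protection feature, which appears to have replaced every string containing an @ with an obfuscated (XOR-encrypted) value, to stop scrapers from harvesting email addresses. Each replacement includes the JavaScript code that decodes it.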
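To make the scheme concrete, here is a minimal sketch of what the encoding side could look like, inferred only from the data-cfemail values in the question: a one-byte key comes first, followed by the text XORed with that key, all hex-encoded. The obfuscate function is hypothetical and is not CloudFlare's actual code; how the real service picks its key is unknown here.

```python
def obfuscate(text, key):
    """Hypothetical encoder (an assumption, not CloudFlare's code):
    prefix the key byte, XOR every text byte with it, then hex-encode."""
    payload = text.encode('utf8')
    return bytes([key] + [b ^ key for b in payload]).hex()

# The key 0xb9 is the first byte of the first data-cfemail value in the question.
encoded = obfuscate('2160p@30fps', 0xb9)
print(encoded)  # prints b98b888f89c9f98a89dfc9ca
```

The result matches the first data-cfemail attribute in the question, which suggests the value behind it is a video mode rather than a real email address; it only looks like one because of the @.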

BeautifulSoup does not execute JavaScript, but your browser does, and it has replaced the <a class="__cf_email__"> tags with the resulting decoded data.

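You can do the same thing with a small Python 3 function; all the JavaScript code does is "decrypt" the (hex-encoded) value with a simple XOR routine that uses the first byte as the key: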

def decode(cfemail):
    # the first byte is the XOR key, the remaining bytes are the encoded text
    enc = bytes.fromhex(cfemail)
    return bytes([c ^ enc[0] for c in enc[1:]]).decode('utf8')

def deobfuscate_cf_email(soup):
    for encrypted_email in soup.select('a.__cf_email__'):
        decrypted = decode(encrypted_email['data-cfemail'])
        # remove the <script> tag from the tree, if one follows the link
        script_tag = encrypted_email.find_next_sibling('script')
        if script_tag is not None:
            script_tag.decompose()
        # replace the <a class="__cf_email__"> tag with the decoded result
        encrypted_email.replace_with(decrypted)

To make the above work in Python 2, replace bytes with bytearray.
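As a worked example of the XOR step, the trace below decodes the first data-cfemail value from the question one byte at a time. It is only an illustration; the variable names are chosen for this sketch and come from no library.

```python
cfemail = 'b98b888f89c9f98a89dfc9ca'  # first data-cfemail value from the question
enc = bytes.fromhex(cfemail)

# The first byte is the key; everything after it is the XOR-encoded text.
key, payload = enc[0], enc[1:]
print('key: 0x%02x' % key)

for byte in payload:
    plain = byte ^ key
    print('0x%02x ^ 0x%02x = 0x%02x -> %r' % (byte, key, plain, chr(plain)))

decoded = bytes(b ^ key for b in payload).decode('utf8')
print(decoded)  # prints 2160p@30fps
```

Because XOR is its own inverse, applying the same key a second time restores the original bytes, which is why the decoder needs nothing beyond the attribute value itself.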

Demo:

>>> from bs4 import BeautifulSoup 
>>> soup = BeautifulSoup(''' 
...  <span>video resolution</span>, 
...  <span class="even"><a class="__cf_email__" data-cfemail="b98b888f89c9f98a89dfc9ca" href="/cdn-cgi/l/email-protection">[email protected]</a><script data-cfhash="f9e31" type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script> - <a class="__cf_email__" data-cfemail="4677767e7636067576203635" href="/cdn-cgi/l/email-protection">[email protected]</a><script data-cfhash="f9e31" type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script> - <a class="__cf_email__" data-cfemail="5067626010616260362023" href="/cdn-cgi/l/email-protection">[email protected]</a><script data-cfhash="f9e31" type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return 
t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script> 
...  </span> 
... ''') 
>>> deobfuscate_cf_email(soup) 
>>> soup 
<html><body><span>video resolution</span>, 
    <span class="even">2160p@30fps - 1080p@30fps - 720@120fps 
</span> 
</body></html>
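
Once the obfuscated links are replaced, the spans can be paired into keys and values as the question intended. The following is a rough sketch under two assumptions: the spans alternate label, value, label, value as in the question's output, and the sample HTML is shortened (the inline script tags are left out). The name specs is illustrative only.

```python
from bs4 import BeautifulSoup

def decode(cfemail):
    # first byte is the XOR key, the rest is the encoded text
    enc = bytes.fromhex(cfemail)
    return bytes(c ^ enc[0] for c in enc[1:]).decode('utf8')

# Shortened sample markup (assumption: <script> siblings omitted for brevity).
html = '''
<span>back camera resolution</span>
<span class="even">12 MP</span>
<span>video resolution</span>
<span class="even"><a class="__cf_email__" data-cfemail="5067626010616260362023" href="/cdn-cgi/l/email-protection">[email protected]</a></span>
'''

soup = BeautifulSoup(html, 'html.parser')

# Swap every obfuscated link for its decoded text.
for link in soup.select('a.__cf_email__'):
    link.replace_with(decode(link['data-cfemail']))

# Assumption: spans come in (label, value) pairs, in document order.
spans = soup.find_all('span')
specs = {label.get_text(strip=True): value.get_text(strip=True)
         for label, value in zip(spans[0::2], spans[1::2])}
print(specs)
```

If the real page nests or interleaves other spans, the pairing step would need a more specific selector than find_all('span').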