Python (BeautifulSoup) - Get href from <script>

I am working on a "Video Downloader" and I have a BeautifulSoup4 problem.

Here is the part of the HTML from which I want to get the a href:

<script src="/static/common.js?v7"></script> 
<script type="text/javascript"> 
      var c = 6; 
      window.onload = function() { 
       count(); 
      } 

      function closeAd(){ 
       $("#easy-box").hide(); 
      } 

      function notLogedIn(){ 
       $("#not-loged-in").html("You need to be logged in to download this movie!"); 
      } 

      function count() { 
       if(document.getElementById('countdown') != null){ 
        c -= 1; 
        //If the counter is within range we put the seconds remaining to the <span> below 
        if (c >= 0) 
         if(c == 0){ 
          document.getElementById('countdown').innerHTML = ''; 
         } 
         else { 
          document.getElementById('countdown').innerHTML = c; 
         } 
        else { 
         document.getElementById('download-link').innerHTML = '<a style="text-decoration:none;" href="http://s896.vshare.io/download,9999999999999999999999999999999999999999-f6192405453bf5ff3cfe41a488d8390d,5944ed28,4d948c5.avi">Click here</a> to download requested file.'; 
         return; 
        }   
        //setTimeout('count()', 1000); 
       } 
      } 
     </script> 
<script type="text/javascript" src="/static/flowplayer/flowplayer-3.2.13.min.js"></script>

Here is the href I want to print:

href="http://s896.vshare.io/download,9999999999999999999999999999999999999999-f6192405453bf5ff3cfe41a488d8390d,5944ed28,4d948c5.avi"

I tried this, but it doesn't work.

for a in soup3.find_all('a'): 
    if 'href' in a.attrs: 
        print(a['href'])

2017-06-16 jestembotem

The href is inside JavaScript. You can grab the JS part and extract the href with the help of [regex](https://docs.python.org/3/howto/regex.html). Take a look at this [question](https://stackoverflow.com/questions/24333189/parsing-js-with-beautiful-soup) – trotta

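Beautiful Soup can parse HTML and XML, but not JavaScript. The link you want sits inside a JavaScript string, so soup.find_all('a') never sees it. You can use a regular expression to search this code instead.
The pattern <a [^>]*?(href=\"([^\">]+)\") matches the anchor inside this code, piece by piece: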

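<a - the start of the a tag
[^>]*? - any characters other than >, as few as possible
href=\" - the literal text href="
[^\">]+ - one or more characters other than " and >
The outer parentheses capture the whole href="..." attribute as group 1; the inner parentheses capture only the URL as group 2.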
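To see how the pieces fit together, here is a minimal sketch that runs the pattern against a shortened stand-in for the JavaScript string. The sample text and the example.com URL are made up for illustration; only the pattern comes from the answer.

```python
import re

# Hypothetical, shortened stand-in for the string assigned to innerHTML.
sample = '<a style="text-decoration:none;" href="http://example.com/file.avi">Click here</a>'

# Same pattern as in the answer; inside a raw string, \" simply matches a double quote.
pattern = r"<a [^>]*?(href=\"([^\">]+)\")"

match = re.search(pattern, sample)
if match is not None:
    whole_attribute = match.group(1)  # the attribute, including href= and both quotes
    url_only = match.group(2)         # just the URL between the quotes
    print(whole_attribute)
    print(url_only)
```

Group 2 is usually the one you want when the goal is to download the file.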

The script code can be extracted from the HTML with
script = soup.find('script', {'type': 'text/javascript'})
and then searched with
re.search(r"<a [^>]*?(href=\"([^\">]+)\")", script.text)
Remember to import re first.

print(re.search(r"<a [^>]*?(href=\"([^\">]+)\")", script.text)[1]) 
# href="http://s896.vshare.io/download,9999999999999999999999999999999999999999-f6192405453bf5ff3cfe41a488d8390d,5944ed28,4d948c5.avi"
print(re.search(r"<a [^>]*?(href=\"([^\">]+)\")", script.text)[2]) 
# http://s896.vshare.io/download,9999999999999999999999999999999999999999-f6192405453bf5ff3cfe41a488d8390d,5944ed28,4d948c5.avi

Read up on regular expressions. If you are going to use the pattern often, compile it first.
https://docs.python.org/3/library/re.html
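Compiling could look roughly like the sketch below. It uses only the standard library and assumes the script bodies have already been pulled out of the page as plain strings. The names HREF_IN_ANCHOR and extract_hrefs, as well as the sample strings, are hypothetical.

```python
import re

# Compile the pattern once so it can be reused for every script body.
HREF_IN_ANCHOR = re.compile(r"<a [^>]*?(href=\"([^\">]+)\")")


def extract_hrefs(script_sources):
    """Return every anchor href found in an iterable of script strings."""
    hrefs = []
    for source in script_sources:
        # finditer yields one match object per <a ... href="..."> occurrence.
        for match in HREF_IN_ANCHOR.finditer(source):
            hrefs.append(match.group(2))  # group 2 is the bare URL
    return hrefs


# Hypothetical script bodies, shortened for illustration.
scripts = [
    "var c = 6;",
    "el.innerHTML = '<a href=\"http://example.com/a.avi\">Click here</a>';",
]
print(extract_hrefs(scripts))
```

Using finditer instead of search means a script that builds several links yields all of them, not just the first.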

2017-06-16 09:40:38 Szymon

Thanks for your answer, but I get an error: print(re.search(r"<a [^>]*?(href=\"([^\">]+)\")", script.text)[1]) AttributeError: 'NoneType' object has no attribute 'text' – jestembotem

It looks like BS did not find any 'script'. Are you sure you passed the appropriate arguments to the soup.find() function? – Szymon

Now I get this error: print(re.search(r"<a [^>]*?(href=\"([^\">]+)\")", script.text)[1]) TypeError: 'NoneType' object is not subscriptable – jestembotem
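Both errors in these comments come from a None return value. The AttributeError means soup.find() returned None, and the TypeError means re.search() returned None because the text it was given did not contain a matching anchor. One way to guard against both, as a sketch rather than a definitive fix, is to check every <script> tag and test each result before using it. The function name find_download_href and the shortened sample page are hypothetical.

```python
import re
from bs4 import BeautifulSoup

HREF_IN_ANCHOR = re.compile(r"<a [^>]*?(href=\"([^\">]+)\")")


def find_download_href(html):
    """Return the first href embedded in an inline <script>, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Check every <script> instead of relying on a single soup.find() call,
    # which returns None when its filter matches nothing.
    for script in soup.find_all("script"):
        source = script.string  # None for external scripts with no inline body
        if not source:
            continue
        match = HREF_IN_ANCHOR.search(source)
        if match is not None:  # search() returns None when nothing matches
            return match.group(2)
    return None


# Hypothetical, shortened page for illustration.
html = """
<script src="/static/common.js?v7"></script>
<script type="text/javascript">
  document.getElementById('download-link').innerHTML =
    '<a style="text-decoration:none;" href="http://example.com/file.avi">Click here</a>';
</script>
"""

print(find_download_href(html))
print(find_download_href("<p>no scripts here</p>"))
```

If the function still returns None on the real page, it is worth printing the HTML the program actually received, since it may differ from what the browser shows.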
