Scraping data from a tag without a class using BeautifulSoup

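I want to scrape the link from the href attribute of an anchor tag, together with its text, "Horizon Zero Dawn".

The anchor tag has no class of its own, and there are many other anchor tags throughout the page source.

How can I use BeautifulSoup to scrape the data I need?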

<div class="prodName"> 
<a href="/product.php?sku=123;name=Horizon Zero Dawn">Horizon Zero Dawn</a></div>


2017-05-26 Nitanshu

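It doesn't matter that the anchor tag has no class of its own. By finding the parent div first, and then finding an anchor inside it that has an href attribute and the expected text, we can extract both values: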

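from bs4 import BeautifulSoup

page = '<div class="prodName"><a href="/product.php?sku=123;name=Horizon Zero Dawn">Horizon Zero Dawn</a></div>'

# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(page, 'html.parser')

# Find the parent div by its class, then the anchor inside it by href and text.
div = soup.find('div', {'class': 'prodName'})
a = div.find('a', {'href': True}, string='Horizon Zero Dawn')

print(a['href'])
print(a.get_text())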

This prints:

/product.php?sku=123;name=Horizon Zero Dawn 
Horizon Zero Dawn
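
As an alternative to chaining two find calls, the same lookup can probably be written as a single CSS selector. This is only a sketch, assuming the anchor is a direct child of the prodName div as in the snippet from the question; the variable names href and title are chosen here for illustration.

```python
from bs4 import BeautifulSoup

# Same snippet as in the question.
page = '<div class="prodName"><a href="/product.php?sku=123;name=Horizon Zero Dawn">Horizon Zero Dawn</a></div>'

soup = BeautifulSoup(page, 'html.parser')

# 'div.prodName > a' matches an <a> that is a direct child of <div class="prodName">.
# select_one returns the first match, or None if nothing matches.
a = soup.select_one('div.prodName > a')

href = a['href']
title = a.get_text(strip=True)
print(href)
print(title)
```

Because select_one returns None when nothing matches, it is worth checking the result before indexing into it when running this against a real page.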

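Edit:

Updated after the comments. If the page contains multiple such div elements, you need to loop over them and find all the a elements inside each one, like this: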

import requests
from bs4 import BeautifulSoup

url = 'https://in.webuy.com/product.php?scid=1'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Visit every product div, then every anchor inside it.
for div in soup.find_all('div', {'class': 'prodName'}):
    for link in div.find_all('a'):
        print(link.get('href'))
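
The hrefs in the question are relative, so it may help to collect each product name together with an absolute URL. The following is a rough sketch rather than a definitive implementation: extract_products, BASE_URL and sample_html are names introduced here, the second product in the sample is made up, and the base URL is assumed from the address used in the answer. It runs against an inline HTML string so it does not depend on the live page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed base URL, taken from the address used in the answer.
BASE_URL = 'https://in.webuy.com/'

# Stand-in HTML; the second product is hypothetical and only for illustration.
sample_html = '''
<div class="prodName"><a href="/product.php?sku=123;name=Horizon Zero Dawn">Horizon Zero Dawn</a></div>
<div class="prodName"><a href="/product.php?sku=456;name=Example Game">Example Game</a></div>
<a href="/about.php">About</a>
'''


def extract_products(html, base_url):
    """Return (name, absolute_url) pairs for anchors inside prodName divs."""
    soup = BeautifulSoup(html, 'html.parser')
    products = []
    for div in soup.find_all('div', class_='prodName'):
        # href=True skips anchors that have no href attribute.
        for link in div.find_all('a', href=True):
            name = link.get_text(strip=True)
            products.append((name, urljoin(base_url, link['href'])))
    return products


products = extract_products(sample_html, BASE_URL)
for name, url in products:
    print(name, url)
```

Anchors outside a prodName div, such as the About link in the sample, are ignored because the search starts from each matching div.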


2017-05-26 12:05:49 asongtoruin

@V-ZARD I have updated the answer – asongtoruin
