Web scraping with Python and Beautiful Soup

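I'm practicing building web scrapers. The one I'm working on now involves going to a site, scraping the links for the various cities listed on it, then following each city link and scraping all of the links on those pages.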

I'm using the following code:

import requests 

from bs4 import BeautifulSoup 

main_url = "http://www.chapter-living.com/" 

# Getting individual cities url 
re = requests.get(main_url) 
soup = BeautifulSoup(re.text, "html.parser") 
city_tags = soup.find_all('a', class_="nav-title") # Bottom page not loaded dynamically
cities_links = [main_url + tag["href"] for tag in city_tags.find_all("a")] # Links to cities

If I print city_tags I get the HTML I want. However, when I print cities_links I get AttributeError: 'ResultSet' object has no attribute 'find_all'.

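I gather from other questions on here that this error occurs because city_tags returns None, but that can't be the case if it prints the desired HTML, can it? I have noticed that the HTML is wrapped in [] - does that make a difference?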
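To make the distinction in the question concrete, here is a minimal sketch on a made-up HTML snippet (not the real page). It suggests that find_all returns a list-like ResultSet, which is what the square brackets in the printed output indicate, and that an unmatched find_all gives an empty ResultSet rather than None; it is find that returns None when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = (
    '<a class="nav-title" href="/london/">London</a>'
    '<a class="nav-title" href="/paris/">Paris</a>'
)
soup = BeautifulSoup(html, "html.parser")

# find_all always returns a ResultSet, which is a subclass of list.
tags = soup.find_all("a", class_="nav-title")
print(type(tags).__name__, isinstance(tags, list), len(tags))

# With no matches, find_all gives an empty ResultSet; find gives None.
no_match_all = soup.find_all("a", class_="missing")
no_match_one = soup.find("a", class_="missing")
print(no_match_all, no_match_one)
```

So the brackets do make a difference: the object being printed is a list of tags, not a single tag.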


2017-03-16 Maverick

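As the error says, city_tags is a ResultSet, which is a list of nodes, and it has no find_all method. You can either loop through the set and call find_all on each individual node or, in your case, I think you can simply extract the href attribute from each node: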

[tag['href'] for tag in city_tags] 

#['https://www.chapter-living.com/blog/', 
# 'https://www.chapter-living.com/testimonials/', 
# 'https://www.chapter-living.com/events/']
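For the other option mentioned above, calling find_all on each node, a small sketch follows. The HTML and the "city" class are invented for illustration, and the sketch assumes each matched element is a container that wraps the links; on the real page the matched elements are already the a tags, which is why reading href directly is enough.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each "city" container wraps one or more links.
html = """
<div class="city"><a href="/london/">London</a><a href="/london/rooms/">Rooms</a></div>
<div class="city"><a href="/paris/">Paris</a></div>
"""
soup = BeautifulSoup(html, "html.parser")
containers = soup.find_all("div", class_="city")

# find_all exists on each Tag, not on the ResultSet that holds them,
# so iterate over the ResultSet and search inside every element.
nested_links = [a["href"] for div in containers for a in div.find_all("a")]
print(nested_links)
```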


2017-03-16 17:50:47 Psidom

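Well, city_tags is a bs4.element.ResultSet (essentially a list) of tags, and you are calling find_all on it. You probably want to call find_all on every element of the result set or, in this specific case, just retrieve their href attribute: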

import requests 
from bs4 import BeautifulSoup 

main_url = "http://www.chapter-living.com/" 

# Getting individual cities url 
re = requests.get(main_url) 
soup = BeautifulSoup(re.text, "html.parser") 
city_tags = soup.find_all('a', class_="nav-title") # Bottom page not loaded dynamically
cities_links = [main_url + tag["href"] for tag in city_tags] # Links to cities
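One caveat with main_url + tag["href"]: if the href values are already absolute, as the output shown in the earlier answer suggests, plain concatenation would produce malformed URLs. A hedged alternative is urljoin from the standard library's urllib.parse, which handles both relative and absolute values. The href list below is illustrative, not scraped from the site.

```python
from urllib.parse import urljoin

main_url = "http://www.chapter-living.com/"

# Illustrative href values (assumed): one relative, one already absolute.
hrefs = ["/events/", "https://www.chapter-living.com/blog/"]

# urljoin resolves relative paths against main_url and leaves
# absolute URLs unchanged.
cities_links = [urljoin(main_url, href) for href in hrefs]
print(cities_links)
```

In the scraper itself, the same call would replace the concatenation inside the list comprehension, i.e. urljoin(main_url, tag["href"]).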


2017-03-16 17:50:50
