2016-03-15
Scraping the next pages in Python

Suppose I am scraping this URL using BeautifulSoup:

http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha 

The page does not contain all of the data I want to scrape — the rest is on the following pages. How can I scrape the data from all of the next pages? I am using Python 3.5.1 and BeautifulSoup. Note: I cannot use scrapy or lxml because they give me installation errors.

Answer


Determine the last page by extracting the page parameter from the "go to last page" pager element. Then iterate over every page, maintaining a web-scraping session via requests.Session():

import re 

import requests 
from bs4 import BeautifulSoup 

BASE_URL = "http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha" 

with requests.Session() as session: 
    # extract the number of the last page from the "go to last page" pager link 
    response = session.get(BASE_URL) 
    soup = BeautifulSoup(response.content, "html.parser") 
    last_page = int(re.search(r"page=(\d+)", soup.select_one("li.pager-last").a["href"]).group(1)) 

    # loop over every page (the "page" parameter is zero-indexed) 
    for page in range(last_page + 1): 
        response = session.get(BASE_URL + "&page=%d" % page) 
        soup = BeautifulSoup(response.content, "html.parser") 

        # print the title of every search result 
        for result in soup.select("li.search-result"): 
            title = result.find("div", class_="title").get_text(strip=True) 
            print(title) 

This prints:

A C S College of Engineering, Bangalore 
A1 Global Institute of Engineering and Technology, Prakasam 
AAA College of Engineering and Technology, Thiruthangal 
... 
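As a side note, instead of a regular expression you could pull the page number out of the pager link's query string with the standard-library urllib.parse, which is more robust if the URL format shifts. A minimal sketch, assuming the href carries a page query parameter (the helper name and sample href are illustrative):

```python
from urllib.parse import parse_qs, urlparse


def last_page_from_href(href):
    """Extract the integer `page` query parameter from a pager link."""
    query = parse_qs(urlparse(href).query)
    return int(query["page"][0])


# last_page_from_href("/colleges/list?sort_filter=alpha&page=37") -> 37
```

This would replace the re.search(...) line in the answer above, with the same result.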

Thank you, I learned a lot from you. –