2016-03-15
Scraping the next pages in Python

Suppose I am scraping this URL using BeautifulSoup:

http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha 

The page does not contain all of the data I want to scrape — the rest is on the following pages. How can I scrape the data from all of the next pages? I am using Python 3.5.1 and BeautifulSoup. Note: I cannot use scrapy or lxml because they give me installation errors.

Answer


Determine the last page by extracting the page parameter from the "go to last page" pager element. Then iterate over every page, maintaining a web-scraping session via requests.Session():

import re 

import requests 
from bs4 import BeautifulSoup 

BASE_URL = "http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha" 

with requests.Session() as session: 
    # extract the number of the last page from the "go to last page" pager link 
    response = session.get(BASE_URL) 
    soup = BeautifulSoup(response.content, "html.parser") 
    last_page = int(re.search(r"page=(\d+)", soup.select_one("li.pager-last").a["href"]).group(1)) 

    # loop over every page (the "page" parameter is zero-indexed) 
    for page in range(last_page + 1): 
        response = session.get(BASE_URL + "&page=%d" % page) 
        soup = BeautifulSoup(response.content, "html.parser") 

        # print the title of every search result 
        for result in soup.select("li.search-result"): 
            title = result.find("div", class_="title").get_text(strip=True) 
            print(title) 

This prints:

A C S College of Engineering, Bangalore 
A1 Global Institute of Engineering and Technology, Prakasam 
AAA College of Engineering and Technology, Thiruthangal 
... 
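As a side note, instead of a regular expression you could pull the page number out of the pager link's query string with the standard-library urllib.parse, which is more robust if the URL format shifts. A minimal sketch, assuming the href carries a page query parameter (the helper name and sample href are illustrative):

```python
from urllib.parse import parse_qs, urlparse


def last_page_from_href(href):
    """Extract the integer `page` query parameter from a pager link."""
    query = parse_qs(urlparse(href).query)
    return int(query["page"][0])


# last_page_from_href("/colleges/list?sort_filter=alpha&page=37") -> 37
```

This would replace the re.search(...) line in the answer above, with the same result.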

Thank you, I learned a lot from you. –