
How can I make a web scraper iterate through multiple pages of search results using Beautiful Soup?

I have a script I wrote that scrapes a website's search results with Beautiful Soup, and I've managed to isolate the data I want by its class name.

However, the search results are not on a single page. They are spread across multiple pages, and I want all of them. I'd like my script to check whether a next page of results exists and run on it too. Since the number of results varies, I don't know how many pages of results exist, so I can't predefine a range to iterate over. I also tried an 'if_page_exists' check, but if I pass a page number beyond the range of results, the page always exists; it just has no results, only a message saying there are none to display.

However, I noticed that every page of results has a 'Next' link with the id 'NextLink1', and the last page of results does not. So I figure that may be the trick. But I don't know how or where to implement that check; I keep getting infinite loops and the like.
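To illustrate, the check I have in mind would look something like this (the HTML snippets here are made up for illustration; I'm only assuming the real site's 'Next' anchor carries the id 'NextLink1'):

```python
from bs4 import BeautifulSoup

# Two toy pages: one with the 'Next' link, one without.
page_with_next = '<html><a id="NextLink1" href="?page=2">Next</a></html>'
last_page = '<html><p>No more results to display.</p></html>'

def has_next_page(html):
    """Return True if the page contains the 'Next' link (id 'NextLink1')."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("a", id="NextLink1") is not None

print(has_next_page(page_with_next))  # True
print(has_next_page(last_page))       # False
```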

The script below finds results for the search term 'x'. Any help would be appreciated.

from urllib.request import urlopen
from bs4 import BeautifulSoup

#all_letters = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o","p","q","r","s","t","u","v", "w", "x", "y", "z", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
all_letters = ['x']
for letter in all_letters:

    page_number = 1
    url = "https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/Search.aspx?q=nco1%253d2%2526name1%253d" + letter + "&page=" + str(page_number)
    html = urlopen(url)
    bsObj = BeautifulSoup(html, "html.parser")
    nameList = bsObj.findAll("td", {"class": "party-name"})

    for name in nameList:
        print(name.get_text())

Also, does anyone know a shorter way to instantiate a list of alphanumeric characters than the one I commented out in the script above?
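For what it's worth, a shorter equivalent using Python's standard string module would be:

```python
import string

# Lowercase letters 'a'-'z' followed by digits '0'-'9',
# equivalent to the hand-typed list above.
all_letters = list(string.ascii_lowercase + string.digits)
print(len(all_letters))  # 36
```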


So basically you want `if bsObj.find('a', id='NextLink1'): page_number += 1`? –


The problem is that I create and parse my URL before I instantiate bsObj, so I don't see how I can change the URL after I've done that check. –


Please **do not repost questions**: [How can I make a web scraper traverse multiple pages of search results using Beautiful Soup?](http://stackoverflow.com/questions/38364642/how-can-i-make-a-web-scraper-traverse-multiple-pages-of-search-results-using-bea) –

Answer


Try this:

from urllib.request import urlopen
from bs4 import BeautifulSoup


#all_letters = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o","p","q","r","s","t","u","v", "w", "x", "y", "z", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
all_letters = ['x']

def get_url(letter, page_number):
    return "https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/Search.aspx?q=nco1%253d2%2526name1%253d" + letter + "&page=" + str(page_number)

def list_names(soup):
    nameList = soup.findAll("td", {"class": "party-name"})
    for name in nameList:
        print(name.get_text())

def get_soup(letter, page):
    url = get_url(letter, page)
    html = urlopen(url)
    return BeautifulSoup(html, "html.parser")

def main():
    for letter in all_letters:
        bsObj = get_soup(letter, 1)

        # Collect the remaining page numbers from the page-list dropdown,
        # skipping the currently selected option (page 1). The list is
        # reset for each letter so pages don't accumulate across searches.
        pages = []
        sel = bsObj.find('select', {"name": "ctl00$ctl00$InternetApplication_Body$WebApplication_Body$SearchResultPageList1"})
        for opt in sel.findChildren("option", selected=lambda x: x != "selected"):
            pages.append(opt.string)

        list_names(bsObj)

        for page in pages:
            bsObj = get_soup(letter, page)
            list_names(bsObj)

main()

In the main() function, from the first page fetched by get_soup(letter, 1), we find the select element whose options hold all the page numbers and store their values in a list.

Next, we loop over those page numbers to extract the data from the remaining pages.
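If you'd rather use the 'Next' link check from the comments instead of reading the page-list dropdown, the loop could be structured like this sketch. Here the page fetcher and the next-page test are passed in as functions (scrape_all_pages, fetch, and has_next are hypothetical names, and the fake fetcher below stands in for urlopen) so the loop logic can be shown without network access:

```python
def get_url(letter, page_number):
    # Same URL construction as in the answer above.
    return ("https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/"
            "Search.aspx?q=nco1%253d2%2526name1%253d" + letter
            + "&page=" + str(page_number))

def scrape_all_pages(letter, fetch, has_next):
    """Visit page 1, 2, ... until has_next() reports the 'Next' link is gone."""
    page = 1
    urls = []
    while True:
        url = get_url(letter, page)
        urls.append(url)
        html = fetch(url)
        if not has_next(html):
            break
        page += 1
    return urls

# Fake fetcher: pretend pages 1-2 have a 'Next' link and page 3 does not.
fake_pages = {1: "next", 2: "next", 3: "last"}
urls = scrape_all_pages("x",
                        fetch=lambda u: fake_pages[int(u.rsplit("=", 1)[1])],
                        has_next=lambda h: h == "next")
print(len(urls))  # 3
```

In real use, fetch would wrap urlopen and has_next would parse the HTML and look for the 'NextLink1' anchor; the loop terminates on the first page without it, so there is no need to know the page count in advance.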