
How to iterate over links across an unknown total number of pages?

I want to collect every application link on each page, but the total number of pages differs from category to category. I have this code:

import urllib 
from bs4 import BeautifulSoup 

url = 'http://www.brothersoft.com/windows/mp3_audio/' 
pageUrl = urllib.urlopen(url) 
soup = BeautifulSoup(pageUrl) 

# Collect the category links from the left-hand menu.
for a in soup.select('div.coLeft.cate.mBottom dd a[href]'): 
    print 'http://www.brothersoft.com' + a['href'].encode('utf-8', 'replace') 
    suburl = 'http://www.brothersoft.com' + a['href'].encode('utf-8', 'replace') 

    # The page count 27 is hard-coded; that is the problem.
    for page in range(1, 27 + 1): 
        content = urllib.urlopen(suburl + '{}.html'.format(page)) 
        soup = BeautifulSoup(content) 
        for a in soup.select('div.freeText dl a[href]'): 
            print 'http://www.brothersoft.com' + a['href'].encode('utf-8', 'replace') 

But this only gets the application links from 27 pages of each category. What if a category has fewer, or more, than 27 pages?

Answer


You can extract the total number of programs and divide it by 20. For example, if you open the URL http://www.brothersoft.com/windows/photo_image/font_tools/2.html then:

import re 
import urllib 
from bs4 import BeautifulSoup 

url = 'http://www.brothersoft.com/windows/photo_image/font_tools/2.html' 
pageUrl = urllib.urlopen(url) 
soup = BeautifulSoup(pageUrl) 

# The pager div's text contains the total count ("of N"); 20 programs
# are listed per page, so integer-divide and add one for the partial page.
pages = soup.find("div", {"class": "freemenu coLeft Menubox"}) 
page = pages.text 
print int(re.search(r'of (\d+) ', page).group(1)) / 20 + 1 

The output will be:

18 

For the URL http://www.brothersoft.com/windows/photo_image/cad_software/6.html the output will be 108.
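
As a side note, Python 2's / floors the result, so total / 20 + 1 over-counts by one page whenever the total is an exact multiple of 20. Ceiling division sidesteps that edge case; here is a minimal sketch with made-up totals, assuming 20 items per page:

def page_count(total_items, per_page=20):
    # -(-a // b) is ceiling division for positive integers
    return -(-total_items // per_page)

print page_count(350)   # 18 pages (hypothetical total)
print page_count(2160)  # 108 pages; total / 20 + 1 would wrongly give 109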

So you need to open one page per category where you can find how many pages there are, scrape that number, and then you can run your loop. It could look something like this:

import re 
import urllib 
from bs4 import BeautifulSoup 

url = 'http://www.brothersoft.com/windows/photo_image/' 
pageUrl = urllib.urlopen(url) 
soup = BeautifulSoup(pageUrl) 

for a in soup.select('div.coLeft.cate.mBottom dd a[href]'): 
    suburl = 'http://www.brothersoft.com' + a['href'].encode('utf-8', 'replace') 
    print suburl 

    # Open one page of the category to read the total count from the pager.
    content = urllib.urlopen(suburl + '2.html') 
    soup1 = BeautifulSoup(content) 
    pages = soup1.find("div", {"class": "freemenu coLeft Menubox"}) 
    page = pages.text 
    # 20 programs per page: integer-divide and add one for the partial page.
    allPages = int(re.search(r'of (\d+) ', page).group(1)) / 20 + 1 
    print allPages 
    for page in range(1, allPages + 1): 
        content = urllib.urlopen(suburl + '{}.html'.format(page)) 
        soup = BeautifulSoup(content) 
        for a in soup.select('div.freeText dl a[href]'): 
            print 'http://www.brothersoft.com' + a['href'].encode('utf-8', 'replace') 
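
One fragile spot in the above: if a category page lacks the pager div, find() returns None and pages.text raises AttributeError. A guarded variant might look like this (a sketch that falls back to a single page when the count cannot be read):

pages = soup1.find("div", {"class": "freemenu coLeft Menubox"}) 
if pages is None: 
    allPages = 1  # pager div missing: assume a single page 
else: 
    match = re.search(r'of (\d+) ', pages.text) 
    # fall back to one page if the total count is not in the pager text 
    allPages = int(match.group(1)) / 20 + 1 if match else 1 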

Thank you very much @ton1c. This is really useful! –


I just found that the first page is not printed. –


Post your code and which pages are not printed. – ton1c
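
One possible cause, though it is not confirmed in the thread: if the first page of a category is served at the bare category URL rather than at '1.html', then fetching suburl + '1.html' would yield a page with no app links. A hedged sketch of a fallback, assuming that URL scheme:

for page in range(1, allPages + 1): 
    # Hypothetical URL scheme: page 1 may live at the bare category URL. 
    target = suburl if page == 1 else suburl + '{}.html'.format(page) 
    content = urllib.urlopen(target) 
    soup = BeautifulSoup(content) 
    for a in soup.select('div.freeText dl a[href]'): 
        print 'http://www.brothersoft.com' + a['href'].encode('utf-8', 'replace') 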