Scraping data from an unknown number of pages using Beautiful Soup

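I want to parse some information from a website where the data is spread across multiple pages.

The problem is that I don't know how many pages there are. There might be 2, but there could also be 4, or even just a single page.

How can I loop over the pages when I don't know how many there are? I do know the URL pattern, though, which looks something like the one in the code below.

Also, the page names are not plain numbers: page 2 is 'pe2', page 3 is 'pe4', and so on, so I can't simply loop over range(number).

Here is the pseudocode for the loop I'm trying to fix.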

import requests
from bs4 import BeautifulSoup

# page suffixes: the first page has none, later ones go up in steps of 2
pages = ['', 'pe2', 'pe4', 'pe6', 'pe8']

for i in pages:
    url = "http://www.website.com/somecode/dummy?page={}".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    # rest of the scraping code

Source

2017-04-04 Alex T

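Just keep increasing the number until you get a 404 response? – jsbueno

So what else would I have to write besides this? What would it look like? –

Yes, if you get an exception, there is nothing there. –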

You can use a while loop that stops running when it hits an exception.

Code:

from time import sleep

import requests
from bs4 import BeautifulSoup

i = 0
while True:
    try:
        if i == 0:
            # the first page has no 'pe' suffix
            url = "http://www.website.com/somecode/dummy?page"
        else:
            url = "http://www.website.com/somecode/dummy?page=pe{}".format(i)
        r = requests.get(url)
        # requests.get() does not raise on a 404 by itself; this call does
        r.raise_for_status()
        soup = BeautifulSoup(r.content, 'html.parser')

        # print page url
        print(url)

        # rest of the scraping code

        # don't overload the website
        sleep(2)

        # increase page number
        i += 2
    except requests.exceptions.RequestException:
        # stop on the first HTTP or connection error
        break

Output:

http://www.website.com/somecode/dummy?page 
http://www.website.com/somecode/dummy?page=pe2 
http://www.website.com/somecode/dummy?page=pe4 
http://www.website.com/somecode/dummy?page=pe6 
http://www.website.com/somecode/dummy?page=pe8 
... and so on, until an exception is raised.

Source

2017-04-04 14:50:14

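Cool, I think this almost solves my problem, except that the first page's URL has no "pe" in it. The next one is pe2, and each one after that goes up by 2. Any idea how to solve this without creating a huge list of pe* values? –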

@AlexT Check the edited answer. You can do this with an 'if/else' clause in each iteration, while increasing the value of the variable 'i' by '2'. –
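
The if/else idea can also be moved into a small generator, so that no hand-written list of pe* values is needed. This is only a sketch: `page_urls` and `BASE_URL` are hypothetical names, and the base URL is the placeholder used in this thread, not a real site.

```python
from itertools import count, islice

# Placeholder base URL taken from the thread (an assumption, not a real site)
BASE_URL = "http://www.website.com/somecode/dummy?page"


def page_urls(base=BASE_URL, step=2):
    """Yield the first-page URL, then ...=pe2, ...=pe4, and so on without end."""
    yield base  # the first page has no 'pe' suffix
    for n in count(step, step):  # 2, 4, 6, ...
        yield "{}=pe{}".format(base, n)


# The generator is infinite, so take only as many URLs as needed
first_four = list(islice(page_urls(), 4))
print(first_four)
```

Because the generator never ends on its own, the loop that consumes it still has to decide when to stop, for example on the first failed request.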

Hmm, somehow it doesn't stop after the non-existent page. How come? –
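
One likely explanation: `requests.get` returns normally for a 404 response, so an `except` clause is only reached if the status is checked explicitly (for example with `raise_for_status()`). Some sites also answer non-existent pages with status 200 and an empty listing. The sketch below stops on either condition. The names `crawl_until_missing` and `extract_items`, the `.item` selector, and the `max_pages` cap are all assumptions made for illustration; the selector would have to match the real page.

```python
def crawl_until_missing(urls, fetch, extract, max_pages=50):
    """Fetch each URL in turn; stop at the first non-200 status or empty page."""
    collected = []
    for page_index, url in enumerate(urls):
        if page_index >= max_pages:  # safety cap in case the site never signals the end
            break
        response = fetch(url)
        if response.status_code != 200:  # e.g. 404 for a page that does not exist
            break
        items = extract(response.content)
        if not items:  # a 200 response with nothing to scrape
            break
        collected.extend(items)
    return collected


def extract_items(content):
    """Hypothetical extractor: text of every element matching '.item'."""
    # Imported here so the loop above can be tried out without bs4 installed
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(content, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select(".item")]
```

With Requests installed, a call might look like `crawl_until_missing(urls, requests.get, extract_items)`, where `urls` is any iterable of page URLs. Passing `fetch` and `extract` in as arguments keeps the stopping logic separate from the network and parsing code, so it can be checked with fake responses first.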
