Why does this recursion stop?

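I'm new to Python and my code is below. I have a crawler that recurses on newly discovered links. After recursing on the root link, the program seems to stop after printing a few links. It should keep going for a while, but it doesn't. I'm catching and printing exceptions, but the program terminates successfully, so I don't know why it stops.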

from urllib import urlopen
from bs4 import BeautifulSoup

def crawl(url, seen):
    try:
        if any(url in s for s in seen):
            return 0
        html = urlopen(url).read()

        soup = BeautifulSoup(html)
        for tag in soup.findAll('a', href=True):
            str = tag['href']
            if 'http' in str:
                print tag['href']
                seen.append(str)
                print "--------------"
                crawl(str, seen)
    except Exception, e:
        print e
        return 0

def main(): 
    print "$ = " , crawl("http://news.google.ca", []) 


if __name__ == "__main__": 
    main()

Source

2012-07-28 Mike G

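Try logging the HTML you receive for each request. Maybe some sites return blank results because of a missing user agent or other missing HTTP headers? Also, an href may not include the protocol (http or https), which means you will skip it. – Steve 2012-07-28 09:21:03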
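The point about hrefs without a protocol can be sketched with the standard library: `urljoin` resolves a relative href against the page URL before the scheme is checked. This is a Python 3 sketch (the question uses Python 2, where the same functions live in the `urlparse` module). The helper name `absolute_links` and the example URLs are made up for illustration.

```python
from urllib.parse import urljoin, urlparse


def absolute_links(page_url, hrefs):
    """Resolve each href against page_url and keep only http(s) links."""
    links = []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # relative hrefs become full URLs
        if urlparse(absolute).scheme in ("http", "https"):
            links.append(absolute)
    return links


# Hypothetical hrefs as they might appear on a page.
hrefs = ["/world", "story.html", "https://example.org/a", "mailto:me@example.com"]
print(absolute_links("http://example.com/news/", hrefs))
```

With this in place, links such as `/world` are followed instead of being dropped by the `'http' in str` test, while non-web schemes such as `mailto:` are still skipped.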

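try:
    if any(url in s for s in seen):
        return 0

and then

seen.append(str)
print "--------------"
crawl(str, seen)

You append str to seen and then call crawl with str and seen as arguments. The recursive call finds str already in seen and returns 0 straight away, so only the links on the root page are ever printed. The code exits because it was designed that way.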
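The effect can be reproduced without the network. The sketch below (Python 3) keeps the same control flow as the question's `crawl` but replaces `urlopen` with a lookup in a small made-up link graph. `LINKS`, `fetched`, and the page names are hypothetical and only there to record which pages actually get read.

```python
# Hypothetical link graph standing in for real pages: page -> links found on it.
LINKS = {
    "http://root": ["http://a", "http://b"],
    "http://a": ["http://c"],
    "http://b": [],
    "http://c": [],
}

fetched = []  # pages whose links were actually read


def crawl(url, seen):
    # Same check as in the question: bail out if url is already in seen.
    if any(url in s for s in seen):
        return 0
    fetched.append(url)
    for link in LINKS[url]:
        seen.append(link)   # the link is marked as seen here ...
        crawl(link, seen)   # ... so this call returns 0 immediately


crawl("http://root", [])
print(fetched)  # ['http://root']
```

Only the root page is ever read, even though `http://a` links on to `http://c`.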

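A better approach is to crawl one page, add every link found on it to a list of links to crawl, and then keep crawling the links in that list.

In short, you should do a breadth-first crawl rather than a depth-first one.

Something like this should work.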

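from urllib import urlopen
from bs4 import BeautifulSoup

def crawl(url, seen, to_crawl):
    html = urlopen(url).read()
    soup = BeautifulSoup(html)
    seen.append(url)  # mark the page as seen once it has actually been fetched
    for tag in soup.findAll('a', href=True):
        href = tag['href']
        if 'http' in href:
            # test the discovered link, not the page currently being crawled
            if href not in seen and href not in to_crawl:
                to_crawl.append(href)
                print href
                print "--------------"
    if to_crawl:  # stop once there is nothing left to crawl
        # pop(0) takes the oldest link first, which makes the crawl breadth-first
        crawl(to_crawl.pop(0), seen, to_crawl)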

def main(): 
    print "$ = " , crawl("http://news.google.ca", [], []) 


if __name__ == "__main__": 
    main()

You may want to limit the maximum depth, or the maximum number of URLs it will crawl, though.
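One way to add such a limit, sketched in Python 3 with an explicit queue instead of recursion, which also avoids running into Python's recursion limit on long crawls. The `get_links` callable and the `max_pages` parameter are assumptions for illustration. In a real crawler `get_links` would fetch the page and extract its hrefs.

```python
from collections import deque


def crawl_bfs(start_url, get_links, max_pages=100):
    """Breadth-first crawl that stops after max_pages pages.

    get_links is a hypothetical callable: it takes a URL and returns the
    links found on that page.
    """
    # seen is only consulted when a link is queued, never when it is
    # crawled, so marking a link at discovery time is safe here.
    seen = {start_url}
    queue = deque([start_url])   # URLs waiting to be crawled, oldest first
    crawled = []                 # URLs actually crawled, in order

    while queue and len(crawled) < max_pages:
        url = queue.popleft()
        crawled.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return crawled


# Made-up link graph used in place of real HTTP requests.
graph = {"root": ["a", "b"], "a": ["c", "root"], "b": ["c"], "c": []}
print(crawl_bfs("root", lambda url: graph.get(url, []), max_pages=3))
# ['root', 'a', 'b']
```

A depth limit could be added in the same way by queueing `(url, depth)` pairs and skipping links whose depth exceeds a chosen maximum.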

Source

2012-07-28 09:30:32 elssar

for tag in soup.findAll('a', href=True):
    str = tag['href']
    if 'http' in str:
        print tag['href']
        seen.append(str)  # you put the newly found url into *seen*
        print "--------------"
        crawl(str, seen)  # then you try to crawl it

But at the start of crawl:

if any(url in s for s in seen):  # you don't crawl a url that is in *seen*
    return 0

You should append the url when you actually crawl it, not when you discover it.
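A minimal sketch of that change in Python 3, using a made-up in-memory link graph (`LINKS`) instead of real requests so the behaviour is easy to check. It also swaps the substring test `url in s` for exact list membership, which is an assumption about what was intended, and returns a page count that the original code does not compute.

```python
# Hypothetical link graph standing in for real pages: page -> links found on it.
LINKS = {
    "http://root": ["http://a", "http://b"],
    "http://a": ["http://c", "http://root"],
    "http://b": ["http://c"],
    "http://c": [],
}


def crawl(url, seen):
    if url in seen:       # already crawled: nothing to do
        return 0
    seen.append(url)      # mark the page when it is crawled, not when it is found
    count = 1
    for link in LINKS.get(url, []):
        count += crawl(link, seen)
    return count          # number of pages crawled from this call


seen = []
print(crawl("http://root", seen))  # 4
print(seen)
```

Every page is now visited exactly once, and the link from `http://a` back to `http://root` no longer causes a loop.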

Source

2012-07-28 09:31:44 xiaowl
