Python urllib2: keep trying to open a URL until it works

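I need to collect some data from a website; specifically, text for further analysis. Since I am no expert in web scraping, I have only done the first step: fetching the documents I need from the site. The problem is that sometimes I can fetch a document, but sometimes I get a connection timeout error. So I would like a way to keep trying until the site responds. This is what I have: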

import codecs
import urllib2
from bs4 import BeautifulSoup

doc_id = 1
with open("urls.txt") as f:
    for line in f:
        url = line.strip()  # drop the trailing newline
        if not url:
            continue  # skip blank lines
        print url
        html = urllib2.urlopen(url).read()
        soup = BeautifulSoup(html, "html.parser")

        # Save the page text as documentos/<n>.txt
        with codecs.open("documentos/" + str(doc_id) + ".txt", "w", "utf-8-sig") as temp:
            temp.write(soup.get_text())
        doc_id += 1

where urls.txt contains the required URLs. An example URL:

How can I do this? If I only needed 10 documents I could handle it by hand, but I need more than 500, so I cannot do it manually.

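Summary:

Sometimes I can get a document and sometimes I cannot because of a timeout. I want Python to keep trying until it can fetch the document...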


2015-10-20 dpalma

You would do better to structure the code that fetches the site's content as a function. Once you have that function, you can use a retry decorator.
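The answer does not say which retry decorator it means, so here is a minimal hand-rolled sketch of the idea rather than any particular library. The names retry, max_attempts, delay and exceptions are assumptions chosen for illustration.

```python
import functools
import time


def retry(max_attempts=5, delay=2, exceptions=(Exception,)):
    """Decorator factory: call the wrapped function again if it raises.

    Hypothetical helper, not part of urllib2 or any specific package.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: let the error propagate
                    time.sleep(delay)  # wait before trying again
        return wrapper
    return decorator
```

With the download wrapped in a function such as fetch(url), it could be decorated with something like @retry(max_attempts=5, delay=2, exceptions=(urllib2.URLError,)). Capping the number of attempts instead of looping forever keeps one dead URL from stalling the whole run of 500 documents.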


2015-10-20 01:05:37

You can use the timeout parameter of urllib2.urlopen(), as shown in "Handling urllib2's timeout? - Python", together with a retry decorator.
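As a rough illustration of combining the timeout parameter with retries, the sketch below uses a plain loop. The function name fetch_with_retry and its parameters (max_attempts, delay, opener) are assumptions made for this example; the opener argument exists only so the function can be exercised without a network. The Python 3 import fallback is also an addition, since the question itself uses Python 2.

```python
import socket
import time

try:
    import urllib2  # Python 2, as in the question
except ImportError:
    # Python 3 fallback: urllib.request also exposes urlopen and URLError
    import urllib.request as urllib2


def fetch_with_retry(url, timeout=10, max_attempts=5, delay=2, opener=None):
    """Return the body of url, retrying on timeouts and connection errors.

    Hypothetical helper; opener defaults to urllib2.urlopen.
    """
    opener = opener or urllib2.urlopen
    for attempt in range(1, max_attempts + 1):
        try:
            # timeout is in seconds; read() can time out as well
            return opener(url, timeout=timeout).read()
        except (urllib2.URLError, socket.timeout):
            if attempt == max_attempts:
                raise  # give up after the last attempt
            time.sleep(delay)  # short pause before the next attempt
```

In the question's loop, html = fetch_with_retry(url) would stand in for the direct urlopen call. Note that HTTPError is a subclass of URLError, so responses such as 404 are retried too; narrowing the except clause is an option if that is not wanted.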


2015-10-20 01:19:57
