I'm trying to get all of the article links on a page (they're marked with the class 'title may-blank'). I'm trying to figure out why the code below, when run, returns a bunch of 'href=' strings instead of the actual URLs. After the 25 failed article URLs (all 'href='), I also get a bunch of random text and links, and I don't know why that happens, since the loop should stop once it can no longer find the class 'title may-blank'. Can you help me figure out what's going wrong with generating the URLs in Python?
import urllib2

def get_page(page):
    response = urllib2.urlopen(page)
    html = response.read()
    p = str(html)
    return p

def get_next_target(page):
    start_link = page.find('title may-blank')
    start_quote = page.find('"', start_link + 4)
    end_quote = page.find('"', start_quote + 1)
    aurl = page[start_quote+1:end_quote]  # Gets Article URL
    return aurl, end_quote

def print_all_links(page):
    while True:
        aurl, endpos = get_next_target(page)
        if aurl:
            print("%s" % (aurl))
            print("")
            page = page[endpos:]
        else:
            break

reddit_url = 'http://www.reddit.com/r/worldnews'
print_all_links(get_page(reddit_url))
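One way to see why every result comes back as 'href=': on the live page the class attribute contains more than just `title may-blank` (extra classes or trailing whitespace), so the first `"` found after the match is the *closing* quote of `class="..."`, and the next `"` is the *opening* quote of `href="..."`. The slice between them is literally ` href=`. The trailing garbage happens because `str.find` returns -1 once there are no more matches, and a slice built from -1 still yields a non-empty string, so the `if aurl:` check never triggers `break` cleanly. A minimal sketch reproducing the failure on a made-up snippet (the URL is hypothetical):

```python
# Made-up markup resembling a Reddit article link; note the class
# attribute contains something after 'may-blank' (here, a space).
html = '<a class="title may-blank " href="http://example.com/article">'

start_link = html.find('title may-blank')     # start of the class text
start_quote = html.find('"', start_link + 4)  # CLOSING quote of class="..."
end_quote = html.find('"', start_quote + 1)   # OPENING quote of href="..."
print(repr(html[start_quote + 1:end_quote]))  # -> ' href='
```

A direct fix within this string-searching approach would be to search for the literal substring `href="` after the class match, and to check for -1 from every `find` before slicing.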
Why not use something like BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) to scrape the links? – tttthomasssss 2014-09-02 08:05:19
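A sketch of the commenter's suggestion, using BeautifulSoup (the `bs4` package) to pull the `href` of every `<a>` carrying both the `title` and `may-blank` classes. A CSS selector via `select` matches elements with both classes regardless of order or extra classes, which sidesteps the quote-counting problem entirely. The HTML snippet and URLs below are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two matching links, one that lacks 'may-blank',
# and one that carries an extra class on top of the two we want.
html = '''
<a class="title may-blank" href="http://example.com/one">First</a>
<a class="title" href="http://example.com/skip">Not matched</a>
<a class="title may-blank loggedin" href="http://example.com/two">Second</a>
'''

soup = BeautifulSoup(html, 'html.parser')

# 'a.title.may-blank' selects <a> tags whose class list includes BOTH
# classes, even when extra classes (like 'loggedin') are present.
links = [a['href'] for a in soup.select('a.title.may-blank')]
print(links)  # -> ['http://example.com/one', 'http://example.com/two']
```

For the real page you would feed `get_page(reddit_url)`'s result to `BeautifulSoup` instead of the inline snippet.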