Generating URLs in Python?

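I'm trying to get all the links to the articles (which happen to carry the class 'title may-blank' to mark them). I'm trying to figure out why the code below prints a whole bunch of "href=" when I run it, instead of returning the actual URLs. After the 25 failed article URLs (all 'href='), I also get a pile of random text and links, and I don't know why that happens, since the loop should stop once it can no longer find the class 'title may-blank'. Can you help me find out what's wrong?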

import urllib2 

def get_page(page): 

    response = urllib2.urlopen(page) 
    html = response.read() 
    p = str(html) 
    return p 

def get_next_target(page): 
    start_link = page.find('title may-blank') 
    start_quote = page.find('"', start_link + 4) 
    end_quote = page.find ('"', start_quote + 1) 
    aurl = page[start_quote+1:end_quote] # Gets Article URL 
    return aurl, end_quote 

def print_all_links(page):
    while True:
        aurl, endpos = get_next_target(page)
        if aurl:
            print("%s" % (aurl))
            print("")
            page = page[endpos:]
        else:
            break

reddit_url = 'http://www.reddit.com/r/worldnews' 

print_all_links(get_page(reddit_url))

Source

2014-09-02 Phillipe Dongwoo Han

Why not use something like BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) to scrape the links? – tttthomasssss 2014-09-02 08:05:19

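Rawing is correct, but when I'm faced with an XY problem I prefer to offer the best way to accomplish X rather than a way to fix Y. You should use an HTML parser such as BeautifulSoup to parse web pages: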

from bs4 import BeautifulSoup 
import urllib2 

def print_all_links(page):
    html = urllib2.urlopen(page).read()
    soup = BeautifulSoup(html)
    for a in soup.find_all('a', 'title may-blank ', href=True):
        print(a['href'])
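If installing BeautifulSoup is not an option, roughly the same idea can be sketched with the standard library's `html.parser` on Python 3. This swaps in a different parser from the one recommended above, and the names `TitleLinkParser` and `extract_title_links` are hypothetical, introduced only for illustration. It assumes the article links are `<a>` tags whose class attribute contains both `title` and `may-blank`:

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect href values of <a> tags that carry both target classes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attr_map = dict(attrs)
        # Split on whitespace so a trailing space in the attribute is harmless.
        classes = (attr_map.get('class') or '').split()
        href = attr_map.get('href')
        if href and 'title' in classes and 'may-blank' in classes:
            self.links.append(href)


def extract_title_links(html):
    """Return the hrefs of all matching links found in an HTML string."""
    parser = TitleLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

Because the class attribute is split into individual names, the match does not depend on the exact spacing inside the quotes.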

If you are really allergic to HTML parsers, at least use a regular expression (even though you should stick with HTML parsing):

import urllib2 
import re 

def print_all_links(page):
    html = urllib2.urlopen(page).read()
    for href in re.findall(r'<a class="title may-blank " href="(.*?)"', html):
        print(href)
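On Python 3 the response body comes back as `bytes`, so a `str` pattern like the one above would need decoded text first. The following is a minimal sketch of that adjustment, not the answerer's code: the helper name `extract_links` is hypothetical, the pattern assumes the class attribute appears before `href` inside the same tag, and UTF-8 is an assumed encoding.

```python
import re

# Assumption: class comes before href within the same <a> tag.
LINK_RE = re.compile(
    r'<a\s+class="title may-blank\s*"[^>]*?\shref="([^"]*)"'
)


def extract_links(html):
    """Return the href of every matching <a> tag in html (str or bytes)."""
    if isinstance(html, bytes):
        # Assumed encoding; undecodable bytes are replaced rather than raising.
        html = html.decode('utf-8', errors='replace')
    return LINK_RE.findall(html)
```

The `\s*` before the closing quote tolerates the trailing space in the class attribute whether or not it is present, which the literal pattern does not.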

Source

2014-09-02 08:09:34

Thanks! I'll give it a try, I didn't know it existed. – 2014-09-03 00:37:37

This is because the line

start_quote = page.find('"', start_link + 4)

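doesn't do what you think it does. start_link is set to the index of 'title may-blank'. So if you call page.find starting at start_link + 4, you actually start searching at 'e may-blank', and the first quote found is the one that closes the class attribute. If you change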

start_quote = page.find('"', start_link + 4)

to

start_quote = page.find('"', start_link + len('title may-blank') + 1)

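it should work, provided the closing quote of the class attribute directly follows title may-blank in the page source.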
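To see the offset problem on a small input, here is a minimal Python 3 sketch. The sample markup and the names `get_next_target_fixed` and `collect_links` are assumptions made for illustration, not the real reddit page or the asker's code. Instead of counting characters it searches for `href="` after the class marker, and it stops as soon as `str.find` returns -1. The missing -1 check is a plausible cause of the stray text after the last article, since the original loop keeps slicing even when the marker is no longer found.

```python
MARKER = 'title may-blank'
HREF = 'href="'


def get_next_target_fixed(page):
    """Return (url, end_index) for the next marked link, or (None, -1)."""
    start_link = page.find(MARKER)
    if start_link == -1:  # str.find returns -1 when nothing matches
        return None, -1
    # Look for the href attribute itself rather than guessing quote offsets.
    start_href = page.find(HREF, start_link + len(MARKER))
    if start_href == -1:
        return None, -1
    start_url = start_href + len(HREF)
    end_quote = page.find('"', start_url)
    if end_quote == -1:
        return None, -1
    return page[start_url:end_quote], end_quote


def collect_links(page):
    """Gather every marked URL, stopping once the marker is gone."""
    links = []
    while True:
        url, endpos = get_next_target_fixed(page)
        if url is None:
            break
        links.append(url)
        page = page[endpos:]
    return links


# Hypothetical sample markup, with a trailing space inside the class attribute.
SAMPLE = (
    '<a class="title may-blank " href="http://example.com/a">A</a>'
    '<a class="title may-blank " href="http://example.com/b">B</a>'
    '<p class="footer">"unrelated quoted text"</p>'
)

# The original offset logic, reproduced to show what it slices out.
start = SAMPLE.find(MARKER)
first_quote = SAMPLE.find('"', start + 4)
second_quote = SAMPLE.find('"', first_quote + 1)
original_result = SAMPLE[first_quote + 1:second_quote]
```

On this sample, `original_result` is the text between the quote that closes the class attribute and the quote that opens the URL, which is where the repeated "href=" output comes from.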

Source

2014-09-02 08:13:17

Great, thanks for the correction. – 2014-09-03 00:37:56
