Extract all links from a web page

-3

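I want to write a function that takes the URL of a web page, downloads the page, and returns a list of the URLs found on that page (using the urllib module). Any help would be much appreciated.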

2011-05-01 matt

What do you have so far? What specific problem are you having? – Mat 2011-05-01 11:15:29

What is so bad about this question? – 2011-05-01 11:19:08

We won't do your homework for you. – 2011-05-01 11:29:17

Here you go:

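# Python 2 (uses urllib2 and the print statement).
import sys
import urllib2
import lxml.html

# Take the URL to scrape from the first command-line argument.
try:
    url = sys.argv[1]
except IndexError:
    print "Specify a url to scrape"
    sys.exit(1)

# Require an explicit scheme so urllib2 can open the URL.
if not url.startswith(("http://", "https://")):
    print "Please include the http:// or https:// at the beginning of the url"
    sys.exit(1)

# Download the page and parse it into an element tree.
html = urllib2.urlopen(url).read()
etree = lxml.html.fromstring(html)

# Print the href attribute of every <a> element.
for href in etree.xpath("//a/@href"):
    print href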

 
C:\Programming>getlinks.py http://example.com 
/
/domains/ 
/numbers/ 
/protocols/ 
/about/ 
/go/rfc2606 
/about/ 
/about/presentations/ 
/about/performance/ 
/reports/ 
/domains/ 
/domains/root/ 
/domains/int/ 
/domains/arpa/ 
/domains/idn-tables/ 
/protocols/ 
/numbers/ 
/abuse/ 
http://www.icann.org/ 
mailto:[email protected]?subject=General%20website%20feedback
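
Most of the links in the output above are relative to the site root. If absolute URLs are needed, they can be resolved against the page URL. The following is a minimal sketch in Python 3, not part of the original script; the helper name `absolutize` is made up for illustration, and `urljoin` comes from the standard `urllib.parse` module.

```python
from urllib.parse import urljoin


def absolutize(base_url, hrefs):
    """Resolve each href against base_url (hypothetical helper).

    Links that are already absolute, or that use another scheme,
    are returned unchanged by urljoin.
    """
    return [urljoin(base_url, href) for href in hrefs]


# A few hrefs taken from the output above.
links = absolutize(
    "http://example.com",
    ["/", "/domains/", "http://www.icann.org/"],
)
for link in links:
    print(link)
```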


2011-05-01 11:34:50 Acorn

+1 for lxml, which is meant for exactly this. – 2011-05-01 11:48:47

I have to use the urllib module. – matt 2011-05-01 12:47:24

I edited the script so that it uses urllib2 alone to download the page. – Acorn 2011-05-01 12:57:53
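
If third-party packages such as lxml are not allowed at all, something along the following lines might work with the standard library only. This is a rough sketch assuming Python 3, where urllib2 has become `urllib.request`; the names `LinkCollector`, `extract_links`, and `get_links` are invented for this example, and the fallback to UTF-8 when the server declares no charset is an assumption.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Return the list of hrefs found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def get_links(url):
    """Download url and return the hrefs of its <a> tags."""
    with urlopen(url) as response:
        # Assumption: fall back to UTF-8 if no charset is declared.
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return extract_links(html)
```

Calling `get_links("http://example.com")` would then return the links as a list instead of printing them, which is closer to what the question asks for. The hrefs are returned exactly as they appear in the page, so relative links stay relative.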
