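How do I find and extract links from a web page?

I have a website, e.g. http://site.com. How do I find and extract links from one of its pages?

I want to fetch the home page and extract only the links that match a regular expression, e.g. .*somepage.*

The links in the HTML code can take any of these forms: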

<a href="http://site.com/my-somepage">url</a> 
<a href="/my-somepage.html">url</a> 
<a href="my-somepage.htm">url</a>

I need the output in this format:

http://site.com/my-somepage 
http://site.com/my-somepage.html 
http://site.com/my-somepage.htm

The output URLs must always include the domain name.

What is a quick Python solution?

Source

2013-03-19 Alex

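What have you tried, and what didn't work? StackOverflow is not a code-writing service, but if you make an attempt at the problem first, we will help you. – 2013-03-19 04:15:54

Take a look at an HTML parsing module, such as BeautifulSoup. (I would post a link, but I'm on my phone, sorry) – TerryA 2013-03-19 04:24:20

You could use lxml.html: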

from lxml import html 

url = "http://site.com" 
doc = html.parse(url).getroot() # download & parse webpage 
doc.make_links_absolute(url) 
for element, attribute, link, _ in doc.iterlinks():
    if (attribute == 'href' and element.tag == 'a' and
            'somepage' in link):  # or e.g., re.search('somepage', link)
        print(link)

Or use beautifulsoup4:

import re 
try: 
    from urllib2 import urlopen 
    from urlparse import urljoin 
except ImportError: # Python 3 
    from urllib.parse import urljoin 
    from urllib.request import urlopen 

from bs4 import BeautifulSoup, SoupStrainer # pip install beautifulsoup4 

url = "http://site.com" 
only_links = SoupStrainer('a', href=re.compile('somepage')) 
soup = BeautifulSoup(urlopen(url), 'html.parser', parse_only=only_links)  # name the parser explicitly
urls = [urljoin(url, a['href']) for a in soup(only_links)] 
print("\n".join(urls))
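
If installing a third-party package is not an option, the same filter-then-join idea can be sketched with the standard library alone (Python 3). This swaps in `html.parser` for lxml/BeautifulSoup and is not part of the answer above. The class name `LinkCollector` is made up for illustration, and the sample markup is the question's three links plus one extra non-matching link added here to show the filter working:

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect absolute URLs of <a href> links
    whose href matches a regular expression."""

    def __init__(self, base_url, pattern):
        super().__init__()
        self.base_url = base_url
        self.pattern = re.compile(pattern)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # href is None when the attribute is missing or has no value
        if href and self.pattern.search(href):
            self.urls.append(urljoin(self.base_url, href))


# Sample markup from the question (plus one non-matching link),
# used here instead of downloading a live page.
sample = '''<a href="http://site.com/my-somepage">url</a>
<a href="/my-somepage.html">url</a>
<a href="my-somepage.htm">url</a>
<a href="/other.html">url</a>'''

collector = LinkCollector("http://site.com", "somepage")
collector.feed(sample)
print("\n".join(collector.urls))
```

For a real page, the string passed to `feed()` would be the decoded response body, e.g. from `urlopen(url).read()`. The built-in parser is less forgiving of broken markup than lxml, so treat this as a fallback rather than a replacement.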

Source

2013-03-19 07:26:48 jfs

Use an HTML parsing module, such as BeautifulSoup.
Some code (only a partial solution):

from bs4 import BeautifulSoup 
import re 

html = '''<a href="http://site.com/my-somepage">url</a> 
<a href="/my-somepage.html">url</a> 
<a href="my-somepage.htm">url</a>''' 
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
links = soup.find_all('a', {'href': re.compile('.*somepage.*')})
for link in links:
    print(link['href'])  # print() form works on Python 2 and 3

Output:

http://site.com/my-somepage 
/my-somepage.html 
my-somepage.htm

From this data you should be able to get the format you need...
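
One way to finish the job, as a minimal sketch: resolve each extracted `href` against the site's base URL with `urljoin` from the standard library. The `base` and `hrefs` values below are simply the ones from the question and the output above, hard-coded for illustration:

```python
from urllib.parse import urljoin  # on Python 2: from urlparse import urljoin

base = "http://site.com"
hrefs = ["http://site.com/my-somepage", "/my-somepage.html", "my-somepage.htm"]

# urljoin leaves absolute URLs alone and resolves relative ones against base
absolute = [urljoin(base, href) for href in hrefs]
print("\n".join(absolute))
```

This should print the three URLs in the form the question asks for, each including the domain name.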

Source

2013-03-19 07:06:18 pradyunsg

Scrapy is the easiest way to do what you want. It actually has a link extraction mechanism built in.

Let me know if you need help writing a spider to crawl the links.

See also:

How do I use the Python Scrapy module to list all the URLs from my website?

Source

2013-03-19 07:30:12 alecxe
