
I need to write a web crawler in Python. I don't know how to parse a page and extract the URLs from the HTML. Where should I go to learn how to write such a program?

In other words, is there a simple Python program that could serve as a template for a generic web crawler? Ideally it would use relatively simple modules and include plenty of comments describing what each line of code does.

Answers


Take a look at the sample code below. The script fetches the HTML of a web page (here, the Python home page) and extracts all the links on that page. Hope this helps.

#!/usr/bin/env python

import requests
from bs4 import BeautifulSoup

url = "http://www.python.org"
response = requests.get(url)
# parse the HTML and normalize it back to a single string
page = str(BeautifulSoup(response.content, "html.parser"))


def getURL(page):
    """Find the next link in the page.

    :param page: HTML of a web page (here: the Python home page)
    :return: the next URL in that page and the position to resume
        scanning from, or (None, 0) when no link remains
    """
    start_link = page.find("a href")
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote


while True:
    url, n = getURL(page)
    page = page[n:]
    if url:
        print(url)
    else:
        break

Output:

/ 
#left-hand-navigation 
#content-body 
/search 
/about/ 
/news/ 
/doc/ 
/download/ 
/getit/ 
/community/ 
/psf/ 
http://docs.python.org/devguide/ 
/about/help/ 
http://pypi.python.org/pypi 
/download/releases/2.7.3/ 
http://docs.python.org/2/ 
/ftp/python/2.7.3/python-2.7.3.msi 
/ftp/python/2.7.3/Python-2.7.3.tar.bz2 
/download/releases/3.3.0/ 
http://docs.python.org/3/ 
/ftp/python/3.3.0/python-3.3.0.msi 
/ftp/python/3.3.0/Python-3.3.0.tar.bz2 
/community/jobs/ 
/community/merchandise/ 
/psf/donations/ 
http://wiki.python.org/moin/Languages 
http://wiki.python.org/moin/Languages 
http://www.google.com/calendar/ical/b6v58qvojllt0i6ql654r1vh00%40group.calendar.google.com/public/basic.ics 
http://www.google.com/calendar/ical/j7gov1cmnqr9tvg14k621j7t5c%40group.calendar.google.com/public/basic.ics 
http://pycon.org/#calendar 
http://www.google.com/calendar/ical/3haig2m9msslkpf2tn1h56nn9g%40group.calendar.google.com/public/basic.ics 
http://pycon.org/#calendar 
http://www.psfmember.org 

...
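
Note that the string scanning above only catches double-quoted href attributes; since the script already imports BeautifulSoup, the parser itself can produce the same link list more robustly. A minimal sketch using the same requests/bs4 setup:

import requests
from bs4 import BeautifulSoup

response = requests.get("http://www.python.org")
soup = BeautifulSoup(response.content, "html.parser")

# let the parser locate the <a> tags instead of scanning the raw string
for a in soup.find_all('a', href=True):
    print(a['href'])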


You can use BeautifulSoup to extract URLs from HTML. Go through its documentation and see what matches your requirements; the documentation also includes code snippets for extracting URLs.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")

soup.find_all('a')  # finds all <a> tags in the HTML document
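
To get from the tags to the URLs themselves, read each tag's href attribute; a short sketch, assuming html_doc holds the page's HTML as above:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")

# keep only the href values, skipping <a> tags that have none
urls = [a.get('href') for a in soup.find_all('a') if a.get('href')]
print(urls)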
import re
import urllib.parse
import urllib.request

tocrawl = set(["http://www.facebook.com/"])
crawled = set([])
keywordregex = re.compile(r'<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
linkregex = re.compile(r'<a\s+href=[\'"](.*?)[\'"].*?>')

while True:
    try:
        crawling = tocrawl.pop()
        print(crawling)
    except KeyError:
        break  # nothing left to crawl
    url = urllib.parse.urlparse(crawling)
    try:
        response = urllib.request.urlopen(crawling)
    except Exception:
        continue  # skip pages that fail to load
    msg = response.read().decode('utf-8', errors='replace')
    # print the page title, if any
    startPos = msg.find('<title>')
    if startPos != -1:
        endPos = msg.find('</title>', startPos + 7)
        if endPos != -1:
            title = msg[startPos + 7:endPos]
            print(title)
    # print the meta keywords, if any
    keywordlist = keywordregex.findall(msg)
    if len(keywordlist) > 0:
        keywordlist = keywordlist[0].split(", ")
        print(keywordlist)
    links = linkregex.findall(msg)
    crawled.add(crawling)
    for link in links:
        # turn relative links into absolute ones
        if link.startswith('/'):
            link = 'http://' + url.netloc + link
        elif link.startswith('#'):
            link = 'http://' + url.netloc + url.path + link
        elif not link.startswith('http'):
            link = 'http://' + url.netloc + '/' + link
        if link not in crawled:
            tocrawl.add(link)

Reference: Python Web Crawler in Less Than 50 Lines (slow, or no longer working; it doesn't load for me)
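
One caveat: the manual prefixing in the loop above mishandles parent-relative paths (../), protocol-relative URLs, and query strings; the standard library's urllib.parse.urljoin covers those cases. A sketch of the same resolution step, assuming crawling and links as in the script above:

from urllib.parse import urljoin

# resolve each extracted link against the page it was found on;
# urljoin handles absolute, relative and fragment-only links uniformly
for link in links:
    absolute = urljoin(crawling, link)
    if absolute not in crawled:
        tocrawl.add(absolute)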


You can use BeautifulSoup, as many have also pointed out. It can parse HTML, XML, and more. To see some of its features, see here.

Example:

import urllib.request
from bs4 import BeautifulSoup

url = 'http://www.google.co.in/'

conn = urllib.request.urlopen(url)
html = conn.read()

soup = BeautifulSoup(html, "html.parser")
links = soup.find_all('a')

for tag in links:
    link = tag.get('href', None)
    if link is not None:
        print(link)
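
The hrefs printed this way include duplicates and in-page anchors; a small follow-up sketch (same soup object as above) that filters out both:

seen = set()
for tag in soup.find_all('a'):
    link = tag.get('href', None)
    # skip missing hrefs, fragment-only anchors, and repeats
    if link is None or link.startswith('#') or link in seen:
        continue
    seen.add(link)
    print(link)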