
I am trying to run a web crawler example program from netinstructions.com, but it is not working. I run the program with:

spider("http://www.netinstructions.com/", "python", 50) 

but it always returns

1 Visiting: http://www.netinstructions.com 
Word never found 

no matter what URL I pass in. The code of the program is below:

from html.parser import HTMLParser
from urllib.request import urlopen
from urllib import parse

# We are going to create a class called LinkParser that inherits some
# methods from HTMLParser which is why it is passed into the definition
class LinkParser(HTMLParser):

    # This is a function that HTMLParser normally has
    # but we are adding some functionality to it
    def handle_starttag(self, tag, attrs):
        # We are looking for the beginning of a link. Links normally look
        # like <a href="www.someurl.com"></a>
        if tag == 'a':
            for (key, value) in attrs:
                if key == 'href':
                    # We are grabbing the new URL. We are also adding the
                    # base URL to it. For example:
                    # www.netinstructions.com is the base and
                    # somepage.html is the new URL (a relative URL)
                    #
                    # We combine a relative URL with the base URL to create
                    # an absolute URL like:
                    # www.netinstructions.com/somepage.html
                    newUrl = parse.urljoin(self.baseUrl, value)
                    # And add it to our collection of links:
                    self.links = self.links + [newUrl]

    # This is a new function that we are creating to get links
    # that our spider() function will call
    def getLinks(self, url):
        self.links = []
        # Remember the base URL which will be important when creating
        # absolute URLs
        self.baseUrl = url
        # Use the urlopen function from the standard Python 3 library
        response = urlopen(url)
        # Make sure that we are looking at HTML and not other things that
        # are floating around on the internet (such as
        # JavaScript files, CSS, or .PDFs for example)
        if response.getheader('Content-Type') == 'text/html':
            htmlBytes = response.read()
            # Note that feed() handles Strings well, but not bytes
            # (A change from Python 2.x to Python 3.x)
            htmlString = htmlBytes.decode("utf-8")
            self.feed(htmlString)
            return htmlString, self.links
        else:
            return "", []

# And finally here is our spider. It takes in a URL, a word to find,
# and the number of pages to search through before giving up
def spider(url, word, maxPages):
    pagesToVisit = [url]
    numberVisited = 0
    foundWord = False
    # The main loop. Create a LinkParser and get all the links on the page.
    # Also search the page for the word or string
    # In our getLinks function we return the web page
    # (this is useful for searching for the word)
    # and we return a set of links from that web page
    # (this is useful for where to go next)
    while numberVisited < maxPages and pagesToVisit != [] and not foundWord:
        numberVisited = numberVisited + 1
        # Start from the beginning of our collection of pages to visit:
        url = pagesToVisit[0]
        pagesToVisit = pagesToVisit[1:]
        try:
            print(numberVisited, "Visiting:", url)
            parser = LinkParser()
            data, links = parser.getLinks(url)
            if data.find(word) > -1:
                foundWord = True
                # Add the pages that we visited to the end of our collection
                # of pages to visit:
                pagesToVisit = pagesToVisit + links
                print(" **Success!**")
        except:
            print(" **Failed!**")
    if foundWord:
        print("The word", word, "was found at", url)
    else:
        print("Word never found")

Does anyone know what is going on? I am using Python 3.5 (32-bit) and running on Windows 10.


Run for the hills: any tutorial that uses a blanket except is not one I would recommend. The error is obvious if you use 'except Exception as e: print(e)', namely "'LinkParser' object has no attribute 'getLinks'", although that error is your own doing, since 'def getLinks(self, url):' should be inside the class. I would suggest you look at requests and BeautifulSoup if you want nicer web scraping –
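As an illustrative aside, here is a minimal sketch of the same link collection done with requests and BeautifulSoup, the two libraries the comment recommends. This is not the tutorial's code: the get_links name is made up for the example, both packages have to be installed separately (pip install requests beautifulsoup4), and the starting URL is simply the one from the question.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def get_links(url):
    # Fetch the page; requests raises a clear exception on HTTP errors
    response = requests.get(url)
    response.raise_for_status()
    # Skip anything that is not HTML (the header often carries a charset suffix)
    if 'text/html' not in response.headers.get('Content-Type', ''):
        return "", []
    soup = BeautifulSoup(response.text, 'html.parser')
    # Build absolute URLs from every <a href="..."> tag on the page
    links = [urljoin(url, a['href']) for a in soup.find_all('a', href=True)]
    return response.text, links

html, links = get_links("http://www.netinstructions.com/")
print(len(links), "links found")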


I forgot to indent the code, but I have fixed that now. – timgrindall

Answer


response.getheader('Content-Type') returns 'text/html; charset=utf-8', which is not equal to 'text/html', so you never get any links at all. You can check whether 'text/html' is contained in the string instead:

def getLinks(self, url):
    self.links = []
    # Remember the base URL which will be important when creating
    # absolute URLs
    self.baseUrl = url
    # Use the urlopen function from the standard Python 3 library
    response = urlopen(url)
    # Make sure that we are looking at HTML and not other things that
    # are floating around on the internet (such as
    # JavaScript files, CSS, or .PDFs for example)
    if 'text/html' in response.getheader('Content-Type'):
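To see why the original equality test fails, a quick check like the following can be run. The exact header string depends on the server, so the printed values are only what you would typically expect for this site:

from urllib.request import urlopen

response = urlopen("http://www.netinstructions.com/")
contentType = response.getheader('Content-Type')  # e.g. 'text/html; charset=utf-8'
print(contentType == 'text/html')   # False whenever a charset parameter is present
print('text/html' in contentType)   # True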

Also, pagesToVisit = pagesToVisit + links should be outside the if, otherwise you only add links when find() returns something != -1. Make the following changes and your code will run:

def getLinks(self, url):
    self.links = []
    # Remember the base URL which will be important when creating
    # absolute URLs
    self.baseUrl = url
    # Use the urlopen function from the standard Python 3 library
    response = urlopen(url)
    # Make sure that we are looking at HTML and not other things that
    # are floating around on the internet (such as
    # JavaScript files, CSS, or .PDFs for example)
    print(response.getheader('Content-Type'))
    if 'text/html' in response.getheader('Content-Type'):
        htmlBytes = response.read()
        # Note that feed() handles Strings well, but not bytes
        # (A change from Python 2.x to Python 3.x)
        htmlString = htmlBytes.decode("utf-8")
        self.feed(htmlString)
        return htmlString, self.links
    return "", []

# And finally here is our spider. It takes in a URL, a word to find,
# and the number of pages to search through before giving up
def spider(url, word, maxPages):
    pagesToVisit = [url]
    foundWord = False
    # The main loop. Create a LinkParser and get all the links on the page.
    # Also search the page for the word or string
    # In our getLinks function we return the web page
    # (this is useful for searching for the word)
    # and we return a set of links from that web page
    # (this is useful for where to go next)
    for ind, url in enumerate(pagesToVisit, 1):
        if ind >= maxPages or foundWord:
            break
        try:
            print(ind, "Visiting:", url)
            parser = LinkParser()
            data, links = parser.getLinks(url)
            if data.find(word) > -1:
                foundWord = True
                print(" **Success!**")
            # Add the pages that we found to the end of our collection
            # of pages to visit:
            pagesToVisit.extend(links)
        except Exception as e:
            print(" **Failed!**")
    if foundWord:
        print("The word", word, "was found at", url)
    else:
        print("Word never found")

spider("http://www.netinstructions.com/", "python", 50)

Congrats on 100k :) – Winterflags


@Winterflags, cheers :) –