
My last question is here: last question (Scrapy is not crawling).

By now I have done my best to rethink and improve the structure of my spider. However, for some reason it still won't start crawling.

I have also checked the XPaths, and they work (in the Chrome console).

I join the URL with the href, because the href only ever returns the parameters. I attached a sample link format in my last question. (I want to keep this post from getting too long.)

My spider:

class kmssSpider(scrapy.Spider):
    name = 'kmss'
    start_url = 'https://kmssqkr.hksarg/LotusQuickr/dept/Main.nsf/h_RoomHome/ade682e34fc59d274825770b0037d278/?OpenDocument#{unid=ADE682E34FC59D274825770B0037D278}'
    login_page = 'https://kmssqkr.hksarg/LotusQuickr/dept/Main.nsf?OpenDatabase&Login'
    allowed_domain = ["kmssqkr.hksarg"]

    def start_requests(self):
        yield Request(url=self.login_page, callback=self.login, dont_filter=True)

    def login(self, response):
        return FormRequest.from_response(response,
                                         formdata={'user': 'usename', 'password': 'pw'},
                                         callback=self.check_login_response)

    def check_login_response(self, response):
        if 'Welcome' in response.body:
            self.log("\n\n\n\n Successfuly Logged in \n\n\n ")
            yield Request(url=self.start_url,
                          cookies={'LtpaToken2': 'jHxHvqs+NeT...'})
        else:
            self.log("\n\n You are not logged in \n\n ")

    def parse(self, response):
        listattheleft = response.xpath("*//*[@class='qlist']/li[not(contains(@role,'menuitem'))]")
        anyfolder = response.xpath("*//*[@class='q-folderItem']/h4")
        anyfile = response.xpath("*//*[@class='q-otherItem']/h4")
        for each_tab in listattheleft:
            item = CrawlkmssItem()
            item['url'] = each_tab.xpath('a/@href').extract()
            item['title'] = each_tab.xpath('a/text()').extract()
            yield item

            if 'unid' not in each_tab.xpath('./a').extract():
                parameter = each_tab.xpath('a/@href').extract()
                locatetheroom = parameter.find('PageLibrary')
                item['room'] = parameter[locatetheroom:]
                locatethestart = response.url.find('#', 0)
                full_url = response.url[:locatethestart] + parameter
                yield Request(url=full_url,
                              cookies={'LtpaToken2': 'jHxHvqs+NeT...'})

        for folder in anyfolder:
            folderparameter = folder.xpath('a/@href').extract()
            locatethestart = response.url.find('#', 0)
            folder_url = response.url[:locatethestart] + folderparameter
            yield Request(url=folder_url, callback='parse_folder',
                          cookies={'LtpaToken2': 'jHxHvqs+NeT...'})

        for File in anyfile:
            fileparameter = File.xpath('a/@href').extract()
            locatethestart = response.url.find('#', 0)
            file_url = response.url[:locatethestart] + fileparameter
            yield Request(url=file_url, callback='parse_file',
                          cookies={'LtpaToken2': 'jHxHvqs+NeT...'})

    def parse_folder(self, response):
        findfolder = response.xpath("//div[@class='lotusHeader']")
        folderitem = CrawlkmssFolder()
        folderitem['foldername'] = findfolder.xpath('h1/span/span/text()').extract()
        folderitem['url'] = response.url[response.url.find("unid=") + 5:]
        yield folderitem

    def parse_file(self, response):
        findfile = response.xpath("//div[@class='lotusContent']")
        fileitem = CrawlkmssFile()
        fileitem['filename'] = findfile.xpath('a/text()').extract()
        fileitem['title'] = findfile.xpath(".//div[@class='qkrTitle']/span/@title").extract()
        fileitem['author'] = findfile.xpath(".//div[@class='lotusMeta']/span[3]/span/text()").extract()
        yield fileitem

The information I intend to crawl:

What appears in the left sidebar:

[screenshot of the left sidebar]

A folder:

[screenshot of a folder]

The log:

c:\Users\~\crawlKMSS>scrapy crawl kmss 
2015-07-28 17:54:59 [scrapy] INFO: Scrapy 1.0.1 started (bot: crawlKMSS) 
2015-07-28 17:54:59 [scrapy] INFO: Optional features available: ssl, http11, boto 
2015-07-28 17:54:59 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'crawlKMSS.spiders', 'SPIDER_MODULES': ['crawlKMSS.spiders'], 'BOT_NAME': 'crawlKMSS'} 
2015-07-28 17:54:59 [py.warnings] WARNING: :0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'. Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected. 

2015-07-28 17:54:59 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState 
2015-07-28 17:54:59 [boto] DEBUG: Retrieving credentials from metadata server. 
2015-07-28 17:55:00 [boto] ERROR: Caught exception reading instance data 
Traceback (most recent call last): 
    File "C:\Users\yclam1\AppData\Local\Continuum\Anaconda\lib\site-packages\boto\utils.py", line 210, in retry_url 
    r = opener.open(req, timeout=timeout) 
    File "C:\Users\yclam1\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 431, in open 
    response = self._open(req, data) 
    File "C:\Users\yclam1\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 449, in _open 
    '_open', req) 
    File "C:\Users\yclam1\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 409, in _call_chain 
    result = func(*args) 
    File "C:\Users\yclam1\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 1227, in http_open 
    return self.do_open(httplib.HTTPConnection, req) 
    File "C:\Users\yclam1\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 1197, in do_open 
    raise URLError(err) 
URLError: <urlopen error timed out> 
2015-07-28 17:55:00 [boto] ERROR: Unable to read instance data, giving up 
2015-07-28 17:55:01 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, HttpProxyMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2015-07-28 17:55:01 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2015-07-28 17:55:01 [scrapy] INFO: Enabled item pipelines: 
2015-07-28 17:55:01 [scrapy] INFO: Spider opened 
2015-07-28 17:55:01 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2015-07-28 17:55:01 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2015-07-28 17:55:05 [scrapy] DEBUG: Crawled (200) <GET https://kmssqkr.hksarg/LotusQuickr/dept/Main.nsf?OpenDatabase&Login> (referer: None) 
2015-07-28 17:55:10 [scrapy] DEBUG: Crawled (200) <POST https://kmssqkr..hksarg/names.nsf?Login> (referer: https://kmssqkr.hksarg/LotusQuickr/dept/Main.nsf?OpenDatabase&Login) 
2015-07-28 17:55:10 [kmss] DEBUG: 



Successfuly Logged in 



2015-07-28 17:55:10 [scrapy] DEBUG: Crawled (200) <GET https://kmssqkr.hksarg/LotusQuickr/dept/Main.nsf/h_RoomHome/ade682e34fc59d274825770b0037d278/?OpenDocument#%7Bunid=ADE682E34FC59D274825770B0037D278%7D> (referer: https://kmssqkr.hksarg/names.nsf?Login) 
2015-07-28 17:55:10 [scrapy] INFO: Closing spider (finished) 
2015-07-28 17:55:10 [scrapy] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 1636, 

Any help would be appreciated!

Answers


I think you are over-complicating this. Why inherit from scrapy.Spider and do all of the heavy lifting yourself when you have scrapy.CrawlSpider? A Spider is normally used to scrape a list of pages, while a CrawlSpider is made for crawling whole websites.

This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules.
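A minimal sketch of how that could look for this site is below. The rules, XPaths and login details are assumptions carried over from the question and would need to be adapted:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request, FormRequest


class KmssCrawlSpider(CrawlSpider):
    # names and URLs below are taken from the question; adjust as needed
    name = 'kmss_crawl'
    allowed_domains = ['kmssqkr.hksarg']
    login_page = 'https://kmssqkr.hksarg/LotusQuickr/dept/Main.nsf?OpenDatabase&Login'
    start_urls = ['https://kmssqkr.hksarg/LotusQuickr/dept/Main.nsf/h_RoomHome/'
                  'ade682e34fc59d274825770b0037d278/?OpenDocument']

    # one Rule per kind of link to follow; the restrict_xpaths are assumptions
    rules = (
        Rule(LinkExtractor(restrict_xpaths="//*[@class='q-folderItem']/h4"),
             callback='parse_folder', follow=True),
        Rule(LinkExtractor(restrict_xpaths="//*[@class='q-otherItem']/h4"),
             callback='parse_file'),
    )

    def start_requests(self):
        # log in first; the rules only take over once we reach start_urls
        yield Request(self.login_page, callback=self.login, dont_filter=True)

    def login(self, response):
        return FormRequest.from_response(
            response,
            formdata={'user': 'username', 'password': 'pw'},
            callback=self.after_login)

    def after_login(self, response):
        if 'Welcome' in response.body:
            for url in self.start_urls:
                # no callback: CrawlSpider's default parse() applies the rules
                yield Request(url)
        else:
            self.log('Login failed')

    def parse_folder(self, response):
        pass  # extract folder fields here

    def parse_file(self, response):
        pass  # extract file fields here

The important difference from the spider in the question is that link following is declared once in the rules instead of being re-implemented inside parse().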


I am limited by my Python knowledge, and I have trouble using CrawlSpider, especially since I need to extract items from several places on the site, and it does not start crawling after logging in. – yukclam9


There is a warning in your log, and your traceback shows that the error occurs when opening the HTTP connection.

2015-07-28 17:54:59 [py.warnings] WARNING: :0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'. Please install it from https://pypi.python.org/pypi/service_identity and make sure all of its dependencies are satisfied. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
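The usual fix is to install service_identity (together with a recent pyOpenSSL) into the same Python environment that actually runs Scrapy. A rough diagnostic sketch, meant to be run with the interpreter that launches scrapy crawl:

from __future__ import print_function
import sys

# shows which interpreter is in use (relevant with Anaconda / multiple Pythons)
print('interpreter: ' + sys.executable)

try:
    import OpenSSL
    import service_identity
    print('pyOpenSSL: ' + OpenSSL.__version__)
    print('service_identity: ' + service_identity.__version__)
except ImportError as exc:
    # if this prints, the packages live in a different environment
    # than the one Scrapy is started from
    print('missing: ' + str(exc))

If the import fails here, installing the packages for this specific interpreter should make the warning go away.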


That is quite tricky (I think), because I have already installed the latest versions of service_identity and pyOpenSSL. How should I deal with this error? – yukclam9


The tricky thing would indeed be to somehow convince Scrapy that it must be wrong, because Scrapy is telling you: **you do not have the service_identity module installed**. Trying to fix your setup instead might be easier. –