Python/Scrapy: fetching start_urls

I have wasted several days trying to wrap my head around Scrapy and why my CrawlSpider stops, reading the docs, Scrapy blogs and Q&As... and now I am about to do what men hate most: ask for directions ;-) The problem is: my spider opens and fetches the start_urls, but apparently does nothing with them. Instead, it closes right away, and that's it. Apparently I don't even get to see the first self.log() statement.

This is what I have got so far:

# -*- coding: utf-8 -*- 
import scrapy 
# from scrapy.shell import inspect_response 
from scrapy.spiders import CrawlSpider, Rule 
from scrapy.linkextractors import LinkExtractor 
from scrapy.selector import Selector 
from scrapy.http import HtmlResponse, FormRequest, Request 
from KiPieSpider.items import * 
from KiPieSpider.settings import * 

class KiSpider(CrawlSpider):
    name = "KiSpider"
    allowed_domains = ['www.kiweb.de', 'kiweb.de']
    start_urls = (
        # ST Regra start page:
        'https://www.kiweb.de/default.aspx?pageid=206',
        # follow ST Regra links in the form of:
        # https://www.kiweb.de/default.aspx?pageid=206&page=\d+
        # https://www.kiweb.de/default.aspx?pageid=299&docid=\d{6}
        # ST Thermo start page:
        'https://www.kiweb.de/default.aspx?pageid=202&page=1',
        # follow ST Thermo links in the form of:
        # https://www.kiweb.de/default.aspx?pageid=202&page=\d+
        # https://www.kiweb.de/default.aspx?pageid=299&docid=\d{6}
    )
    rules = (
        # First rule that matches a given link is followed/parsed.
        # Follow category pagination without further parsing:
        Rule(
            LinkExtractor(
                # Extract links in the form:
                allow=r'Default\.aspx?pageid=(202|206])&page=\d+',
                # but only within the pagination table cell:
                restrict_xpaths=('//td[@id="ctl04_teaser_next"]'),
            ),
            follow=True,
        ),
        # Follow links to category (202|206) articles and parse them:
        Rule(
            LinkExtractor(
                # Extract links in the form:
                allow=r'Default\.aspx?pageid=299&docid=\d+',
                # but only within article preview cells:
                restrict_xpaths=("//td[@class='TOC-zelle TOC-text']"),
            ),
            # and parse the resulting pages for article content:
            callback='parse_init',
            follow=False,
        ),
    )

    # Once an article page is reached, check whether a login is necessary:
    def parse_init(self, response):
        self.log('Parsing article: %s' % response.url)
        if not response.xpath('input[@value="Logout"]'):
            # Note: response.xpath() is a shortcut of response.selector.xpath()
            self.log('Not logged in. Logging in...\n')
            return self.login(response)
        else:
            self.log('Already logged in. Continue crawling...\n')
            return self.parse_item(response)

    def login(self, response):
        self.log("Trying to log in...\n")
        self.username = self.settings['KI_USERNAME']
        self.password = self.settings['KI_PASSWORD']
        return FormRequest.from_response(
            response,
            formname='Form1',
            formdata={
                # needs name, not id attributes!
                'ctl04$Header$ctl01$textbox_username': self.username,
                'ctl04$Header$ctl01$textbox_password': self.password,
                'ctl04$Header$ctl01$textbox_logindaten_typ': 'Username_Passwort',
                'ctl04$Header$ctl01$checkbox_permanent': 'True',
            },
            callback=self.parse_item,
        )

    def parse_item(self, response):
        articles = response.xpath('//div[@id="artikel"]')
        items = []
        for article in articles:
            item = KiSpiderItem()
            item['link'] = response.url
            item['title'] = articles.xpath("div[@class='ct1']/text()").extract()
            item['subtitle'] = articles.xpath("div[@class='ct2']/text()").extract()
            item['article'] = articles.extract()
            item['published'] = articles.xpath("div[@class='biblio']/text()").re(r"(\d{2}.\d{2}.\d{4}) PIE")
            item['artid'] = articles.xpath("div[@class='biblio']/text()").re(r"PIE \[(d+)-\d+\]")
            item['lang'] = 'de-DE'
            items.append(item)
        # return(items)
        yield items
        # what is the difference between return and yield?? found both on web.

Running scrapy crawl KiSpider results in:

2017-03-09 18:03:33 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: KiPieSpider) 
2017-03-09 18:03:33 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'KiPieSpider.spiders', 'DEPTH_LIMIT': 3, 'CONCURRENT_REQUESTS': 8, 'SPIDER_MODULES': ['KiPieSpider.spiders'], 'BOT_NAME': 'KiPieSpider', 'DOWNLOAD_TIMEOUT': 60, 'USER_AGENT': 'KiPieSpider ([email protected])', 'DOWNLOAD_DELAY': 0.25} 
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled extensions: 
['scrapy.extensions.logstats.LogStats', 
'scrapy.extensions.telnet.TelnetConsole', 
'scrapy.extensions.corestats.CoreStats'] 
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled downloader middlewares: 
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 
'scrapy.downloadermiddlewares.retry.RetryMiddleware', 
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 
'scrapy.downloadermiddlewares.stats.DownloaderStats'] 
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled spider middlewares: 
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 
'scrapy.spidermiddlewares.referer.RefererMiddleware', 
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 
'scrapy.spidermiddlewares.depth.DepthMiddleware'] 
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled item pipelines: 
[] 
2017-03-09 18:03:33 [scrapy.core.engine] INFO: Spider opened 
2017-03-09 18:03:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2017-03-09 18:03:33 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023 
2017-03-09 18:03:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.kiweb.de/default.aspx?pageid=206> (referer: None) 
2017-03-09 18:03:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.kiweb.de/default.aspx?pageid=202&page=1> (referer: None) 
2017-03-09 18:03:34 [scrapy.core.engine] INFO: Closing spider (finished) 
2017-03-09 18:03:34 [scrapy.statscollectors] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 465, 
'downloader/request_count': 2, 
'downloader/request_method_count/GET': 2, 
'downloader/response_bytes': 48998, 
'downloader/response_count': 2, 
'downloader/response_status_count/200': 2, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2017, 3, 9, 17, 3, 34, 235000), 
'log_count/DEBUG': 3, 
'log_count/INFO': 7, 
'response_received_count': 2, 
'scheduler/dequeued': 2, 
'scheduler/dequeued/memory': 2, 
'scheduler/enqueued': 2, 
'scheduler/enqueued/memory': 2, 
'start_time': datetime.datetime(2017, 3, 9, 17, 3, 33, 295000)} 
2017-03-09 18:03:34 [scrapy.core.engine] INFO: Spider closed (finished) 

Shouldn't the login routine end not with a callback, but with some kind of return/yield statement? Or am I doing something else wrong? Unfortunately, the docs and tutorials I have seen so far only give me a vague idea of how each bit connects to the others; in particular, Scrapy's documentation seems to be written as a reference for people who already know a lot about Scrapy.

Somewhat frustrated greetings, Christopher

Answer

rules = (
    # First rule that matches a given link is followed/parsed.
    # Follow category pagination without further parsing:
    Rule(
        LinkExtractor(
            # Extract links in the form:
            # allow=r'Default\.aspx?pageid=(202|206])&page=\d+',
            # but only within the pagination table cell:
            restrict_xpaths=('//td[@id="ctl04_teaser_next"]'),
        ),
        follow=True,
    ),
    # Follow links to category (202|206) articles and parse them:
    Rule(
        LinkExtractor(
            # Extract links in the form:
            # allow=r'Default\.aspx?pageid=299&docid=\d+',
            # but only within article preview cells:
            restrict_xpaths=("//td[@class='TOC-zelle TOC-text']"),
        ),
        # and parse the resulting pages for article content:
        callback='parse_init',
        follow=False,
    ),
)

You don't need the allow argument, because there is only one link in the tag selected by the XPath.

I don't understand the regular expressions in the allow arguments, but at the very least you should escape the ?.
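
For illustration only (my own sketch, not part of the original answer): this is roughly what the two allow patterns could look like with the ? escaped, so that it matches the literal ? at the start of the query string instead of making the preceding x optional. Using the lowercase default as it appears in the actual URLs and dropping the stray ] after 206 are additional assumptions of mine, not something the answer states.

rules = (
    Rule(
        LinkExtractor(
            # escaped \? matches the literal '?' that starts the query string
            allow=r'default\.aspx\?pageid=(202|206)&page=\d+',
            restrict_xpaths='//td[@id="ctl04_teaser_next"]',
        ),
        follow=True,
    ),
    Rule(
        LinkExtractor(
            allow=r'default\.aspx\?pageid=299&docid=\d+',
            restrict_xpaths="//td[@class='TOC-zelle TOC-text']",
        ),
        callback='parse_init',
        follow=False,
    ),
)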

+1

Thank you so much, it was the unescaped ? inside the allow argument!