Python/Scrapy: fetching start_urls

I have wasted several days trying to wrap my head around Scrapy and why my CrawlSpider stops, reading the docs, Scrapy blogs and Q&As... and now I am about to do what men hate most: ask for directions ;-) The problem is: my spider opens and fetches the start_urls, but apparently does nothing with them. Instead, it closes right away, and that's it. Apparently I don't even get to see the first self.log() statement.

This is what I have got so far:

# -*- coding: utf-8 -*- 
import scrapy 
# from scrapy.shell import inspect_response 
from scrapy.spiders import CrawlSpider, Rule 
from scrapy.linkextractors import LinkExtractor 
from scrapy.selector import Selector 
from scrapy.http import HtmlResponse, FormRequest, Request 
from KiPieSpider.items import * 
from KiPieSpider.settings import * 

class KiSpider(CrawlSpider):
    name = "KiSpider"
    allowed_domains = ['www.kiweb.de', 'kiweb.de']
    start_urls = (
        # ST Regra start page:
        'https://www.kiweb.de/default.aspx?pageid=206',
        # follow ST Regra links in the form of:
        # https://www.kiweb.de/default.aspx?pageid=206&page=\d+
        # https://www.kiweb.de/default.aspx?pageid=299&docid=\d{6}
        # ST Thermo start page:
        'https://www.kiweb.de/default.aspx?pageid=202&page=1',
        # follow ST Thermo links in the form of:
        # https://www.kiweb.de/default.aspx?pageid=202&page=\d+
        # https://www.kiweb.de/default.aspx?pageid=299&docid=\d{6}
    )
    rules = (
        # First rule that matches a given link is followed/parsed.
        # Follow category pagination without further parsing:
        Rule(
            LinkExtractor(
                # Extract links in the form:
                allow=r'Default\.aspx?pageid=(202|206])&page=\d+',
                # but only within the pagination table cell:
                restrict_xpaths=('//td[@id="ctl04_teaser_next"]'),
            ),
            follow=True,
        ),
        # Follow links to category (202|206) articles and parse them:
        Rule(
            LinkExtractor(
                # Extract links in the form:
                allow=r'Default\.aspx?pageid=299&docid=\d+',
                # but only within article preview cells:
                restrict_xpaths=("//td[@class='TOC-zelle TOC-text']"),
            ),
            # and parse the resulting pages for article content:
            callback='parse_init',
            follow=False,
        ),
    )

    # Once an article page is reached, check whether a login is necessary:
    def parse_init(self, response):
        self.log('Parsing article: %s' % response.url)
        if not response.xpath('input[@value="Logout"]'):
            # Note: response.xpath() is a shortcut of response.selector.xpath()
            self.log('Not logged in. Logging in...\n')
            return self.login(response)
        else:
            self.log('Already logged in. Continue crawling...\n')
            return self.parse_item(response)

    def login(self, response):
        self.log("Trying to log in...\n")
        self.username = self.settings['KI_USERNAME']
        self.password = self.settings['KI_PASSWORD']
        return FormRequest.from_response(
            response,
            formname='Form1',
            formdata={
                # needs name, not id attributes!
                'ctl04$Header$ctl01$textbox_username': self.username,
                'ctl04$Header$ctl01$textbox_password': self.password,
                'ctl04$Header$ctl01$textbox_logindaten_typ': 'Username_Passwort',
                'ctl04$Header$ctl01$checkbox_permanent': 'True',
            },
            callback=self.parse_item,
        )

    def parse_item(self, response):
        articles = response.xpath('//div[@id="artikel"]')
        items = []
        for article in articles:
            item = KiSpiderItem()
            item['link'] = response.url
            item['title'] = articles.xpath("div[@class='ct1']/text()").extract()
            item['subtitle'] = articles.xpath("div[@class='ct2']/text()").extract()
            item['article'] = articles.extract()
            item['published'] = articles.xpath("div[@class='biblio']/text()").re(r"(\d{2}.\d{2}.\d{4}) PIE")
            item['artid'] = articles.xpath("div[@class='biblio']/text()").re(r"PIE \[(d+)-\d+\]")
            item['lang'] = 'de-DE'
            items.append(item)
        # return(items)
        yield items
        # what is the difference between return and yield?? found both on web.

Running scrapy crawl KiSpider results in:

2017-03-09 18:03:33 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: KiPieSpider) 
2017-03-09 18:03:33 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'KiPieSpider.spiders', 'DEPTH_LIMIT': 3, 'CONCURRENT_REQUESTS': 8, 'SPIDER_MODULES': ['KiPieSpider.spiders'], 'BOT_NAME': 'KiPieSpider', 'DOWNLOAD_TIMEOUT': 60, 'USER_AGENT': 'KiPieSpider ([email protected])', 'DOWNLOAD_DELAY': 0.25} 
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled extensions: 
['scrapy.extensions.logstats.LogStats', 
'scrapy.extensions.telnet.TelnetConsole', 
'scrapy.extensions.corestats.CoreStats'] 
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled downloader middlewares: 
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 
'scrapy.downloadermiddlewares.retry.RetryMiddleware', 
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 
'scrapy.downloadermiddlewares.stats.DownloaderStats'] 
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled spider middlewares: 
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 
'scrapy.spidermiddlewares.referer.RefererMiddleware', 
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 
'scrapy.spidermiddlewares.depth.DepthMiddleware'] 
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled item pipelines: 
[] 
2017-03-09 18:03:33 [scrapy.core.engine] INFO: Spider opened 
2017-03-09 18:03:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2017-03-09 18:03:33 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023 
2017-03-09 18:03:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.kiweb.de/default.aspx?pageid=206> (referer: None) 
2017-03-09 18:03:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.kiweb.de/default.aspx?pageid=202&page=1> (referer: None) 
2017-03-09 18:03:34 [scrapy.core.engine] INFO: Closing spider (finished) 
2017-03-09 18:03:34 [scrapy.statscollectors] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 465, 
'downloader/request_count': 2, 
'downloader/request_method_count/GET': 2, 
'downloader/response_bytes': 48998, 
'downloader/response_count': 2, 
'downloader/response_status_count/200': 2, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2017, 3, 9, 17, 3, 34, 235000), 
'log_count/DEBUG': 3, 
'log_count/INFO': 7, 
'response_received_count': 2, 
'scheduler/dequeued': 2, 
'scheduler/dequeued/memory': 2, 
'scheduler/enqueued': 2, 
'scheduler/enqueued/memory': 2, 
'start_time': datetime.datetime(2017, 3, 9, 17, 3, 33, 295000)} 
2017-03-09 18:03:34 [scrapy.core.engine] INFO: Spider closed (finished) 

Shouldn't the login routine end not with a callback, but with some kind of return/yield statement? Or am I doing something else wrong? Unfortunately, the docs and tutorials I have seen so far only give me a vague idea of how each bit connects to the others; in particular, Scrapy's documentation seems to be written as a reference for people who already know a lot about Scrapy.

Somewhat frustrated greetings, Christopher

Answer

rules = (
    # First rule that matches a given link is followed/parsed.
    # Follow category pagination without further parsing:
    Rule(
        LinkExtractor(
            # Extract links in the form:
            # allow=r'Default\.aspx?pageid=(202|206])&page=\d+',
            # but only within the pagination table cell:
            restrict_xpaths=('//td[@id="ctl04_teaser_next"]'),
        ),
        follow=True,
    ),
    # Follow links to category (202|206) articles and parse them:
    Rule(
        LinkExtractor(
            # Extract links in the form:
            # allow=r'Default\.aspx?pageid=299&docid=\d+',
            # but only within article preview cells:
            restrict_xpaths=("//td[@class='TOC-zelle TOC-text']"),
        ),
        # and parse the resulting pages for article content:
        callback='parse_init',
        follow=False,
    ),
)

You don't need the allow argument, because there is only one link in the tag selected by the XPath.

I don't understand the regular expressions in the allow arguments, but at the very least you should escape the ?.
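
For illustration only (my own sketch, not part of the original answer): this is roughly what the two allow patterns could look like with the ? escaped, so that it matches the literal ? at the start of the query string instead of making the preceding x optional. Using the lowercase default as it appears in the actual URLs and dropping the stray ] after 206 are additional assumptions of mine, not something the answer states.

rules = (
    Rule(
        LinkExtractor(
            # escaped \? matches the literal '?' that starts the query string
            allow=r'default\.aspx\?pageid=(202|206)&page=\d+',
            restrict_xpaths='//td[@id="ctl04_teaser_next"]',
        ),
        follow=True,
    ),
    Rule(
        LinkExtractor(
            allow=r'default\.aspx\?pageid=299&docid=\d+',
            restrict_xpaths="//td[@class='TOC-zelle TOC-text']",
        ),
        callback='parse_init',
        follow=False,
    ),
)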

+1

Thank you so much, it was the unescaped ? inside the allow argument!