Need help with CrawlSpider in Scrapy

I'm new to Scrapy and got stuck while trying to use a CrawlSpider to extract data from multiple pages of a website.

Here is my code:

class ivwSpider(CrawlSpider):

    name = "ivw-online"
    allowed_domains = ["ausweisung.ivw-online.de/"]
    start_urls = ["http://ausweisung.ivw-online.de/index.php?i=1161&a=o44847"]

    pagelink = LinkExtractor(allow=('index.php?i=1161&a=o\d{5}'))
    print(pagelink)
    rules = (Rule(pagelink, callback='parse_item', follow=True),)

    def parse_item(self, response):

        sel = Selector(response)

        item = IVWItem()
        item["Type"] = sel.xpath('//div[@class ="statistik"]//tr[1]//td/text()')[0].extract()
        item["Zeitraum"] = sel.xpath('//div[@class ="tabelle"]//tr[1]//div[@style="width:210px; text-align:center;"]/text()')[0].extract()
        item["Company"] = sel.xpath('//div[@class ="stammdaten"]//tr//td/text()').extract()[-1]
        item["Video_PIs"] = sel.xpath('//div[@class ="tabelle"]//tr[11]//td[@class ="z5"]/text()').extract()
        item["Video_Visits"] = sel.xpath('//div[@class ="tabelle"]//tr[11]//td[@class ="z4"]/text()').extract()
        item["PIs"] = sel.xpath('//div[@class ="statistik"]//tr[3]//td/text()')[1].extract()
        item["Visits"] = sel.xpath('//div[@class ="statistik"]//tr[1]//td/text()')[1].extract()

        return item

When I run the code, no results are returned. Is this a problem with how the rule is defined? Any help here is really appreciated!

Source

2017-04-24 Bin Song

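I tested your spider as a normal Spider (not a CrawlSpider), and it extracts the data from the start_url. So the problem seems to be, as you already guessed, the CrawlSpider rules. It isn't obvious to me which links on the page you want to follow. Could you edit your question and add details about which hrefs should be followed?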

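Thanks a lot for checking. The links I want to follow are the online-data pages of the different competitors listed in the IVW Ausweisung, such as http://ausweisung.ivw-online.de/index.php?i=1161&a=o44847, http://ausweisung.ivw-online.de/index.php?i=1161&a=o44851, and so on. The only difference between these URLs is the number that follows i=1161&a=o.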
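The URL pattern described in this comment can be checked outside Scrapy. The following is a minimal sketch using only Python's `re` module, on the assumption that `LinkExtractor` applies its `allow` patterns as regular-expression searches against the absolute URL. It suggests one likely reason the rule never fires: in the question's pattern the `?` is unescaped, so it makes the preceding `p` optional instead of matching the literal `?` in the URL. The variable names are illustrative only.

```python
import re

# The two example URLs given in the comment above.
urls = [
    "http://ausweisung.ivw-online.de/index.php?i=1161&a=o44847",
    "http://ausweisung.ivw-online.de/index.php?i=1161&a=o44851",
]

# Pattern as written in the question: "?" is a quantifier here (optional "p")
# and "." matches any character, so the literal "?" in the URL is never matched.
unescaped = re.compile(r"index.php?i=1161&a=o\d{5}")

# Escaped variant: "\." and "\?" match the literal characters.
escaped = re.compile(r"index\.php\?i=1161&a=o\d{5}")

for url in urls:
    print(url, bool(unescaped.search(url)), bool(escaped.search(url)))
```

If that assumption holds, passing the escaped pattern as a raw string to `allow` would be the corresponding change in the spider. The trailing slash in `allowed_domains` is a separate detail that the answer's code also drops.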

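Since your start_url is already a detail page, on which I can't find a list of the other competitors, I would start one level higher in the site hierarchy. There is a table there with a long list of competitors.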

From that start_url you can collect the URLs of all the companies and create the requests yourself, passing your own callback, like this:

import scrapy

# IVWItem is the item class from the question's project; import it from
# wherever it is defined there.


class ivwSpider(scrapy.Spider):

    name = "ivw-online"
    allowed_domains = ["ausweisung.ivw-online.de"]  # bare domain, no trailing slash
    start_urls = ["http://ausweisung.ivw-online.de/index.php?i=116"]

    def parse(self, response):
        # Rows of the overview table; rows with a link point to a detail page.
        sel_rows = response.xpath('//div[@class="daten"]/div[@class="tabelle"]//tr')

        for sel_row in sel_rows:
            url_detail = sel_row.xpath('./td[@class="a_main_txt"][1]/a/@href').extract_first()
            if url_detail:
                # Resolve the href against the current page URL (needed if it is relative).
                url = response.urljoin(url_detail)
                yield scrapy.Request(url, callback=self.parse_item)

    def parse_item(self, response):
        # Same extraction logic as in the question, called once per detail page.
        item = IVWItem()
        item["Type"] = response.xpath('//div[@class ="statistik"]//tr[1]//td/text()')[0].extract()
        item["Zeitraum"] = response.xpath('//div[@class ="tabelle"]//tr[1]//div[@style="width:210px; text-align:center;"]/text()')[0].extract()
        item["Company"] = response.xpath('//div[@class ="stammdaten"]//tr//td/text()').extract()[-1]
        item["Video_PIs"] = response.xpath('//div[@class ="tabelle"]//tr[11]//td[@class ="z5"]/text()').extract()
        item["Video_Visits"] = response.xpath('//div[@class ="tabelle"]//tr[11]//td[@class ="z4"]/text()').extract()
        item["PIs"] = response.xpath('//div[@class ="statistik"]//tr[3]//td/text()')[1].extract()
        item["Visits"] = response.xpath('//div[@class ="statistik"]//tr[1]//td/text()')[1].extract()

        yield item

Note that the base class is no longer CrawlSpider but Spider.
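The `response.urljoin(url_detail)` call in `parse` turns whatever is in the `href` attribute into an absolute URL before the request is created. As a rough illustration of that step, here is a sketch using the standard library's `urllib.parse.urljoin`, which performs the same kind of resolution. The relative `href` value is hypothetical; the actual markup of the overview page is not shown in this thread.

```python
from urllib.parse import urljoin

# URL of the overview page used as start_url in the answer.
page_url = "http://ausweisung.ivw-online.de/index.php?i=116"

# Hypothetical relative href, as it might appear in a table row.
href = "index.php?i=1161&a=o44847"

# The last path segment and the query of the base URL are replaced by the href.
detail_url = urljoin(page_url, href)
print(detail_url)

# An href that is already absolute is returned unchanged.
absolute = "http://ausweisung.ivw-online.de/index.php?i=1161&a=o44851"
print(urljoin(page_url, absolute))
```

Resolving the link this way means the spider works whether the site emits relative or absolute hrefs.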

Source

2017-04-25 12:48:25

Thanks a lot! It works.
