List populated with Scrapy is returned before it is actually filled

This involves almost the same code I asked a different question about this morning, so if it looks familiar, that's because it is.

class LbcSubtopicSpider(scrapy.Spider):

    ...irrelevant/sensitive code...

    rawTranscripts = []
    rawTranslations = []

    def parse(self, response):
        rawTitles = []
        rawVideos = []
        for sel in response.xpath('//ul[1]'): #only scrape the first list

            ...irrelevant code...

            index = 0
            for sub in sel.xpath('li/ul/li/a'): #scrape the sublist items
                index += 1
                if index%2!=0: #odd numbered entries are the transcripts
                    transcriptLink = sub.xpath('@href').extract()
                    #url = response.urljoin(transcriptLink[0])
                    #yield scrapy.Request(url, callback=self.parse_transcript)
                else: #even numbered entries are the translations
                    translationLink = sub.xpath('@href').extract()
                    url = response.urljoin(translationLink[0])
                    yield scrapy.Request(url, callback=self.parse_translation)

        print rawTitles
        print rawVideos
        print "translations:"
        print self.rawTranslations

    def parse_translation(self, response):
        for sel in response.xpath('//p[not(@class)]'):
            rawTranslation = sel.xpath('text()').extract()
            rawTranslation = ''.join(rawTranslation)
            #print rawTranslation
            self.rawTranslations.append(rawTranslation)
            #print self.rawTranslations

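My problem is that "print self.rawTranslations" in the parse(...) method prints nothing but "[]". This could mean one of two things: either the list is being reset right before it is printed, or it is being printed before the calls to parse_translation(...), which are made for the links that parse(...) follows, have finished populating it. I'm inclined to suspect the latter, since I can't see any code that would reset the list, unless "rawTranslations = []" in the class body is run multiple times. It's worth noting that if I uncomment the same line in parse_translation(...), it prints the desired output, which means the text is being extracted correctly and the problem seems to be unique to the main parse(...) method.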

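My attempts to resolve what I believe is a synchronization problem were pretty aimless: I just tried using an RLock object based on as many Google tutorials as I could find, and I'm 99% sure I misused it anyway, since the result was identical.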

Source

2016-07-06 jah

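I've spent the past hour searching the internet trying to better understand locking in Python, but I haven't gotten very far. My idea is to release the lock after the last subpage visit has completed, but I've found very few syntax examples. – jah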

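So, this seems like a somewhat hacky solution, especially since I just learned about Scrapy's request priority feature, but here is my new code, which gives the desired result: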

class LbcVideosSpider(scrapy.Spider):

    ...code omitted...

    done = 0 #variable to keep track of subtopic iterations
    rawTranscripts = []
    rawTranslations = []

    def parse(self, response):
        #initialize containers for each field
        rawTitles = []
        rawVideos = []

        ...code omitted...

            index = 0
            query = sel.xpath('li/ul/li/a')
            for sub in query: #scrape the sublist items
                index += 1
                if index%2!=0: #odd numbered entries are the transcripts
                    transcriptLink = sub.xpath('@href').extract()
                    #url = response.urljoin(transcriptLink[0])
                    #yield scrapy.Request(url, callback=self.parse_transcript)
                else: #even numbered entries are the translations
                    translationLink = sub.xpath('@href').extract()
                    url = response.urljoin(translationLink[0])
                    yield scrapy.Request(url, callback=self.parse_translation,
                        meta={'index': index/2, 'maxIndex': len(query)/2})

        print rawTitles
        print rawVideos

    def parse_translation(self, response):
        #grab meta variables
        i = response.meta['index']
        maxIndex = response.meta['maxIndex']

        #interested in p nodes without class
        query = response.xpath('//p[not(@class)]')
        for sel in query:
            rawTranslation = sel.xpath('text()').extract()
            rawTranslation = ''.join(rawTranslation) #collapse each line
            self.rawTranslations.append(rawTranslation)

            #increment number of translations done, check if finished
            self.done += 1
            print self.done
            if self.done==maxIndex:
                print self.rawTranslations

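Basically, I just keep track of how many requests have completed and make some code conditional on a request being the final one. This prints the fully populated list.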
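The counting idea in this answer can be looked at separately from Scrapy. The following is a minimal sketch in plain Python, assuming the number of expected results is known up front. CompletionTracker, record and on_complete are made-up names for illustration, not part of Scrapy. It also stores each result under the index carried in meta, so the final list comes out in page order even if the responses arrive out of order.

```python
class CompletionTracker:
    """Hypothetical helper: collects results and fires once when all are in."""

    def __init__(self, expected, on_complete):
        self.expected = expected        # how many results we are waiting for
        self.on_complete = on_complete  # called exactly once, with the full list
        self.results = {}               # index -> result, filled in any order
        self.done = 0

    def record(self, index, result):
        self.results[index] = result
        self.done += 1
        if self.done == self.expected:
            # Last result has arrived: hand over the results in index order.
            ordered = [self.results[i] for i in sorted(self.results)]
            self.on_complete(ordered)


collected = []
tracker = CompletionTracker(expected=3, on_complete=collected.append)

# Results arrive in an arbitrary order, as asynchronous responses would.
for index, text in [(2, "third"), (0, "first"), (1, "second")]:
    tracker.record(index, text)

print(collected)  # [['first', 'second', 'third']]
```

One caveat with this pattern: if any request fails, or is dropped by Scrapy's duplicate filter, the counter never reaches the expected value and the final step never runs. Attaching an errback to each request that also counts towards completion is one way to guard against that.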

Source

2016-07-08 14:36:45 jah

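The problem here is that you don't understand how Scrapy really works.

Scrapy is a crawling framework for building website spiders, not just a tool for making requests; that is what the requests module is for.

Scrapy's requests work asynchronously. When you call yield Request(...), you add the request to a stack of pending requests that will be executed at some point (you have no control over when). This means you cannot expect the response to have been processed by the time the code after yield Request(...) runs. In fact, your method should always end by yielding a Request or an Item.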
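To make the ordering concrete, here is a toy model in plain Python. It is not Scrapy's engine: run, log and results are invented for illustration, and real Scrapy downloads pages concurrently with no guaranteed order. The sketch only shows that yielding a request queues it, so the code after the loop in parse runs while the list is still empty.

```python
from collections import deque

results = []  # plays the role of self.rawTranslations
log = []      # records the order in which things happen


def parse(urls):
    for url in urls:
        # Yielding only hands the request to the scheduler; nothing is fetched yet.
        yield (url, parse_translation)
    # Reached once the generator is drained, before any callback has run.
    log.append(("parse finished", list(results)))


def parse_translation(url):
    results.append("text from " + url)
    log.append(("callback ran", url))
    return ()  # this callback schedules no further requests


def run(start_requests):
    queue = deque()
    queue.extend(start_requests)     # consumes the whole generator first
    while queue:
        url, callback = queue.popleft()
        queue.extend(callback(url))  # "download" the page, then call back


run(parse(["page1", "page2"]))
print(log)
```

The first log entry is ("parse finished", []), followed by the two callback entries: parse has already finished by the time the callbacks fill the list, which matches the empty "[]" seen in the question.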

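Now, from what I can see, and as in most cases of confusion with Scrapy, you want to keep populating an item that you created in one method, but the information you need is in a different request.

In that case, the communication is usually done with the meta parameter of Request, something like this: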

... 
    yield Request(url, callback=self.second_method, meta={'item': myitem, 'moreinfo': 'moreinfo', 'foo': 'bar'}) 

def second_method(self, response): 
    previous_meta_info = response.meta 
    # I can access the previous item with `response.meta['item']` 
    ...

Source

2016-07-06 22:45:31 eLRuLL

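I don't know whether this changes anything, but the item I'm trying to populate isn't created in a method; it is created as an object on the class. – jah

It seems that meta isn't the way to do this, or at least not with how I've structured everything. The reason is that I can only yield another request from 'parse' or 'parse_translation', and neither method can pass 'self.rawTranslations' in its completed state. If I do it in 'parse', it passes an empty list; if I do it in 'parse_translation', it calls my third (unwritten) method as many times as there are list entries, rather than once at the end. I think this has to do with the container living outside the methods. – jah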
