2017-05-07

I am using Scrapy to scrape the detail pages linked from a website's listing pages, and the markup differs slightly from one detail page to the next.

First detail page:

<div class="td-post-content"> 
    <p style="text-align: justify;"> 
     <strong>[ Karda Natam ]</strong> 
     <br> 
     <strong>ITANAGAR, May 6:</strong> Nacho, Taksing, Siyum and ... 
     <br> “Offices are without ... 
    </p> 
</div> 

Second detail page:

<div class="td-post-content"> 
    <p style="text-align: justify;"> 
     <strong>Guwahati, May 6 (PTI)</strong> Sarbananda Sonowal today ... 
     <br> “Books are a potent tool to create ... 
    </p> 
</div> 

Third detail page:

<div class="td-post-content"> 
    <h3 style="text-align: justify;"><strong>Flights Of Fantasy</strong></h3> 
    <p style="text-align: justify;"> 
     <strong>[ M Panging ]</strong> 
     <br> This state of denial ... 
    </p> 
</div> 

I am trying to parse the author and the news date from the detail pages:

class ArunachaltimesSpider(scrapy.Spider):
    ...
    ...

    def parse(self, response):
        urls = response.css("div.td-ss-main-content > div.td_module_16 > div.item-details > h3.entry-title > a::attr(href)").extract()
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_detail)

        next = response.xpath("// ...')]/@href").extract_first()
        if next:
            yield scrapy.Request(url=next, callback=self.parse)

    def parse_detail(self, response):
        strong_elements = response.css("div.td-ss-main-content").css("div.td-post-content").css("p > strong::text").extract()
        for strong in strong_elements:
            if ', ' in strong:
                news_date = strong.split(', ')[1].replace(":", "")
            elif '[ ' and ' ]' in strong:
                author = strong
            else:
                news_date = None
                author = None
        yield {
            'author': author,
            'news_date': news_date
        }

But I get this error:

UnboundLocalError: local variable 'author' referenced before assignment

What am I doing wrong here? Could you please help me get the author and the news date from each of these pages? Thanks.

Answer

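It looks like strong_elements is an empty list in your case, so the for loop never runs. You assign author only inside the for loop, but you then use it in the yield, where it has never been assigned because the loop body did not execute. The same error occurs when the loop does run but no element reaches the branch that assigns author, as on the second page, which has a dateline but no byline. Declare author (and news_date) as variables before the for loop.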
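A minimal sketch of that fix, assuming the byline and dateline formats shown in the three sample pages. The helper name extract_author_and_date is hypothetical and not part of the original spider. It works on a plain list of strings so it can be tried without Scrapy. It also tests both brackets explicitly, because `'[ ' and ' ]' in strong` only evaluates `' ]' in strong`, and it drops the else branch, which would otherwise wipe out a value found earlier in the loop.

```python
def extract_author_and_date(strong_texts):
    """Return (author, news_date) from the text of the <strong> elements."""
    # Assign defaults before the loop so both names exist even when the
    # list is empty or no element matches a branch.
    author = None
    news_date = None
    for text in strong_texts:
        text = text.strip()
        if text.startswith('[') and text.endswith(']'):
            # Byline such as "[ Karda Natam ]"
            author = text.strip('[] ')
        elif ', ' in text:
            # Dateline such as "ITANAGAR, May 6:"
            news_date = text.split(', ', 1)[1].replace(':', '')
    return author, news_date


# The three cases from the question, plus an empty result
print(extract_author_and_date(['[ Karda Natam ]', 'ITANAGAR, May 6:']))
print(extract_author_and_date(['Guwahati, May 6 (PTI)']))
print(extract_author_and_date(['[ M Panging ]']))
print(extract_author_and_date([]))
```

Inside parse_detail you could then call this helper on strong_elements and yield the two values. Note that the second page would still give 'May 6 (PTI)' as the date, so the agency suffix needs extra cleanup if you do not want it.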