2017-04-07

I built a crawler with Scrapy and wrote a script that crawls across many pages. Scrapy does not process all of the pages while crawling.

Unfortunately, not every run scrapes all of the pages. Some runs return every page, while others return only 23 or 180 (the result differs per URL).

import scrapy

class BotCrawl(scrapy.Spider):
    name = "crawl-bl2"
    start_urls = [
        'http://www.bukalapak.com/c/perawatan-kecantikan/perawatan-wajah?page=1&search%5Bsort_by%5D=last_relist_at%3Adesc&utf8=%E2%9C%93',
    ]

    def parse(self, response):
        for product_list in response.css("ul[class='products row-grid']"):
            for product in product_list.css('li'):
                yield {
                    'judul': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::text').extract(),
                    'penjual': product.css('h5[class=user__name] a::attr(href)').extract(),
                    'link': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::attr(href)').extract(),
                    'kota': product.css('div[class=user-city] a::text').extract(),
                    'harga': product.css('div[class=product-price]::attr(data-reduced-price)').extract()
                }

        # next page
        next_page_url = response.css("div.pagination > a[class=next_page]::attr(href)").extract_first()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))
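For reference, since the spider is shown as a standalone snippet, it could be run without a full Scrapy project through CrawlerProcess; a minimal sketch (the output file name products.json is made up for illustration, and FEED_FORMAT/FEED_URI are the feed-export settings of the Scrapy 1.x line this question used):

from scrapy.crawler import CrawlerProcess

# ... BotCrawl defined as above ...

process = CrawlerProcess(settings={
    'FEED_FORMAT': 'json',       # export scraped items as JSON
    'FEED_URI': 'products.json'  # hypothetical output path
})
process.crawl(BotCrawl)
process.start()  # blocks until the crawl finishes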

Is it that the HTTP requests are being blocked, or is there a mistake in my code?
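One way to separate those two possibilities is to slow the crawl down and let non-200 responses reach the callback so they show up in the log. A minimal sketch, assuming the site might be rate-limiting; the class name and spider name below are made up for illustration, while the settings are standard Scrapy settings:

import scrapy

class BotCrawlDebug(scrapy.Spider):
    # Hypothetical variant of the spider above, used only to surface blocking.
    name = "crawl-bl2-debug"
    custom_settings = {
        'DOWNLOAD_DELAY': 1.0,                       # pause between requests
        'AUTOTHROTTLE_ENABLED': True,                # back off when the server slows down
        'HTTPERROR_ALLOWED_CODES': [403, 429, 503],  # let error pages reach parse()
    }
    start_urls = [
        'http://www.bukalapak.com/c/perawatan-kecantikan/perawatan-wajah?page=1&search%5Bsort_by%5D=last_relist_at%3Adesc&utf8=%E2%9C%93',
    ]

    def parse(self, response):
        if response.status != 200:
            # A 403/429/503 here usually points at throttling or blocking,
            # not at a selector problem.
            self.logger.warning('Got %s for %s', response.status, response.url)
            return
        # ... item extraction and pagination as in the spider above ...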

Update: after editing the code following Granitosaurus's answer, it is still broken and returns a blank array:

import scrapy


class BotCrawl(scrapy.Spider):
    name = "crawl-bl2"
    start_urls = [
        'http://www.bukalapak.com/c/perawatan-kecantikan/perawatan-wajah?page=1&search%5Bsort_by%5D=last_relist_at%3Adesc&utf8=%E2%9C%93',
    ]

    def parse(self, response):
        products = response.css('article.product-display')
        for product in products:
            yield {
                'judul': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::text').extract(),
                'penjual': product.css('h5[class=user__name] a::attr(href)').extract(),
                'link': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::attr(href)').extract(),
                'kota': product.css('div[class=user-city] a::text').extract(),
                'harga': product.css('div[class=product-price]::attr(data-reduced-price)').extract()
            }

        # next page
        next_page_url = response.css("div.pagination > a[class=next_page]::attr(href)").extract_first()
        last_url = "/c/perawatan-kecantikan/perawatan-wajah?page=100&search%5Bsort_by%5D=last_relist_at%3Adesc&utf8=%E2%9C%93"
        if next_page_url is not last_url:
            yield scrapy.Request(response.urljoin(next_page_url), dont_filter=True)
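Note that `is not` is an identity check, not string equality, so the condition above is effectively always true for two distinct strings. For comparison, a sketch of a stop condition that reads the page number out of the URL instead of comparing whole URL strings; `should_follow` and `max_page` are made-up names, and the 100-page limit mirrors the hard-coded last_url above:

from urllib.parse import urlparse, parse_qs

def should_follow(next_page_url, max_page=100):
    # Illustrative helper: keep following "next page" links until the
    # `page` query parameter exceeds max_page.
    if not next_page_url:
        return False
    query = parse_qs(urlparse(next_page_url).query)
    page = int(query.get('page', ['1'])[0])
    return page <= max_page

Inside parse this would replace the last_url comparison, e.g. `if should_follow(next_page_url): yield scrapy.Request(response.urljoin(next_page_url), dont_filter=True)`.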

Thank you.

Answer


Your product XPath is a bit shaky. Try selecting the product articles directly instead; the site makes that easy to do with CSS selectors:

products = response.css('article.product-display')
for product in products:
    yield {
        'judul': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::text').extract(),
        'penjual': product.css('h5[class=user__name] a::attr(href)').extract(),
        'link': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::attr(href)').extract(),
        'kota': product.css('div[class=user-city] a::text').extract(),
        'harga': product.css('div[class=product-price]::attr(data-reduced-price)').extract()
    }

You can debug the response by inserting inspect_response:

def parse(self, response):
    products = response.css('article.product-display')
    if not products:
        from scrapy.shell import inspect_response
        inspect_response(response, self)
        # will open up python shell here where you can check `response` object
        # try `view(response)` to open it up in your browser and such.
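Once that shell opens, a few checks tend to narrow the problem down quickly (a sketch; the selectors mirror the ones used above):

response.status                                     # 200, or a 4xx/5xx from throttling?
len(response.css('article.product-display'))        # how many products the selector finds
response.css("div.pagination > a[class=next_page]::attr(href)").extract_first()  # is there a next-page link?
view(response)                                      # open the downloaded page in a browser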

It still cannot crawl all of the pages; it only gets as far as page 28. https://snag.gy/CpyAXP.jpg –


@RadenJohannesHeryoPriambodo Works for me. What happens on page 28? No products found? You can add a debugging breakpoint to see what is going on; see my edit. – Granitosaurus


I mean that the crawler stops at page 28; pages 1-27 work fine. @Granitosaurus –