How to crawl and scrape data at the same time?

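This is my first experience with web scraping, and I don't know whether I'm doing it well. What I want is to crawl and scrape data at the same time. How can I do that?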

Get all the links I will scrape
Store them in MongoDB
Visit them one by one to scrape their content

# Crawling: get all links to be scraped later on
from scrapy import Field, Item, Request, Selector, Spider


class Link(Item):
    # Item holding a single offer URL (definition assumed; not shown in the post)
    url = Field()


class LinkCrawler(Spider):
    name = "link"
    allowed_domains = ["website.com"]
    start_urls = ["https://www.website.com/offres?start=%s" % start for start in range(0, 10000, 20)]

    def parse(self, response):
        # Follow the link to the next results page, if there is one
        next_page = Selector(response).xpath('//li[@class="active"]/following-sibling::li[1]/a/@href').extract()

        if not not next_page:
            yield Request("https://" + next_page[0], callback=self.parse)

        # Collect every offer link on the current page
        links = Selector(response).xpath('//div[@class="row-fluid job-details pointer"]/div[@class="bloc-right"]/div[@class="row-fluid"]')

        items = []  # links found on this page
        for link in links:
            item = Link()
            url = response.urljoin(link.xpath('a/@href')[0].extract())
            item['url'] = url
            items.append(item)

        for item in items:
            yield item

# Scraping: get all the stored links on MongoDB and scrape them????

2017-07-13 geek-tech

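What exactly is your use case? Are you mainly interested in the links themselves or in the content of the pages they lead to? In other words, is there any reason to store the links in MongoDB first and scrape the pages later? If you really do need to store the links in MongoDB, it is best to use an item pipeline to store the items. The Scrapy documentation on item pipelines even includes an example of storing items in MongoDB. If you need something more sophisticated, take a look at the scrapy-mongodb package.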
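As a rough illustration of the item pipeline approach, here is a minimal sketch that upserts each scraped item into MongoDB, keyed by its URL. The class name MongoLinkPipeline, the "links" collection, and the MONGO_URI / MONGO_DATABASE settings with their defaults are assumptions made for this example, not something the question specifies. It relies on the pymongo package being installed.

```python
class MongoLinkPipeline:
    """Store each item in MongoDB, using its URL as the unique key."""

    def __init__(self, mongo_uri, mongo_db, collection_name="links"):
        # Connection details; the collection name is an assumed default.
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.collection_name = collection_name
        self.client = None
        self.collection = None

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook to build the pipeline from project settings.
        # The setting names and fallback values are assumptions.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "scraping"),
        )

    def open_spider(self, spider):
        # Imported here so the module can be loaded without pymongo installed.
        import pymongo

        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.mongo_db][self.collection_name]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # dict() works for both plain dicts and scrapy.Item instances.
        doc = dict(item)
        # Upsert so that re-crawling the same URL does not create duplicates.
        self.collection.update_one({"url": doc["url"]}, {"$set": doc}, upsert=True)
        return item
```

To enable a pipeline like this, its import path is added to the ITEM_PIPELINES setting, for example {"myproject.pipelines.MongoLinkPipeline": 300}, where myproject is a placeholder for the actual project name.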

Apart from that, here are a few comments on the actual code you posted:

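Instead of Selector(response).xpath(...), just use response.xpath(...).
If you only need the first extracted element from a selector, use extract_first() instead of extract() followed by indexing.
Don't use if not not next_page:, use if next_page:.
The second loop over items is not needed; yield each item directly inside the loop over links.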

2017-07-13 09:21:38

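Hey, thanks a lot. The website I'm scraping is an e-commerce site where people sell items, and once an item is sold they remove the listing. So, to find out which products sell quickly, I think I have to save the links so that I can check later whether they have been removed. Also, if it is possible to scrape the content of each link before storing that link in MongoDB, please tell me how.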

If the links to individual products follow some common pattern, you would be better off using CrawlSpider (https://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider) with appropriate rules.

Yes, they are individual products, but is there a tutorial out there? I want to visit every link and extract the data exposed there...
