How do I go to the next page in Scrapy?

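I want to use Scrapy to scrape the results from here. The problem is that not all of the classes appear on the page until the "load more results" tab is clicked.

The problem can be seen here:

My code looks like this: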

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
# ClasscentralItem is defined in the project's items module


class ClassCentralSpider(CrawlSpider):
    name = "class_central"
    allowed_domains = ["www.class-central.com"]
    start_urls = (
        'https://www.class-central.com/courses/recentlyAdded',
    )
    rules = (
        Rule(
            LinkExtractor(
                # allow=("index\d00\.html",),
                restrict_xpaths=('//div[@id="show-more-courses"]',)
            ),
            callback='parse',
            follow=True
        ),
    )

    def parse(self, response):
        x = response.xpath('//span[@class="course-name-text"]/text()').extract()
        item = ClasscentralItem()
        for y in x:
            item['name'] = y
            print(item['name'])

Source

2016-07-25 Yato

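So what does the URL of the second page look like? If it is something like www.website.com/Recently_Added/2, this would be a very simple fix. Or are you actually just trying to get the data that appears under "load more results"? – SAMO

That doesn't work. I don't know how to get the URL of page 2 or how to call [load next..] – Yato

Well, that was just an example. I was saying that if the URL changes in an obvious pattern, you can take advantage of it. OK, so you are just trying to get the results from the "load more results" tab – SAMO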

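The second page of this site appears to be generated through an AJAX call. If you look at the network tab of any browser inspection tool, you will see something like this: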

In this case, it seems to be retrieving a JSON file from https://www.class-central.com/maestro/courses/recentlyAdded?page=2&_=1469471093134

It looks like the URL parameter _=1469471093134 does nothing, so you can simply trim it away: https://www.class-central.com/maestro/courses/recentlyAdded?page=2
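If you would rather trim that parameter in code than by hand, a minimal sketch using only the standard library could look like the following. The helper name `strip_query_param` is made up for this illustration and is not part of Scrapy or w3lib.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def strip_query_param(url, name):
    """Return url with every occurrence of the query parameter `name` removed."""
    parts = urlsplit(url)
    # Keep every (key, value) pair except the one we want to drop.
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k != name]
    return urlunsplit(parts._replace(query=urlencode(kept)))


ajax_url = ('https://www.class-central.com/maestro/courses/recentlyAdded'
            '?page=2&_=1469471093134')
print(strip_query_param(ajax_url, '_'))
```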
The returned JSON contains the HTML code of the next page:

import json
from scrapy import Selector

# so you just need to load it up with
data = json.loads(response.text)
# and convert it to a scrapy selector
sel = Selector(text=data['table'])
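To see the idea without a running spider, here is a rough offline sketch that uses only the standard library. It swaps Scrapy's Selector for `html.parser`, so it is a stand-in for the technique and not the code you would put in the spider. The sample payload is invented; only the `table` key and the `course-name-text` class come from the site as described above.

```python
import json
from html.parser import HTMLParser


class CourseNameParser(HTMLParser):
    """Collects the text inside every <span> whose class includes course-name-text."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'span' and 'course-name-text' in classes:
            self._in_name = True
            self.names.append('')

    def handle_endtag(self, tag):
        # Simplification: assumes the name span contains no nested <span>.
        if tag == 'span':
            self._in_name = False

    def handle_data(self, data):
        if self._in_name:
            self.names[-1] += data


def course_names(json_body):
    """Decode the JSON body and pull course names out of the embedded HTML."""
    payload = json.loads(json_body)
    parser = CourseNameParser()
    parser.feed(payload['table'])
    parser.close()
    return [name.strip() for name in parser.names]


# Invented payload shaped like the response described above.
sample = json.dumps({
    'table': '<tr><td><span class="course-name-text">Intro to Scrapy</span></td></tr>'
             '<tr><td><span class="course-name-text"> Data Mining </span></td></tr>'
})
print(course_names(sample))
```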

To replicate this in your code, try something like this:

import json

from scrapy import Request, Selector
from w3lib.url import add_or_replace_parameter

# AJAX endpoint that serves the following pages as JSON
AJAX_URL = 'https://www.class-central.com/maestro/courses/recentlyAdded'

def parse(self, response):
    # check if response is json, if so convert to selector
    if response.meta.get('is_json', False):
        # convert the json to scrapy.Selector here for parsing
        sel = Selector(text=json.loads(response.text)['table'])
    else:
        sel = Selector(response)
    # parse page here for items
    for name in sel.xpath('//span[@class="course-name-text"]/text()').extract():
        item = ClasscentralItem()
        item['name'] = name
        print(item['name'])
        yield item
    # do next page
    # (untested assumption: the "show more" element also appears in the json html)
    next_page_el = sel.xpath("//div[@id='show-more-courses']")
    if next_page_el:  # there is next page
        next_page = response.meta.get('page', 1) + 1
        # make next page url
        url = add_or_replace_parameter(AJAX_URL, 'page', str(next_page))
        yield Request(url, self.parse, meta={'page': next_page, 'is_json': True})
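If it is unclear what the "make next page url" step produces, the sketch below rebuilds it with the standard library alone. It roughly mirrors what `add_or_replace_parameter` is used for here, but `set_query_param` and `next_page_url` are names invented for this example, and the base URL is the AJAX address quoted earlier.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

AJAX_URL = 'https://www.class-central.com/maestro/courses/recentlyAdded'


def set_query_param(url, name, value):
    """Return url with query parameter `name` set to `value` (added or replaced)."""
    parts = urlsplit(url)
    # Drop any existing value for `name`, then append the new one.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != name]
    query.append((name, str(value)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def next_page_url(current_page):
    """URL of the page after `current_page` on the AJAX endpoint."""
    return set_query_param(AJAX_URL, 'page', current_page + 1)


print(next_page_url(1))
print(set_query_param(AJAX_URL + '?page=2', 'page', 3))
```

Because the first HTML page counts as page 1, the first request to the AJAX endpoint asks for page=2, which matches the URL seen in the network tab.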

Source

2016-07-25 18:34:33 Granitosaurus

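Does response = json.loads(response.load) convert the response into a JSON selector? And shouldn't there be a final '}' in Request(url, self.parse, meta={'page': next_page, 'is_json': True})? – Yato

I edited the 'parse()' method in my answer to do your parsing work, but with pagination. I haven't tested the code, but I think you can fix any typos yourself if you find them :) – Granitosaurus

I was able to fix the other error too. Maybe. :D Thanks!!! – Yato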
