2017-07-14 102 views
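Wrong XPath in IMDB spider (Scrapy)

From the question "IMDB scrapy get all movie data", this XPath:

response.xpath("//*[@class='results']/tr/td[3]")

returns an empty list. I tried changing it to:

response.xpath("//*[contains(@class, 'chart full-width')]/tbody/tr")

without success.

Can anyone help? Thanks.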

Can you specify which links you are scraping when this problem occurs?

Sure, for example: http://www.imdb.com/search/title?year=year=1950,1950&title_type=feature&sort=moviemeter,asc

I'm not sure what you are trying to do here. But I checked the site, and there are no elements with the class **results** or **chart full-width**.

Answers

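I didn't have time to go through "IMDB scrapy get all movie data" thoroughly, but I got the gist of it. The problem statement is to get all the movie data from the given site. That involves two things. The first is to walk through all the pages that list the movies of a given year. The second is to get the link to each movie, and from there you do your own magic.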

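The problem you are running into is the XPath that gets the link to each movie. This is most likely due to a change in the site's structure (I didn't have time to verify what the differences might be). In any case, here are the XPaths you need.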


FIRST:

We take the div with class nav as a landmark and look underneath it for the link whose class is lister-page-next next-page.

response.xpath("//div[@class='nav']/div/a[@class='lister-page-next next-page']/@href").extract_first() 

This gives the link to the next page, or returns None on the last page (since the next-page tag does not exist there).
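
The "link or None" behaviour can be tried out without running a crawl. The sketch below is an assumption-heavy stand-in: it uses the standard library's xml.etree.ElementTree instead of Scrapy's selectors, the two HTML fragments are made up to mimic the structure described above, and next_page_href is a hypothetical helper. ElementTree only supports a subset of XPath, so the href is read with .get() rather than /@href.

```python
import xml.etree.ElementTree as ET

# Same path as in the answer, minus the trailing /@href
# (ElementTree's XPath subset cannot select attributes directly).
NEXT_XPATH = ".//div[@class='nav']/div/a[@class='lister-page-next next-page']"

# Made-up, well-formed fragments standing in for a middle page and the last page.
MIDDLE_PAGE = """
<html><body>
  <div class="nav">
    <div class="desc">
      <a class="lister-page-prev prev-page" href="?page=1">Previous</a>
      <a class="lister-page-next next-page" href="?page=3">Next</a>
    </div>
  </div>
</body></html>
"""

LAST_PAGE = """
<html><body>
  <div class="nav">
    <div class="desc">
      <a class="lister-page-prev prev-page" href="?page=2">Previous</a>
    </div>
  </div>
</body></html>
"""


def next_page_href(markup):
    """Return the href of the next-page link, or None if the page has none."""
    root = ET.fromstring(markup.strip())
    link = root.find(NEXT_XPATH)
    return link.get("href") if link is not None else None


print(next_page_href(MIDDLE_PAGE))
print(next_page_href(LAST_PAGE))
```

Note that @class='lister-page-next next-page' is an exact string match on the whole attribute, so it stops matching if the site reorders or adds class names.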


SECOND:

This is what the OP originally asked about.

# Get the containers that hold the title, etc.
# (named "containers" so the built-in list is not shadowed)
containers = response.xpath("//div[@class='lister-item-content']")

# From each container, extract the link to the movie page
paths = containers.xpath("h3[@class='lister-item-header']/a/@href").extract()

Now all you need to do is loop over each of these paths and request the page.
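
The extracted hrefs are relative, so each one has to be turned into an absolute URL before it can be requested. A minimal sketch of that step, using only the standard library: the sample paths are made up, and absolute_movie_urls is a hypothetical helper, not part of Scrapy.

```python
from urllib.parse import urljoin

IMDB_URL = "http://www.imdb.com"


def absolute_movie_urls(paths, base=IMDB_URL):
    """Resolve each relative movie path against the site root."""
    return [urljoin(base, path) for path in paths]


# Made-up examples of what the XPath above might return.
sample_paths = ["/title/tt0000001/", "/title/tt0000002/?ref_=adv_li_tt"]

for url in absolute_movie_urls(sample_paths):
    print(url)
```

Inside a spider callback, response.urljoin(path) does the same resolution against the current page's URL, and the result can be passed to scrapy.Request.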


Thanks for your answer. I ended up using your XPath like this:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from crawler.items import MovieItem

IMDB_URL = "http://imdb.com"


class IMDBSpider(CrawlSpider):
    name = 'imdb'
    # in order to move to the next page
    rules = (
        Rule(
            LinkExtractor(
                allow=(),
                restrict_xpaths=("//div[@class='nav']/div/a[@class='lister-page-next next-page']",),
            ),
            callback="parse_page",
            follow=True,
        ),
    )

    def __init__(self, start=None, end=None, *args, **kwargs):
        super(IMDBSpider, self).__init__(*args, **kwargs)
        self.start_year = int(start) if start else 1874
        self.end_year = int(end) if end else 2017

    # generate start_urls dynamically
    def start_requests(self):
        for year in range(self.start_year, self.end_year + 1):
            # movies are sorted by number of votes
            yield scrapy.Request('http://www.imdb.com/search/title?year={year},{year}&title_type=feature&sort=num_votes,desc'.format(year=year))

    def parse_page(self, response):
        content = response.xpath("//div[@class='lister-item-content']")
        # list of paths of movies in the current page
        paths = content.xpath("h3[@class='lister-item-header']/a/@href").extract()

        # all movies in this page
        for path in paths:
            item = MovieItem()
            item['MainPageUrl'] = IMDB_URL + path
            request = scrapy.Request(item['MainPageUrl'], callback=self.parse_movie_details)
            request.meta['item'] = item
            yield request

    # make sure that the start_urls are parsed as well
    parse_start_url = parse_page

    def parse_movie_details(self, response):
        pass  # lots of parsing....

scrapy crawl imdb -a start=<start-year> -a end=<end-year>
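
To see which pages a given start/end pair would queue, the URL generation from start_requests can be pulled out into a plain function. This is only a sketch: build_start_urls is a hypothetical name, and the defaults are copied from the spider above. Values passed with -a arrive as strings, which is why they go through int().

```python
SEARCH_URL = (
    "http://www.imdb.com/search/title"
    "?year={year},{year}&title_type=feature&sort=num_votes,desc"
)


def build_start_urls(start=None, end=None):
    """Return one search URL per year, mirroring the spider's start_requests."""
    start_year = int(start) if start else 1874
    end_year = int(end) if end else 2017
    return [SEARCH_URL.format(year=year) for year in range(start_year, end_year + 1)]


# Equivalent to: scrapy crawl imdb -a start=1950 -a end=1951
for url in build_start_urls(start="1950", end="1951"):
    print(url)
```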