2016-01-24

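Scraping a local XML file with Scrapy: start URL as a local file path

I want to crawl a local XML file located in my Downloads folder with Scrapy, using XPath to extract the relevant information.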

Using the Scrapy introduction as a guide, I get the following errors:

2016-01-24 12:38:53 [scrapy] DEBUG: Retrying <GET file://home/sayth/Downloads/20160123RAND0.xml> (failed 2 times): [Errno 2] No such file or directory: '/sayth/Downloads/20160123RAND0.xml' 
2016-01-24 12:38:53 [scrapy] DEBUG: Gave up retrying <GET file://home/sayth/Downloads/20160123RAND0.xml> (failed 3 times): [Errno 2] No such file or directory: '/sayth/Downloads/20160123RAND0.xml' 
2016-01-24 12:38:53 [scrapy] ERROR: Error downloading <GET file://home/sayth/Downloads/20160123RAND0.xml> 

I have tried several versions of the code below, but I cannot get the start URL to accept my file.

# -*- coding: utf-8 -*- 
import scrapy 


class MyxmlSpider(scrapy.Spider): 
    name = "myxml" 
    allowed_domains = ["file://home/sayth/Downloads"] 
    start_urls = (
        'http://www.file://home/sayth/Downloads/20160123RAND0.xml',
    )

    def parse(self, response):
        for file in response.xpath('//meeting'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_xml(self, response):
        yield {
            'name': response.xpath('//meeting/race').extract()
        }

Just to confirm that the file really is in that location:

[email protected] : ~/Downloads 
[0] % ls -a 
.                Building a Responsive Website with Bootstrap [Video].zip 
..                codemirror.zip 
1.1 Situation Of Long Term Gain.xls       Complete-Python-Bootcamp-master.zip 
2008 Racedata.xls            Cox Plate 2005.xls 
20160123RAND0.xml 

Answer

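Don't specify allowed_domains at all, and use three slashes after the protocol. With only two slashes, home is parsed as the host part of the URL and the path becomes /sayth/Downloads/20160123RAND0.xml, which is exactly the path the error message reports as missing. The start URL should be: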

start_urls = ["file:///home/sayth/Downloads/20160123RAND0.xml"]
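
To see why the number of slashes matters, here is a minimal sketch that uses only the Python standard library (not Scrapy itself) to show how each form of the URL is split into host and path. The variable names broken, fixed and start_url are illustrative, and the last step assumes a POSIX system where /home/sayth/Downloads is an absolute path.

```python
from pathlib import Path
from urllib.parse import urlparse

# Two slashes: "home" is taken as the host, so it is missing from the path.
broken = urlparse("file://home/sayth/Downloads/20160123RAND0.xml")
print(broken.netloc)  # home
print(broken.path)    # /sayth/Downloads/20160123RAND0.xml

# Three slashes: the host is empty and the full absolute path is kept.
fixed = urlparse("file:///home/sayth/Downloads/20160123RAND0.xml")
print(fixed.netloc)   # (empty string)
print(fixed.path)     # /home/sayth/Downloads/20160123RAND0.xml

# Assumption: POSIX system. as_uri() builds the file URL from an absolute
# path, which avoids counting slashes by hand.
start_url = Path("/home/sayth/Downloads/20160123RAND0.xml").as_uri()
print(start_url)      # file:///home/sayth/Downloads/20160123RAND0.xml
```

The path in the broken case matches the one in the "No such file or directory" error from the log, which suggests that the host part is simply dropped when the file is opened.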