
Scrapy Media Pipeline, files not downloading

I am new to Scrapy. I am trying to download files using the media pipeline, but when I run the spider, no files are stored in the folder.

Spider:

import scrapy
from scrapy import Request
from pagalworld.items import PagalworldItem


class JobsSpider(scrapy.Spider):
    name = "songs"
    allowed_domains = ["pagalworld.me"]
    start_urls = ['https://pagalworld.me/category/11598/Latest%20Bollywood%20Hindi%20Mp3%20Songs%20-%202017.html']

    def parse(self, response):
        urls = response.xpath('//div[@class="pageLinkList"]/ul/li/a/@href').extract()
        for link in urls:
            yield Request(link, callback=self.parse_page)

    def parse_page(self, response):
        songName = response.xpath('//li/b/a/@href').extract()
        for song in songName:
            yield Request(song, callback=self.parsing_link)

    def parsing_link(self, response):
        item = PagalworldItem()
        item['file_urls'] = response.xpath('//div[@class="menu_row"]/a[@class="touch"]/@href').extract()
        yield {"download_link": item['file_urls']}

Items file:

import scrapy


class PagalworldItem(scrapy.Item):
    file_urls = scrapy.Field()

Settings file:

BOT_NAME = 'pagalworld'

SPIDER_MODULES = ['pagalworld.spiders']
NEWSPIDER_MODULE = 'pagalworld.spiders'
ROBOTSTXT_OBEY = True
CONCURRENT_REQUESTS = 5
DOWNLOAD_DELAY = 3
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = '/tmp/media/'
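For reference, Scrapy's default FilesPipeline saves each download under FILES_STORE in a full/ subdirectory, named after the SHA-1 hash of the file URL. A minimal sketch of how to check whether anything was actually written (the song URL below is hypothetical):

import hashlib
import os

# Assumed values mirroring the settings above; the URL is hypothetical.
FILES_STORE = '/tmp/media/'
url = 'https://pagalworld.me/files/some-song.mp3'

# FilesPipeline stores each file as full/<sha1-of-url>.<ext> under FILES_STORE.
name = hashlib.sha1(url.encode('utf-8')).hexdigest() + '.mp3'
path = os.path.join(FILES_STORE, 'full', name)
print(path, os.path.exists(path))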

The output looks like this: [screenshot of the crawl output]


You haven't written any code to download/save the files. Go here and get some ideas: https://stackoverflow.com/questions/36135809/using-scrapy-to-to-find-and-download-pdf-files-from-a-website Hope this helps – Nabin

Answer

def parsing_link(self, response):
    item = PagalworldItem()
    item['file_urls'] = response.xpath('//div[@class="menu_row"]/a[@class="touch"]/@href').extract()
    yield {"download_link": item['file_urls']}

You are yielding:

yield {"download_link": ['http://someurl.com']} 

For Scrapy's media/files pipeline to work, you need to yield an item that contains a file_urls field. So try this:

def parsing_link(self, response):
    item = PagalworldItem()
    item['file_urls'] = response.xpath('//div[@class="menu_row"]/a[@class="touch"]/@href').extract()
    yield item
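Note that FilesPipeline also records the download results (local path, checksum, original URL) in an optional files field when the item declares one. A minimal items.py supporting both, per the Scrapy docs:

import scrapy


class PagalworldItem(scrapy.Item):
    file_urls = scrapy.Field()  # input: URLs for FilesPipeline to fetch
    files = scrapy.Field()      # output: populated by the pipeline after download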

Earlier I tried to parse this with a CrawlSpider, but it didn't work. You can see it here: https://stackoverflow.com/questions/45447451/scrapy-results-are-repeating – emon