2011-03-02 72 views

I'm trying to scrape a website and save the results, formatted as a CSV file. I'm able to save the file, but I have three questions about the output and its formatting:

  • All of the results end up in a single cell instead of on multiple lines. Did I forget to use a command when listing the items so that they appear one per row?

  • How can I remove the ['u... that appears in front of each result? (I looked into print, but couldn't see where to change the return.)

  • Is there a way to add text to certain item results? (For example, can I add "http://groupon.com" to the beginning of each deallink result?)

from scrapy.spider import BaseSpider 
from scrapy.selector import HtmlXPathSelector 

from deals.items import DealsItem 

class DealsSpider(BaseSpider): 
    name = "groupon.com" 
    allowed_domains = ["groupon.com"] 
    start_urls = [ 
     "http://www.groupon.com/chicago/all", 
     "http://www.groupon.com/new-york/all" 
    ] 

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="page_content clearfix"]')
        items = []
        for site in sites:
            item = DealsItem()
            item['deal1']     = site.select('//div[@class="c16_grid_8"]/a/@title').extract()
            item['deal1link'] = site.select('//div[@class="c16_grid_8"]/a/@href').extract()
            item['img1']      = site.select('//div[@class="c16_grid_8"]/a/img/@src').extract()
            item['deal2']     = site.select('//div[@class="c16_grid_8 last"]/a/@title').extract()
            item['deal2link'] = site.select('//div[@class="c16_grid_8 last"]/a/@href').extract()
            item['img2']      = site.select('//div[@class="c16_grid_8 last"]/a/img/@src').extract()
            items.append(item)
        return items

Answer


Edit: now that I understand the question better, shouldn't your parse() function look more like the code below? That is, yield one item at a time rather than returning a list. I suspect the list you are returning is what ends up serialized into a single, badly formatted cell.

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//div[@class="page_content clearfix"]')
    for site in sites:
        item = DealsItem()
        item['deal1']     = site.select('//div[@class="c16_grid_8"]/a/@title').extract()
        item['deal1link'] = site.select('//div[@class="c16_grid_8"]/a/@href').extract()
        item['img1']      = site.select('//div[@class="c16_grid_8"]/a/img/@src').extract()
        item['deal2']     = site.select('//div[@class="c16_grid_8 last"]/a/@title').extract()
        item['deal2link'] = site.select('//div[@class="c16_grid_8 last"]/a/@href').extract()
        item['img2']      = site.select('//div[@class="c16_grid_8 last"]/a/img/@src').extract()
        yield item
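
As for the second and third questions from the original post: the [u'... prefix appears because extract() returns a *list* of unicode strings, and exporting that list verbatim serializes its repr into the cell. A minimal sketch of one way to handle both issues, assuming a sample relative href such as '/deals/chicago-spa' (the actual values depend on the page):

```python
from urllib.parse import urljoin  # on Python 2 (as in the original post): from urlparse import urljoin

# extract() returns a list of unicode strings, e.g. [u'/deals/chicago-spa'];
# the "[u'..." seen in the CSV is that list's repr written into one cell.
extracted = [u'/deals/chicago-spa']

# Take the first match as a plain string instead of exporting the list itself.
deallink = extracted[0] if extracted else ''

# Prepend the site root; urljoin handles the leading slash correctly.
full_url = urljoin('http://groupon.com', deallink)
# full_url == 'http://groupon.com/deals/chicago-spa'
```

Assigning full_url (rather than the raw extract() list) to item['deal1link'] inside parse() would give one clean, absolute URL per field.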

Hi Martin - thanks for the quick reply. To get the csv file, the command I used is: scrapy crawl groupon.com --set FEED_URI=results.csv --set FEED_FORMAT=csv – William 2011-03-02 20:23:00


Ah, sorry - I hadn't realized Scrapy was supposed to do that part for you. In that case I'm not much help to you. – 2011-03-02 20:32:37


No worries, thanks for your effort! – William 2011-03-02 20:34:29