2013-04-11 76 views

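Scrapy: how to run a function after all pages have been crawled

I want to run a function only after every page has been crawled, because I need to format the XML output. Here is my pipeline code: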

from xml.sax.saxutils import escape

# Import paths as used by Scrapy releases of that period.
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher


class TutorialPipeline(object):

    def __init__(self):
        self.file = open('outs.xml', 'a')
        self.file.write('<?xml version=\'1.0\' encoding=\'utf-8\'?>')
        self.file.write('<Friends>')
        # Write the closing tag once the spider has finished.
        dispatcher.connect(self.spider_closed, signal=signals.spider_closed)

    def spider_closed(self, spider):
        self.file.write('</Friends>')
        self.file.close()

    def process_item(self, item, spider):
        self.file.write('<friend id="' + item['id'] + '">')
        self.file.write('<birthdate>' + item['birthdate'] + '</birthdate>')
        self.file.write('<user>' + item['user'] + '</user>')
        # Only the free-text review is escaped for XML.
        self.file.write('<review>' + escape(item['review'].encode('utf-8').strip()) + '</review>')
        self.file.write('</friend>')
        return item

Here is my spider, which shows how I crawl multiple pages:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from tutorial.items import FriendItem  # assumed project module path


class SavoySpider(BaseSpider):
    # identifier of the spider
    name = "friend"
    count = 0
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/biz/social/"
    ]

    def start_requests(self):
        # Page through the listing: start=0, 40, 80, ..., 960.
        for i in range(0, 1000, 40):
            yield self.make_requests_from_url("http://www.example.com/biz/social/?start=%d" % i)

    def parse(self, response):
        # Turn <br /> tags into newlines before extracting text.
        response = response.replace(body=response.body.replace('<br />', '\n'))
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = FriendItem()
            self.count += 1
            item['id'] = str(self.count)
            item['birthdate'] = str(site.select('.//div/div/meta[@itemprop="birthdate"]/@content').extract()[0])
            item['user'] = site.select('h4/span/text()').extract()[0]
            item['review'] = ''.join(site.select('.//div[@class="media-friend"]/p/text()').extract())
            items.append(item)
        return items

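The problem is that when I use the pipeline to produce my custom XML format, the output for each crawled page is appended after the output of the previous pages as a separate document. The result looks like this: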

<?xml version="1.0" encoding="utf-8"?> 
<Friends> 
    <friend id = "1"> 
    <name>Name1</name> 
    <birthdate>1988-04-03</birthdate> 
    <review>txt............</review> 
    </friend> 
    ..... 
</Friends> 
<?xml version="1.0" encoding="utf-8"?> 
<Friends> 
    <friend id = "40"> 
    <name>Name41</name> 
    <birthdate>1988-04-13</birthdate> 
    <review>txt............</review> 
    </friend> 
    ..... 
</Friends> 
<?xml version="1.0" encoding="utf-8"?> 
<Friends> 
    <friend id = "81"> 
    <name>Name81</name> 
    <birthdate>1988-04-23</birthdate> 
    <review>txt............</review> 
    </friend> 
    ..... 
</Friends> 

Can anyone help?


What are you trying to achieve? Do you want to write the output to different files? – 2016-01-29 13:01:27

Answers
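
One possible approach, offered as a minimal sketch rather than a confirmed fix: it assumes the repeated `<?xml ...?><Friends>` blocks come from opening `outs.xml` in append mode (`'a'`), so that each run adds a new document after the old one. The question does not confirm that cause. `open_spider` and `close_spider` are the method names Scrapy calls on an item pipeline when a spider starts and finishes; the class name `FriendsXmlPipeline` and the `path` parameter are made up for illustration, and the code is written for Python 3.

```python
from xml.sax.saxutils import escape, quoteattr


class FriendsXmlPipeline(object):
    """Hypothetical pipeline that writes one XML document per crawl."""

    def __init__(self, path='outs.xml'):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # 'w' truncates whatever an earlier run left behind; 'a' would
        # append a second header and root element after the old ones.
        self.file = open(self.path, 'w', encoding='utf-8')
        self.file.write("<?xml version='1.0' encoding='utf-8'?>\n<Friends>\n")

    def close_spider(self, spider):
        # Runs once, after all pages have been processed.
        self.file.write('</Friends>\n')
        self.file.close()

    def process_item(self, item, spider):
        # Escape every value so '&', '<' and '>' cannot break the XML.
        self.file.write('<friend id=%s>' % quoteattr(item['id']))
        self.file.write('<birthdate>%s</birthdate>' % escape(item['birthdate']))
        self.file.write('<user>%s</user>' % escape(item['user']))
        self.file.write('<review>%s</review>' % escape(item['review'].strip()))
        self.file.write('</friend>\n')
        return item
```

Because the header is written when the spider opens and the closing tag when it closes, the file holds a single `<Friends>` root no matter how many pages are crawled or how often the crawl is repeated.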