2011-04-30 89 views
41
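Python DictWriter writing UTF-8 encoded CSV files

  1. I have a list of dictionaries containing unicode strings.
  2. csv.DictWriter can write a list of dictionaries to a CSV file.
  3. I want the CSV file to be encoded in UTF-8.
  4. The csv module cannot handle converting unicode strings to UTF-8.
  5. The csv module documentation has an example for converting everything to UTF-8: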

    def utf_8_encoder(unicode_csv_data):
        for line in unicode_csv_data:
            yield line.encode('utf-8')
    
  6. It also has a UnicodeWriter class.

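But... how do I make DictWriter work with these? Wouldn't they have to inject themselves into the middle of it, catching the disassembled dictionaries and encoding them before they are written to the file? I don't get it.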

Answers

71

If you are using Python 2.7 or later, use a dict comprehension to remap the dictionary to UTF-8 before passing it to DictWriter:

# coding: utf-8 
import csv 
D = {'name':u'马克','pinyin':u'mǎkè'} 
f = open('out.csv','wb') 
f.write(u'\ufeff'.encode('utf8')) # BOM (optional...Excel needs it to open UTF-8 file properly) 
w = csv.DictWriter(f,sorted(D.keys())) 
w.writeheader() 
w.writerow({k:v.encode('utf8') for k,v in D.items()}) 
f.close() 

You can use this idea to turn UnicodeWriter into a DictUnicodeWriter:

# coding: utf-8 
import csv 
import cStringIO 
import codecs 

class DictUnicodeWriter(object):

    def __init__(self, f, fieldnames, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.DictWriter(self.queue, fieldnames, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, D):
        self.writer.writerow({k:v.encode("utf-8") for k,v in D.items()})
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for D in rows:
            self.writerow(D)

    def writeheader(self):
        self.writer.writeheader()

D1 = {'name':u'马克','pinyin':u'Mǎkè'}
D2 = {'name':u'美国','pinyin':u'Měiguó'}
f = open('out.csv','wb')
f.write(u'\ufeff'.encode('utf8')) # BOM (optional...Excel needs it to open UTF-8 file properly)
w = DictUnicodeWriter(f,sorted(D1.keys())) # field names come from D1 (no D is defined here)
w.writeheader()
w.writerows([D1,D2])
f.close()
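
The BOM written in both snippets above is simply the UTF-8 encoding of U+FEFF. As a small aside, the standard library exposes the same bytes as codecs.BOM_UTF8, and the 'utf-8-sig' codec prepends the BOM when encoding and strips it when decoding. A minimal sketch of that relationship (the sample header line is made up for illustration):

```python
# coding: utf-8
import codecs

# The three bytes written by f.write(u'\ufeff'.encode('utf8')) above.
bom = u'\ufeff'.encode('utf-8')
same_as_constant = (bom == codecs.BOM_UTF8)

# 'utf-8-sig' prepends the BOM when encoding and removes it when decoding.
sample = u'name,pinyin'  # illustrative header line, not from the answer
encoded = sample.encode('utf-8-sig')
decoded = encoded.decode('utf-8-sig')
```

This only illustrates what the BOM is; whether a given spreadsheet program needs it is something to check against that program.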
+0

I think downgrading to Python(x,y) 2.6.6.0 would make things simpler. :) – endolith 2011-04-30 01:50:41

+9

@endolith: on Python 2.6 you can use `dict((k, v.encode('utf-8') if isinstance(v, unicode) else v) for k, v in D.iteritems())` instead of the dict comprehension. – jfs 2011-04-30 05:37:38

+4

The `if isinstance(v, unicode)` part is essential! – reubano 2014-03-06 07:42:43
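
The point made in the comments above can be sketched as a small helper that encodes only text values and leaves numbers, None and other objects alone. The name encode_values and the sample row are hypothetical, chosen for illustration. The helper is aimed at Python 2, where the csv module expects byte strings; `type(u'')` is used so the snippet also runs on Python 3, where this encoding step is not needed.

```python
# coding: utf-8

# type(u'') is `unicode` on Python 2 and `str` on Python 3.
text_type = type(u'')

def encode_values(row, encoding='utf-8'):
    """Return a copy of `row` with text values encoded; other values untouched."""
    return dict(
        (key, value.encode(encoding) if isinstance(value, text_type) else value)
        for key, value in row.items()
    )

# Illustrative row mixing text with non-text values.
row = {'id': 7, 'name': u'马克', 'score': None}
encoded = encode_values(row)
```

On Python 2 the result could then be passed to DictWriter.writerow(); without the isinstance check, the integer and None values would raise AttributeError because they have no encode method.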

2

When you hook csv.writer up to your content, the idea is to pass the content through utf_8_encoder, since it gives you (UTF-8) encoded content.

6

You can use a proxy class that encodes the dictionary values as needed, like this:

# -*- coding: utf-8 -*- 
import csv 
d = {'a':123,'b':456, 'c':u'Non-ASCII: проверка'} 

class DictUnicodeProxy(object): 
    def __init__(self, d):
        self.d = d
    def __iter__(self):
        return self.d.__iter__()
    def get(self, item, default=None):
        i = self.d.get(item, default)
        if isinstance(i, unicode):
            return i.encode('utf-8')
        return i

with open('some.csv', 'wb') as f: 
    writer = csv.DictWriter(f, ['a', 'b', 'c']) 
    writer.writerow(DictUnicodeProxy(d)) 
14

You can convert the values to UTF-8 on the fly, as you pass the dictionary to DictWriter.writerow(). For example:

import csv 

rows = [ 
    {'name': u'Anton\xedn Dvo\u0159\xe1k','country': u'\u010cesko'}, 
    {'name': u'Bj\xf6rk Gu\xf0mundsd\xf3ttir', 'country': u'\xcdsland'}, 
    {'name': u'S\xf8ren Kierkeg\xe5rd', 'country': u'Danmark'} 
    ] 

# implement this wrapper on 2.6 or lower if you need to output a header 
class DictWriterEx(csv.DictWriter): 
    def writeheader(self):
        header = dict(zip(self.fieldnames, self.fieldnames))
        self.writerow(header)

out = open('foo.csv', 'wb') 
writer = DictWriterEx(out, fieldnames=['name','country']) 
# DictWriter.writeheader() was added in 2.7 (use class above for <= 2.6) 
writer.writeheader() 
for row in rows: 
    writer.writerow(dict((k, v.encode('utf-8')) for k, v in row.iteritems())) 
out.close() 

Output in foo.csv:

name,country 
Antonín Dvořák,Česko 
Björk Guðmundsdóttir,Ísland 
Søren Kierkegård,Danmark 
+0

Nice one. I like implementing the writer function as a one-liner. – shahjapan 2013-02-11 15:04:45

+6

`writer.writerow(dict((k, v.encode('utf-8') if type(v) is unicode else v) for k, v in row.iteritems()))` encodes only the unicode values, since int and list values have no encode method. – 2014-11-06 02:12:28

1

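My solution is a bit different. While all the solutions above focus on making the dictionaries compatible, mine makes DictWriter itself compatible with Unicode. This approach is even suggested in the Python documentation (1).

The classes UTF8Recoder, UnicodeReader and UnicodeWriter are taken from the Python documentation. UnicodeWriter.writerow has also been changed slightly.

Use it like a regular DictWriter/DictReader.

Here is the code: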

import csv, codecs, cStringIO 

class UTF8Recoder:
    """
    Iterator that reads an encoded stream and reencodes the input to UTF-8
    """
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)

    def __iter__(self):
        return self

    def next(self):
        return self.reader.next().encode("utf-8")

class UnicodeReader:
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)

    def next(self):
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]

    def __iter__(self):
        return self

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([unicode(s).encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

class UnicodeDictWriter(csv.DictWriter, object):
    def __init__(self, f, fieldnames, restval="", extrasaction="raise", dialect="excel", *args, **kwds):
        # Forward the caller's arguments instead of re-passing the default literals
        super(UnicodeDictWriter, self).__init__(f, fieldnames, restval, extrasaction, dialect, *args, **kwds)
        # Swap the plain csv writer for the Unicode-aware one
        self.writer = UnicodeWriter(f, dialect, **kwds)
31

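There is a simple workaround using the wonderful UnicodeCSV module. Once you have it, just change the line

import csv

to

import unicodecsv as csv

and it automatically starts playing nicely with UTF-8.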

Note: switching to Python 3 also solves this problem (thanks to jamescampbell for the tip). It is something you should do anyway.
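
For reference, here is a minimal sketch of what the Python 3 route can look like, assuming the file is opened in text mode with an explicit encoding. The function name write_dicts, the sample rows and the temporary path are hypothetical. The 'utf-8-sig' encoding is used so that a BOM is written for Excel; plain 'utf-8' would omit it.

```python
import csv
import os
import tempfile

def write_dicts(path, rows, fieldnames):
    """Write dict rows to a UTF-8 CSV file (Python 3 only)."""
    # newline='' lets the csv module control line endings itself;
    # 'utf-8-sig' writes a BOM at the start of the file.
    with open(path, 'w', encoding='utf-8-sig', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

# Illustrative data and a throwaway output path.
rows = [
    {'name': '马克', 'pinyin': 'Mǎkè'},
    {'name': '美国', 'pinyin': 'Měiguó'},
]
path = os.path.join(tempfile.mkdtemp(), 'out.csv')
write_dicts(path, rows, ['name', 'pinyin'])

# Read the file back to check that the text survives the round trip.
with open(path, encoding='utf-8-sig', newline='') as f:
    round_trip = [dict(r) for r in csv.DictReader(f)]
```

No per-value encoding is needed here because Python 3 strings are already Unicode and the file object does the encoding.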

+4

omfg finally - what a nightmare this was until now – 2016-06-18 07:24:53

+3

This should be the accepted answer - so simple, and it works like a charm – 2016-10-23 16:38:04

+1

You no longer need to do this in Python 3.x – jamescampbell 2017-12-15 17:46:25