Python - Finding unicode/ascii problems

I am using csv.reader to pull information from a very long sheet. I am doing work on that data set, and then I am using the xlwt package to give me a workable Excel file.

However, I get this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 34: ordinal not in range(128)

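My question to you all is: how can I find exactly where the error is in my data set? Also, is there some code I could write that would look through my data set and find where the issues lie (some data sets run without the above error and others have problems)?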

2010-05-02 user330739

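The answer is actually quite simple: as soon as you read your data from your file, convert it to unicode using the encoding of your file, and handle the UnicodeDecodeError exception: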

try:
    # decode using utf-8 (use ascii if you want)
    unicode_data = str_data.decode("utf-8")
except UnicodeDecodeError as e:
    # e.start is the offset of the first byte that could not be decoded
    print "The error is there, at byte offset %d!" % e.start

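This will save you a lot of trouble: you won't have to worry about multibyte character encodings, and external libraries (including xlwt) will just do the right thing when they need to write the data.

Python 3.0 will make it mandatory to specify the encoding of a string, so it's a good idea to do it now.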
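
For reference, here is a minimal Python 3 sketch of the same idea. `UnicodeDecodeError` carries `start`, `end` and `object` attributes, so the handler can report exactly which bytes failed. The helper name `find_decode_error` and the sample bytes are made up for illustration; they are not part of the original question.

```python
def find_decode_error(raw, encoding="utf-8"):
    """Return None if raw decodes cleanly, else (offset, offending_bytes)."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as e:
        # e.start and e.end delimit the undecodable bytes inside e.object
        return e.start, e.object[e.start:e.end]
    return None


# Hypothetical sample: byte 0x92 is not valid ASCII
sample = b"It\x92s a test"
print(find_decode_error(sample, "ascii"))
print(find_decode_error(b"plain text", "ascii"))
```

Note that this only reports the first failure. To list every offending byte you would have to scan the data yourself.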


2010-05-02 10:26:14 BatchyX

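The csv module doesn't support unicode and null characters. You might be able to replace them by doing something like this, though (replace 'utf-8' with the encoding your CSV data is actually in):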

import codecs
import csv

class AsciiFile:
    def __init__(self, path):
        self.f = codecs.open(path, 'rb', 'utf-8')

    def close(self):
        self.f.close()

    def __iter__(self):
        for line in self.f:
            # 'replace' for unicode characters -> ?, 'ignore' to ignore them
            y = line.encode('ascii', 'replace')
            y = y.replace('\0', '?') # Can't handle null characters!
            yield y

f = AsciiFile(PATH)
r = csv.reader(f)
...
f.close()
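
On Python 3 the csv module works on text, so a wrapper like this is not needed there. Below is a minimal sketch, assuming the file is UTF-8 (swap in the real encoding). The function name `read_rows` and the temporary file are only there for illustration. Passing `errors="replace"` turns undecodable bytes into U+FFFD instead of raising.

```python
import csv
import os
import tempfile


def read_rows(path, encoding="utf-8"):
    """Read a CSV file as text, replacing bytes that do not decode."""
    # newline="" is what the csv docs recommend for files given to csv.reader
    with open(path, encoding=encoding, errors="replace", newline="") as f:
        return list(csv.reader(f))


# Illustrative data: the second row contains the stray byte 0x92
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"name,note\r\nBob,it\x92s fine\r\n")
rows = read_rows(tmp.name)
os.remove(tmp.name)
print(rows)
```

Replacement hides the problem rather than fixing it, so it is mainly useful for getting a first look at the data.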

If you want to find the positions of the characters that can't be handled by the csv module, you could do something like:

import codecs

lineno = 0
f = codecs.open(PATH, 'rb', 'utf-8')
for line in f:
    for x, c in enumerate(line):
        if not c.encode('ascii', 'ignore') or c == '\0':
            print "Character ordinal %s line %s character %s is unicode or null!" % (ord(c), lineno, x)
    lineno += 1
f.close()
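
If the encoding is not known yet, the same search can be done on raw bytes, which never fails to "decode". This is a Python 3 sketch under that assumption. The function name `non_ascii_positions` and the sample data are hypothetical, and the column it reports is a 0-based byte offset within the line, not a character index.

```python
def non_ascii_positions(raw):
    """Yield (line_no, column, byte_value) for each non-ASCII or NUL byte."""
    for line_no, line in enumerate(raw.splitlines(), start=1):
        for column, byte in enumerate(line):
            # iterating over a bytes object yields ints in Python 3
            if byte >= 0x80 or byte == 0:
                yield line_no, column, byte


# Made-up sample: line 2 has byte 0x92, line 3 has a NUL byte
data = b"a,b\nIt\x92s,ok\nx\x00y,z\n"
for hit in non_ascii_positions(data):
    print(hit)
```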

Or again, you could use this CSV opener that I wrote, which can handle Unicode characters:

import codecs 

def OpenCSV(Path, Encoding, Delims, StartAtRow, Qualifier, Errors):
    infile = codecs.open(Path, "rb", Encoding, errors=Errors)
    for Line in infile:
        Line = Line.strip('\r\n')
        if (StartAtRow - 1) and StartAtRow > 0: StartAtRow -= 1
        elif Qualifier != '(None)':
            # Take a note of the chars 'before' just
            # in case of excel-style """ quoting.
            cB41 = ''; cB42 = ''
            L = ['']
            qMode = False
            for c in Line:
                if c==Qualifier and c==cB41==cB42 and qMode:
                    # Triple qualifiers, so allow it with one
                    L[-1] = L[-1][:-2]
                    L[-1] += c
                elif c==Qualifier:
                    # A qualifier, so reverse qual mode
                    qMode = not qMode
                elif c in Delims and not qMode:
                    # Not in qual mode and delim
                    L.append('')
                else:
                    # Nothing to see here, move along
                    L[-1] += c
                cB42 = cB41
                cB41 = c
            yield L
        else:
            # There aren't any qualifiers.
            cB41 = ''; cB42 = ''
            L = ['']
            for c in Line:
                cB42 = cB41; cB41 = c
                if c in Delims:
                    # Delim
                    L.append('')
                else:
                    # Nothing to see here, move along
                    L[-1] += c
            yield L

for listItem in OpenCSV(PATH, Encoding='utf-8', Delims=[','], StartAtRow=0, Qualifier='"', Errors='replace'):
    ...


2010-05-02 10:13:33 cryo

You can refer to the code snippets in the question below to get a CSV reader with unicode encoding support:

General Unicode/UTF-8 support for csv files in Python 2.6


2010-05-02 10:48:12 fqsxr

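Please give the full traceback that you got along with the error message. When we know where you are getting the error (reading the CSV file, "doing work on that data set", or writing an XLS file using xlwt), we can give a focused answer.

It is very possible that your input data is not all plain old ASCII. What produces it, and in what encoding?

To find where the problems (not necessarily errors) are, try a little script like this (untested):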

import sys, glob
for pattern in sys.argv[1:]:
    for filepath in glob.glob(pattern):
        for linex, line in enumerate(open(filepath, 'r')):
            if any(c >= '\x80' for c in line):
                print "Non-ASCII in line %d of file %r" % (linex+1, filepath)
                print repr(line)

It would be useful if you showed some samples of the "bad" lines that you find, so that we can judge what the encoding could be.
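
As a rough aid for judging the encoding, you could decode a bad line with a few candidate encodings and compare the results by eye. This is a Python 3 sketch; the function name `try_encodings`, the candidate list, and the sample line are assumptions made for illustration. In Windows-1252 (cp1252), byte 0x92 is the right single quotation mark, so a curly apostrophe showing up there is a strong hint. latin-1 accepts every byte, so success with it proves nothing on its own.

```python
def try_encodings(raw, candidates=("ascii", "utf-8", "cp1252", "latin-1")):
    """Map each candidate encoding to the decoded text, or None on failure."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


bad_line = b"It\x92s here"  # made-up sample line
for name, text in try_encodings(bad_line).items():
    print(name, repr(text))
```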

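I'm curious about using "csv.reader to pull information from a very long sheet": what kind of "sheet"? Do you mean that you are saving an XLS file as CSV and then reading the CSV file? If so, you could use xlrd to read directly from the input XLS file, getting unicode text which you can give straight to xlwt, avoiding any encode/decode problems.

Have you worked through the tutorial from the python-excel.org site?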


2010-05-02 11:43:08
