Python - Remove accents from all files in a folder

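I want to remove all accents from all the source-code files in a folder. I have already managed to build the list of files; the problem is that when I try to normalize with unicodedata I get this error:

Traceback (most recent call last):
  File "/usr/lib/gedit-2/plugins/pythonconsole/console.py", line 336, in __run
    exec command in self.namespace
  File "", line 2, in
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf3 in position 25: invalid continuation byte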

if options.remove_nonascii: 
    nERROR = 0 
    print _("# Removing all acentuation from coding files in %s") % (options.folder) 
    exts = ('.f90', '.f', '.cpp', '.c', '.hpp', '.h', '.py'); files=set() 
    for dirpath, dirnames, filenames in os.walk(options.folder): 
     for filename in (f for f in filenames if f.endswith(exts)): 
      files.add(os.path.join(dirpath,filename)) 
    for i in range(len(files)): 
     f = files.pop() ; 
     os.rename(f,f+'.BACK') 
     with open(f,'w') as File: 
      for line in open(f+'.BACK').readlines(): 
       try: 
        newLine = unicodedata.normalize('NFKD',unicode(line)).encode('ascii','ignore') 
        File.write(newLine) 
       except UnicodeDecodeError: 
        nERROR +=1 
        print "ERROR n %i - Could not remove from Line: %i" % (nERROR,i) 
        newLine = line 
        File.write(newLine)

Source

2011-02-08 canesin

It looks like the files may be encoded with the cp1252 codec:

In [18]: print('\xf3'.decode('cp1252')) 
ó

unicode(line) fails because unicode tries to decode line with the utf-8 codec instead, hence the error UnicodeDecodeError: 'utf8' codec can't decode....
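The failure is easy to reproduce in isolation. The sketch below uses Python 3 syntax (an assumption; the code in this thread is Python 2) and a made-up sample byte string that contains 0xf3:

```python
# Hypothetical sample: the word "funcion" with an accented o, encoded as cp1252,
# so the accented letter is the single byte 0xf3.
raw = b'funci\xf3n'

try:
    raw.decode('utf-8')
    utf8_error = None
except UnicodeDecodeError as exc:
    # In UTF-8, 0xf3 announces a multi-byte sequence, but the next byte ('n')
    # is not a continuation byte, so decoding stops here.
    utf8_error = exc

# The same bytes decode cleanly as cp1252.
as_cp1252 = raw.decode('cp1252')

print(utf8_error)
print(len(as_cp1252))
```

The UTF-8 attempt fails at the offset of the 0xf3 byte, while the cp1252 attempt yields a seven-character string.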

You could try decoding line with cp1252 first and, if that fails, try utf-8:

if options.remove_nonascii: 
    nERROR = 0 
    print _("# Removing all acentuation from coding files in %s") % (options.folder) 
    exts = ('.f90', '.f', '.cpp', '.c', '.hpp', '.h', '.py'); files=set() 
    for dirpath, dirnames, filenames in os.walk(options.folder): 
     for filename in (f for f in filenames if f.endswith(exts)): 
      files.add(os.path.join(dirpath,filename)) 
    for i,f in enumerate(files): 
     os.rename(f,f+'.BACK') 
     with open(f,'w') as fout: 
      with open(f+'.BACK','r') as fin: 
       for line in fin: 
        try: 
         try: 
          line=line.decode('cp1252') 
         except UnicodeDecodeError: 
          line=line.decode('utf-8') 
          # If this still raises an UnicodeDecodeError, let the outer 
          # except block handle it 
         newLine = unicodedata.normalize('NFKD',line).encode('ascii','ignore') 
         fout.write(newLine) 
        except UnicodeDecodeError: 
         nERROR +=1 
         print "ERROR n %i - Could not remove from Line: %i" % (nERROR,i) 
         newLine = line 
         fout.write(newLine)
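One caveat about the order of the two attempts: cp1252 assigns a character to almost every byte value, so the cp1252 attempt rarely fails and the utf-8 fallback is rarely reached. A line that really is UTF-8 would then be decoded into the wrong characters without any error. A possible variation, sketched here in Python 3 syntax, tries UTF-8 first and falls back to cp1252. The helper name `decode_line` is hypothetical and not part of the answer above:

```python
def decode_line(raw):
    """Decode bytes as UTF-8 when valid, otherwise fall back to cp1252.

    Hypothetical helper for illustration only.
    """
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # cp1252 leaves a handful of bytes undefined (0x81, for example),
        # so this can still raise UnicodeDecodeError for the caller to handle.
        return raw.decode('cp1252')


# The same accented word, stored in the two encodings.
from_cp1252 = decode_line(b'funci\xf3n')
from_utf8 = decode_line(b'funci\xc3\xb3n')
print(from_cp1252 == from_utf8)
```

This order works because bytes that are not valid UTF-8 are rejected outright, whereas valid UTF-8 bytes are almost always also accepted by cp1252.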

By the way,

unicodedata.normalize('NFKD',line).encode('ascii','ignore')

is a bit dangerous. For example, it removes u'ß' and some quotation marks entirely:

In [23]: unicodedata.normalize('NFKD',u'ß').encode('ascii','ignore') 
Out[23]: '' 

In [24]: unicodedata.normalize('NFKD',u'‘’“”').encode('ascii','ignore') 
Out[24]: ''

If that is a problem, use the unidecode module:

In [25]: import unidecode 
In [28]: print(unidecode.unidecode(u'‘’“”ß')) 
''""ss
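For reference, those characters vanish because NFKD only splits characters that have a decomposition: an accented letter becomes a base letter plus a combining mark, and the ASCII encoder then discards the mark. ß and the curly quotes have no decomposition, so they are discarded whole. If adding a dependency is not wanted, one standard-library alternative is to drop only the combining marks. This is a minimal sketch in Python 3 syntax; `remove_combining_marks` is a hypothetical name, and unlike unidecode the result may still contain non-ASCII characters:

```python
import unicodedata


def remove_combining_marks(text):
    """Strip accents but keep characters that have no ASCII base letter.

    Hypothetical helper for illustration only.
    """
    decomposed = unicodedata.normalize('NFKD', text)
    # unicodedata.combining() returns 0 for ordinary characters and a
    # non-zero class for combining marks such as U+0301 (acute accent).
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


sample = 'funci\u00f3n \u00df \u201cquoted\u201d'
cleaned = remove_combining_marks(sample)
print(cleaned.encode('unicode_escape').decode('ascii'))
```

Here the accented o loses its accent, while ß and the curly quotes pass through unchanged.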

Source

2011-02-08 16:33:38 unutbu

Thanks.. using unidecode solved it! – canesin 2011-02-08 17:18:18

You may want to specify the encoding when calling unicode(line), for example unicode(line, 'utf-8').

If you don't know the encoding, sys.getfilesystemencoding() may be your friend.
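The same idea can be applied to a whole file by naming the encoding when the file is opened. The sketch below is Python 3, where `open` takes an `encoding` argument (in Python 2, `io.open` accepts the same argument). The function name and the cp1252 default are assumptions for illustration; the right encoding depends on how the files were actually written, and sys.getfilesystemencoding() reports the encoding used for file names, which need not match file contents.

```python
import unicodedata


def strip_accents_in_file(path, encoding='cp1252'):
    """Rewrite the file at `path` as plain ASCII and return the new text.

    Hypothetical helper; the default encoding is an assumption.
    """
    # newline='' keeps the original line endings untouched on read and write.
    with open(path, 'r', encoding=encoding, newline='') as fin:
        text = fin.read()
    # Same NFKD + ASCII approach as in the thread, so characters without an
    # ASCII base letter (ß, curly quotes) are dropped.
    stripped = (unicodedata.normalize('NFKD', text)
                .encode('ascii', 'ignore')
                .decode('ascii'))
    with open(path, 'w', encoding='ascii', newline='') as fout:
        fout.write(stripped)
    return stripped
```

This version overwrites the file in place and makes no backup, so keeping the `.BACK` rename from the original script alongside it would be a sensible precaution.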

Source

2011-02-08 16:12:07 Vince
