python findall, regular expressions, unicode

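I am trying to write a Python script that searches a directory tree, lists all .flac files, derives the Artist, Album and Title from the respective dir/subdir/filename, and writes them to a file. The code works fine until it hits a Unicode character. Here is the code: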

import os, glob, re 

def scandirs(path):
    for currentFile in glob.glob(os.path.join(path, '*')):
        if os.path.isdir(currentFile):
            scandirs(currentFile)
        if os.path.splitext(currentFile)[1] == ".flac":
            rpath = os.path.relpath(currentFile)
            print "**DEBUG** rpath =", rpath
            title = os.path.basename(currentFile)
            title = re.findall(u'\d\d\s(.*).flac', title, re.U)
            title = title[0].decode("utf8")
            print "**DEBUG** title =", title
            fpath = os.path.split(os.path.dirname(currentFile))
            artist = fpath[0][2:]
            print "**DEBUG** artist =", artist
            album = fpath[1]
            print "**DEBUG** album =", album
            out = "%s | %s | %s | %s\n" % (rpath, artist, album, title)
            flist = open('filelist.tmp', 'a')
            flist.write(out)
            flist.close()

scandirs('./')

Code output:

**DEBUG** rpath = Thriftworks/Fader/Thriftworks - Fader - 01 180°.flac 
**DEBUG** title = 180° 
**DEBUG** artist = Thriftworks 
**DEBUG** album = Fader 
Traceback (most recent call last): 
    File "decflac.py", line 25, in <module> 
    scandirs('./') 
    File "decflac.py", line 7, in scandirs 
    scandirs(currentFile) 
    File "decflac.py", line 7, in scandirs 
    scandirs(currentFile) 
    File "decflac.py", line 20, in scandirs 
    out = "%s | %s | %s | %s\n" % (rpath, artist, album, title) 
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 46: ordinal not in range(128)

But when I try it in the Python console, it works fine:

>>> import re 
>>> title = "Thriftworks - Fader - 01 180°.flac" 
>>> title2 = "dummy" 
>>> title = re.findall(u'\d\d\s(.*).flac', title, re.U) 
>>> title = title[0].decode("utf8") 
>>> out = "%s | %s\n" % (title2, title) 
>>> print out 
dummy | 180°

So, my questions: 1) Why does the same code work in the console but not in the script? 2) How do I fix the script?

Source

2015-02-06 Maarten T.

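When using glob with file names that contain Unicode characters, use a Unicode string for the pattern. This makes glob return Unicode strings instead of byte strings. On output, printing Unicode strings automatically encodes them in the console's encoding. If your songs have characters that the console encoding does not support, you will still have problems. In that case, write the data to a UTF-8 encoded file and view it in an editor that supports UTF-8.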

>>> import glob 
>>> for f in glob.glob('*'): print f 
... 
ThriftworksFaderThriftworks - Fader - 01 180░.flac 
>>> for f in glob.glob(u'*'): print f 
... 
ThriftworksFaderThriftworks - Fader - 01 180°.flac
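
The same rule carries over to Python 3, where the split is between bytes and str rather than str and unicode. A minimal sketch, assuming Python 3 and a throwaway temporary directory (the file name and the helper glob_types are made up for this illustration): the type of the pattern decides the type of the names glob returns.

```python
import glob
import os
import tempfile

def glob_types(directory):
    """Return the element types produced by a text pattern and a bytes pattern."""
    text_hits = glob.glob(os.path.join(directory, "*.flac"))
    byte_hits = glob.glob(os.path.join(os.fsencode(directory), b"*.flac"))
    return type(text_hits[0]), type(byte_hits[0])

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file, created only so that the patterns have something to match.
    open(os.path.join(tmp, "01 track.flac"), "w").close()
    text_type, byte_type = glob_types(tmp)

print(text_type, byte_type)
```

Names returned as text can be formatted and printed without any manual decode step, which is the property the answer relies on.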

This also applies to os.walk, which is a simpler way to do a recursive search:

#!python2
import os, fnmatch

def scandirs(path):
    # A unicode start path makes os.walk yield unicode names.
    for root, dirs, files in os.walk(path):
        for f in files:
            if fnmatch.fnmatch(f, u'*.flac'):
                # Expected file name: "Artist - Album - NN Title.flac"
                artist, album, tracktitle = f.split(u' - ')
                print 'Artist:', artist
                print 'Album: ', album
                track, title = tracktitle.split(u' ', 1)
                title = title[:-5]  # strip the ".flac" extension
                print 'Track: ', track
                print 'Title: ', title

scandirs(u'.')

Output:

Source

2015-02-07 22:44:19

Thanks, Mark. I still could not get it to work with the u-prefixed glob, but with os.walk instead of the glob construct the script handles Unicode fine in Python 2. – 2015-02-09 12:25:48

The Python console works together with your terminal and interprets Unicode encodings according to its locale.
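
To see which encodings are in play on a given machine, something like the following can be run both in the console and from a script. It is only a sketch; the helper name report_encodings is made up for this illustration, and the values it reports depend entirely on the local setup.

```python
import locale
import sys

def report_encodings():
    """Collect the encodings Python uses for console output and implicit conversions."""
    return {
        "stdout": getattr(sys.stdout, "encoding", None),  # may be None when output is redirected
        "locale": locale.getpreferredencoding(False),
        "default": sys.getdefaultencoding(),
    }

encodings = report_encodings()
for name, value in sorted(encodings.items()):
    print(name, value)
```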

Replace the formatting line with the newer str.format:

out = u"{} | {} | {} | {}\n".format(rpath, artist, album, title)

and encode as utf8 when writing to the file:

with open('filelist.tmp', 'a') as f: 
    f.write(out.encode('utf8'))

or import codecs and do it directly:

with codecs.open('filelist.tmp', 'a', encoding='utf8') as f: 
    f.write(out)

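or, where utf8 is the default encoding (Python 2 uses ASCII for this implicit conversion, so there this variant fails on non-ASCII text):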

with open('filelist.tmp', 'a') as f: 
    f.write(out)
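
A variant that behaves the same on Python 2.6+ and Python 3 is io.open, which takes an encoding argument and accepts only unicode text. A minimal sketch, writing to a temporary file rather than the real filelist.tmp; the helper name append_line is made up, and the sample values are taken from the question's debug output.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

def append_line(filename, fields):
    """Append one ' | '-separated line of unicode text, encoded as UTF-8."""
    line = u" | ".join(fields) + u"\n"
    with io.open(filename, "a", encoding="utf-8") as handle:
        handle.write(line)
    return line

# Write into a scratch directory so the example leaves the real file list untouched.
tmpdir = tempfile.mkdtemp()
listing = os.path.join(tmpdir, "filelist.tmp")
written = append_line(listing, [u"Thriftworks", u"Fader", u"180\u00b0"])

# Read the file back with the same encoding to confirm the round trip.
with io.open(listing, "r", encoding="utf-8") as handle:
    read_back = handle.read()
print(read_back == written)
```

This only helps once every field is already unicode; a byte string mixed into the fields would still have to be decoded explicitly first.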

Source

2015-02-06 12:42:56 eumiro

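Thank you for the reply and for the explanation about the console and locale. Unfortunately, the proposed code fix does not seem to work; when I prefix the value of 'out' with 'u', the script stops with the same error. The only time I could get it past 'out =' was when I commented out the 'title = title[0].decode("utf8")' line and did not prefix 'out' with 'u'. But then the script fails at the write statement. Same error. – 2015-02-06 22:23:14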

*I tried all three suggested write statements – 2015-02-06 22:31:50

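In the console, your terminal settings define the encoding. Nowadays this is mostly a Unicode encoding, e.g. UTF-8 on Linux/BSD/MacOS, and Windows-1252 on Windows. In the interpreter, it defaults to the encoding of the Python file, which is usually ASCII (unless your code starts with a UTF byte-order mark).
I am not completely sure, but prefixing the string "%s | %s | %s | %s\n" with u to make it a unicode string may help.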
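
One way to read the traceback: in Python 2, formatting a byte string together with a unicode value promotes the whole result to unicode, and any byte strings are first decoded with the ASCII codec. The sketch below is written for Python 3, where that decode has to be spelled out. It assumes rpath was a UTF-8 byte string, which is consistent with the reported byte 0xc2 at position 46; the helper name first_undecodable is made up for this illustration.

```python
def first_undecodable(raw, codec="ascii"):
    """Return (position, byte value) of the first byte the codec rejects, or None."""
    try:
        raw.decode(codec)
    except UnicodeDecodeError as exc:
        return exc.start, raw[exc.start]
    return None

# Path from the question's debug output, encoded as UTF-8 (an assumption about the filesystem).
rpath = u"Thriftworks/Fader/Thriftworks - Fader - 01 180\u00b0.flac".encode("utf-8")
failure = first_undecodable(rpath)
print(failure)                      # (46, 194), and 194 == 0xc2
print(first_undecodable(b"dummy"))  # None: pure ASCII bytes decode without error
```

The console session used "dummy" in place of rpath, and all-ASCII bytes decode cleanly, which would explain why the same formatting line appeared to work there.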

Source

2015-02-06 12:43:09 llogiq

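Thanks for explaining the difference between the console and the interpreter. Makes total sense. Unfortunately, the suggested u prefix does not work; see my reply to eumiro's post. – 2015-02-06 22:29:56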

Solved by switching to Python3, which handles the Unicode case as expected.
Replaced:

title = title[0].decode("utf8")

with:

title = title[0]

There was not even a need to prefix the value of 'out' with 'u' or to specify an encoding when writing.
I love Python3.
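
For reference, a minimal Python 3 sketch of the whole script, assuming the Artist/Album/"Artist - Album - NN Title.flac" layout seen in the question's output. The helper names parse_flac_path and list_flacs are made up for this illustration, and it uses os.walk as suggested in the other answer rather than the original recursive glob.

```python
import os
import re

# Two digits, whitespace, then the title up to the ".flac" extension.
TITLE_RE = re.compile(r"\d\d\s(.*)\.flac$")

def parse_flac_path(rpath):
    """Split 'Artist/Album/... NN Title.flac' into (artist, album, title), or None."""
    head, filename = os.path.split(rpath)
    artist, album = os.path.split(head)
    match = TITLE_RE.search(filename)
    if not match or not artist or not album:
        return None
    return artist, album, match.group(1)

def list_flacs(top, out_name="filelist.tmp"):
    """Walk top and append 'rpath | artist | album | title' lines as UTF-8."""
    with open(out_name, "a", encoding="utf-8") as out:
        for root, _dirs, files in os.walk(top):
            for name in files:
                rpath = os.path.relpath(os.path.join(root, name), top)
                parsed = parse_flac_path(rpath)
                if parsed:
                    out.write("%s | %s | %s | %s\n" % ((rpath,) + parsed))

print(parse_flac_path("Thriftworks/Fader/Thriftworks - Fader - 01 180\u00b0.flac"))
```

Calling list_flacs('.') would append to filelist.tmp in the current directory; the encoding is passed explicitly here only so the output file does not depend on the locale.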

Source

2015-02-07 01:21:01
