2014-11-21

I want to parse each line of a database file so it is ready for import. The file has fixed-width lines, but the widths are in characters, not bytes. I wrote something based on Martineau's answer to "Unpack fixed-width unicode file lines with special characters — Python UnicodeDecodeError", but I am running into problems with special characters.

Sometimes they break the expected widths, and sometimes they raise a UnicodeDecodeError. I believe the decode error could be fixed, but can I keep using struct.unpack and still decode special characters correctly? I think the problem is that they are encoded as multiple bytes, which confuses the expected field widths, which as I understand are counted in bytes rather than characters.

import os, csv

def ParseLine(arquivo):
    import struct
    format = "1x 12s 1x 18s 1x 16s"
    expand = struct.Struct(format).unpack_from
    unpack = lambda line: tuple(s.decode() for s in expand(line.encode()))
    for line in arquivo:
        fields = unpack(line)
        yield [x.strip() for x in fields]

Caminho = r"C:\Sample"
os.chdir(Caminho)

with open("Sample data.txt", 'r') as arq:
    with open("Out" + ".csv", "w", newline='') as sai:
        Write = csv.writer(sai, delimiter=";", quoting=csv.QUOTE_MINIMAL).writerows
        for line in ParseLine(arq):
            Write([line])
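The mismatch between the two kinds of width can be checked directly on one of the sample rows below (a minimal illustration; not part of the original code):

```python
# The struct format string counts bytes, but the file's fixed widths are
# in characters. Any multi-byte character makes the two counts diverge:
text = "| resaodra | rôn. 2x 17/220V | sreao.tttra v |"  # row 3 of the sample
chars = len(text)                    # width in characters (code points)
nbytes = len(text.encode("utf-8"))   # width in bytes: 'ô' takes 2 bytes
print(chars, nbytes)                 # the byte count is one larger
```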

Sample data:

|  field 1|  field 2  |  field 3 | 
| sreaodrsa | raesodaso t.thl o| .tdosadot. osa | 
| resaodra | rôn. 2x 17/220V | sreao.tttra v | 
| esarod sê | raesodaso t.thl o| .tdosadot. osa | 
| esarod sa í| raesodaso t.thl o| .tdosadot. osa | 

Actual output:

field 1;field 2;field 3 
sreaodrsa;raesodaso t.thl o;.tdosadot. osa 
resaodra;rôn. 2x 17/22;V | sreao.tttra 

In the output we see that lines 1 and 2 come out as expected. Line 3 has the wrong widths, probably because of the multi-byte ô. Line 4 raises the following exception:

Traceback (most recent call last):
  File "C:\Sample\FindSample.py", line 18, in <module>
    for line in ParseLine(arq):
  File "C:\Sample\FindSample.py", line 9, in ParseLine
    fields = unpack(line)
  File "C:\Sample\FindSample.py", line 7, in <lambda>
    unpack = lambda line: tuple(s.decode() for s in expand(line.encode()))
  File "C:\Sample\FindSample.py", line 7, in <genexpr>
    unpack = lambda line: tuple(s.decode() for s in expand(line.encode()))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 11: unexpected end of data
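The traceback can be reproduced in isolation. Because struct slices the encoded bytes at fixed byte offsets, a field boundary can fall between the two bytes of a single UTF-8 character, and decoding the truncated fragment then fails with the same "unexpected end of data" error (a minimal sketch; the string here is illustrative):

```python
# 'í' encodes to the UTF-8 byte pair 0xc3 0xad. A fixed byte-offset slice
# can split that pair, leaving a fragment that is not valid UTF-8.
raw = "sa í".encode("utf-8")   # b'sa \xc3\xad': 5 bytes for 4 characters
try:
    raw[:4].decode("utf-8")    # the slice ends between 0xc3 and 0xad
except UnicodeDecodeError as err:
    print(err)                 # reports "unexpected end of data"
```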

I need to perform specific operations on each field, so I can't use re.sub on the whole file as I did before. I would like to keep this code, since it seems efficient and is on the verge of working, but if there is a more efficient way to parse, I could try it. I need to preserve the special characters.

Answer

Indeed, the struct approach falls down here, because it expects fields to be a fixed number of bytes wide, while your format uses a fixed number of code points. I wouldn't use struct here. Your lines have already been decoded to Unicode values; just extract the data with slicing:

def ParseLine(arquivo):
    slices = [slice(1, 13), slice(14, 32), slice(33, 49)]
    for line in arquivo:
        yield [line[s].strip() for s in slices]

This deals entirely in characters of the already-decoded lines, not bytes. If you have field widths rather than indices, you can also generate the slice() objects:

def widths_to_slices(widths):
    pos = 0
    for width in widths:
        pos += 1  # delimiter
        yield slice(pos, pos + width)
        pos += width

def ParseLine(arquivo):
    widths = (12, 18, 16)
    for line in arquivo:
        yield [line[s].strip() for s in widths_to_slices(widths)]

Demo:

>>> sample = '''\
... |  field 1|  field 2  |  field 3 |
... | sreaodrsa | raesodaso t.thl o| .tdosadot. osa |
... | resaodra | rôn. 2x 17/220V | sreao.tttra v |
... | esarod sê | raesodaso t.thl o| .tdosadot. osa |
... | esarod sa í| raesodaso t.thl o| .tdosadot. osa |
... '''.splitlines()
>>> def ParseLine(arquivo):
...     slices = [slice(1, 13), slice(14, 32), slice(33, 49)]
...     for line in arquivo:
...         yield [line[s].strip() for s in slices]
...
>>> for line in ParseLine(sample):
...     print(line)
...
['field 1', 'field 2', 'field 3']
['sreaodrsa', 'raesodaso t.thl o', '.tdosadot. osa']
['resaodra', 'rôn. 2x 17/220V', 'sreao.tttra v']
['esarod sê', 'raesodaso t.thl o', '.tdosadot. osa']
['esarod sa í', 'raesodaso t.thl o', '.tdosadot. osa']
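For completeness, the slicing parser can be wired back into the question's CSV pipeline. This sketch uses in-memory streams instead of the question's file paths; when opening the real file, passing encoding='utf-8' to open() would also make the decoding explicit rather than platform-dependent:

```python
import csv
import io

def widths_to_slices(widths):
    # Turn field widths into slice objects, skipping the 1-char delimiter.
    pos = 0
    for width in widths:
        pos += 1                    # skip the leading '|'
        yield slice(pos, pos + width)
        pos += width

def ParseLine(arquivo, widths=(12, 18, 16)):
    slices = list(widths_to_slices(widths))
    for line in arquivo:
        yield [line[s].strip() for s in slices]

sample = io.StringIO("| esarod sa í| raesodaso t.thl o| .tdosadot. osa |\n")
out = io.StringIO()
csv.writer(out, delimiter=";", quoting=csv.QUOTE_MINIMAL).writerows(ParseLine(sample))
print(out.getvalue())   # esarod sa í;raesodaso t.thl o;.tdosadot. osa
```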

I used timeit to compare the two approaches on a 150 MB file. The struct approach ran in 108 seconds, while slicing took 67. I had to make some adjustments to fit it into my code, which may make it even faster, but I'm now convinced slicing is a good way to go. Thanks! – mvbentes 2014-11-22 17:36:11
