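Reading UTF-8 characters from a gzip file in Python

I am trying to read a gzipped file (.gz) in Python and am having some trouble.

I used the gzip module to read it, but the file is encoded as UTF-8 text, so eventually it reads an invalid character and crashes.

Does anyone know how to read gzip files that are encoded as UTF-8? I know there is a codecs module that can help, but I can't figure out how to use it.

Thanks!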

import string 
import gzip 
import codecs 

f = gzip.open('file.gz','r') 

engines = {} 
line = f.readline() 
while line: 
    parsed = string.split(line, u'\u0001') 

    #do some things... 

    line = f.readline() 
for en in engines: 
    print(en)

Source

2009-12-10 Juan Besa

Can you post the code you have so far? – 2009-12-10 20:03:42

Could you convert the UTF-8 file to ASCII and then try decompressing it? Hmm.... – whatsisname 2009-12-10 20:06:06

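I don't see why this should be so hard.

What exactly are you doing? Please explain "eventually it reads an invalid character".

It should be as simple as: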

import gzip 
fp = gzip.open('foo.gz') 
contents = fp.read() # contents now has the uncompressed bytes of foo.gz 
fp.close() 
u_str = contents.decode('utf-8') # u_str is now a unicode string

EDITED

This answer works in Python 2. For Python 3, see @SeppoEnarvi's answer at https://stackoverflow.com/a/19794943/610569 (it uses the rt mode of gzip.open).

Source

2009-12-10 20:11:27 sjbrown

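+1 ... this is the clearest and least complicated of the 3 answers so far. – 2009-12-10 22:49:23

Not necessarily the simplest, since you have to decode every line you read. With the getreader approach this happens automatically, so every line is unicode – SecurityJoe 2012-01-05 20:37:04

Although this is a good solution, I have a feeling that it will not scale well to large files. – 2016-11-09 15:59:20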
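On the large-file concern raised in the last comment: one possible way to avoid holding the whole decompressed file in memory is to read fixed-size chunks and feed them to an incremental decoder. This is only a sketch in Python 3. The helper name `iter_decoded_chunks`, the chunk size, and the temporary sample file are assumptions made for illustration, not part of the answer above.

```python
import codecs
import gzip
import os
import tempfile


def iter_decoded_chunks(path, chunk_size=64 * 1024, encoding="utf-8"):
    """Yield decoded text from a gzip file one chunk at a time (hypothetical helper)."""
    decoder = codecs.getincrementaldecoder(encoding)()
    with gzip.open(path, "rb") as fp:
        while True:
            chunk = fp.read(chunk_size)
            if not chunk:
                break
            # The decoder holds back an incomplete multi-byte sequence
            # until the following chunk supplies the remaining bytes.
            text = decoder.decode(chunk)
            if text:
                yield text
        # final=True makes the decoder raise if the data ends mid-character.
        tail = decoder.decode(b"", final=True)
        if tail:
            yield tail


# Hypothetical sample data so the sketch can be run end to end.
path = os.path.join(tempfile.mkdtemp(), "big.gz")
with gzip.open(path, "wb") as fp:
    fp.write(("é" * 10).encode("utf-8"))

# A chunk size of 3 deliberately splits two-byte characters across reads.
text = "".join(iter_decoded_chunks(path, chunk_size=3))
print(text)
```

Decoding each chunk with a plain `chunk.decode('utf-8')` could fail whenever a chunk boundary falls inside a multi-byte character, which is why the sketch uses an incremental decoder instead.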

Maybe

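import codecs
import gzip

zf = gzip.open(fname, 'rb')  # fname: path to the .gz file
reader = codecs.getreader("utf-8")  # StreamReader class for UTF-8
contents = reader(zf)  # decodes the byte stream as it is read
for line in contents:
    pass  # each line is already a unicode string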

Source

2009-12-10 20:21:02

As a one-liner: for line in codecs.getreader('utf-8')(gzip.open(fname), errors='replace'), which also adds control over error handling – SecurityJoe 2012-01-05 20:38:05
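To make the effect of errors='replace' from the comment above concrete, here is a small self-contained sketch in Python 3. The sample file and its contents are made up for illustration: it contains one valid UTF-8 line and one line with a stray 0xff byte, which is not valid UTF-8.

```python
import codecs
import gzip
import os
import tempfile

# Hypothetical sample file: one clean line, one line with an invalid byte.
path = os.path.join(tempfile.mkdtemp(), "sample.gz")
with gzip.open(path, "wb") as f:
    f.write("ok: café\n".encode("utf-8") + b"bad: \xff\n")

lines = []
with gzip.open(path, "rb") as zf:
    # errors='replace' substitutes U+FFFD for undecodable bytes
    # instead of raising UnicodeDecodeError.
    reader = codecs.getreader("utf-8")(zf, errors="replace")
    for line in reader:
        lines.append(line)

for line in lines:
    print(repr(line))
```

With the default errors='strict', the same loop would raise UnicodeDecodeError on the second line, which is probably the kind of crash described in the question.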

In Pythonic form (2.5 or later)

from __future__ import with_statement # for 2.5, does nothing in 2.6
from gzip import open as gzopen

with gzopen('foo.gz') as gzfile:
    for line in gzfile:
        print(line.decode('utf-8'))

Source

2009-12-10 20:26:12

This is possible since Python 3.3:

import gzip 
gzip.open('file.gz', 'rt', encoding='utf-8')

Note that gzip.open() requires you to explicitly specify text mode ('t').

Source

2013-11-05 17:20:37
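Applied to the loop in the question, the text-mode approach might look roughly like the sketch below (Python 3.3 or later). The helper name `read_fields` and the sample data are assumptions for illustration. The '\u0001' separator is taken from the question's code.

```python
import gzip
import os
import tempfile


def read_fields(path, sep="\u0001"):
    """Yield each line of a UTF-8 gzip file as a list of fields (hypothetical helper)."""
    # 'rt' opens the file in text mode, so every line arrives as a decoded str.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n").split(sep)


# Hypothetical sample data so the sketch can be run end to end.
path = os.path.join(tempfile.mkdtemp(), "file.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write("café\u0001búsqueda\nnaïve\u0001motor\n")

rows = list(read_fields(path))
print(rows)
```

Because decoding happens inside gzip.open, the loop body works with str values directly and needs neither the string module nor codecs.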

The above produced a lot of decoding errors. I used this:

import gzip
import io

for line in io.TextIOWrapper(io.BufferedReader(gzip.open(filePath)), encoding='utf8', errors='ignore'):
    ...

Source

2014-08-10 20:13:14 Yurik
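A caveat worth illustrating: errors='ignore' silently drops any bytes that cannot be decoded, so data can go missing without notice. The sketch below (Python 3, with a made-up sample file) compares it with errors='replace', which gzip.open also accepts directly in text mode.

```python
import gzip
import io
import os
import tempfile

# Hypothetical sample: valid UTF-8 text with one stray 0xff byte in the middle.
path = os.path.join(tempfile.mkdtemp(), "noisy.gz")
with gzip.open(path, "wb") as f:
    f.write("café ".encode("utf-8") + b"\xff" + " latte\n".encode("utf-8"))

# errors='ignore' drops the undecodable byte without leaving a trace.
with io.TextIOWrapper(gzip.open(path, "rb"), encoding="utf-8", errors="ignore") as text:
    ignored = text.read()

# errors='replace' marks the position with U+FFFD instead.
with gzip.open(path, "rt", encoding="utf-8", errors="replace") as text:
    replaced = text.read()

print(repr(ignored))
print(repr(replaced))
```

If a file produces many decoding errors, it may not actually be UTF-8, in which case passing the correct encoding is likely a better fix than ignoring the errors.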
