Alternative to tell() when iterating over the lines of a file in Python 3?

How can I find the position of the file pointer while iterating over a file in Python 3?

In Python 2.7 this is trivial: just call tell(). In Python 3 the same call raises an OSError:

Traceback (most recent call last): 
    File "foo.py", line 113, in check_file 
    pos = infile.tell() 
OSError: telling position disabled by next() call

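My use case is a progress bar while reading a large CSV file. Counting the total number of lines is too expensive because it needs an extra pass over the file. An approximation is perfectly useful: I don't care about buffering or other sources of noise, I just want to know whether it will take 10 seconds or 10 minutes.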

Simple code to reproduce the problem. It works as expected on Python 2.7 but raises on Python 3:

import csv
import os

file_size = os.stat(path).st_size
with open(path, "r") as infile:
    reader = csv.reader(infile)
    for row in reader:
        pos = infile.tell()  # OSError: telling position disabled by next() call
        print("At byte {} of {}".format(pos, file_size))

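This answer https://stackoverflow.com/a/29641787/321772 suggests that the problem is that the next() method disables tell() during iteration. The alternative is to read line by line manually, but that code lives inside the csv module, so I can't get at it. I also can't see what Python 3 gains by disabling tell().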
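For reference, reading line by line by hand does not require touching the csv module itself: csv.reader accepts any iterator of lines, and readline() does not disable tell() the way next() does. A minimal sketch, assuming a text-mode file; `rows_with_offsets` is a hypothetical helper name, and in text mode tell() returns an opaque number that matches the byte offset only in simple cases:

```python
import csv


def rows_with_offsets(infile):
    """Yield (offset, row) pairs; offset is infile.tell() after the row was read."""
    # iter(callable, sentinel) keeps calling infile.readline() until it
    # returns "", so the file's own __next__ is never used and tell()
    # stays enabled.
    reader = csv.reader(iter(infile.readline, ""))
    for row in reader:
        yield infile.tell(), row
```

Calling tell() on a text file for every row has some cost of its own, so for a progress bar it may be enough to call it every few thousand rows.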

So what is the preferred way to find the byte offset while iterating over the lines of a file in Python 3?

2017-09-25 Adam

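You could use `enumerate` and report the line number. That way you can give the user something useful without having to traverse the file twice. –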

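@MaartenFabré Sure, printing the line number is useful, if only to show that the script isn't stuck, and it is also all you can do when the length is unknown (i.e. when reading from standard input). But "55% done, 2 minutes remaining" is much better than "read 10,543,000 lines". – Adam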
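The "55% done, 2 minutes remaining" style of message only needs a byte position, the file size and the elapsed time. A rough sketch, assuming the remaining bytes are processed at the same average rate as the ones already read; `progress_message` and its parameters are hypothetical names:

```python
import time


def progress_message(pos, file_size, started_at, now=None):
    """Build a 'NN% done, about S s remaining' string from a byte position."""
    now = time.monotonic() if now is None else now
    elapsed = now - started_at
    fraction = pos / file_size if file_size else 1.0
    if fraction <= 0:
        return "0% done"
    # Linear extrapolation: time per byte so far times the bytes still to read.
    remaining = elapsed * (1 - fraction) / fraction
    return "{:.0f}% done, about {:.0f} s remaining".format(fraction * 100, remaining)
```

Record `started_at = time.monotonic()` before the loop and call the function with the current position whenever the progress bar is refreshed.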

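The csv module only expects the first argument of the reader call to be an iterator that returns one line on each next call, so you can use an iterator wrapper that counts the characters. If you want the count to be accurate, you have to open the file in binary mode. That is actually fine, because there is then no line-ending translation, which is what the csv module expects.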

So a possible wrapper is:

class SizedReader:
    def __init__(self, fd, encoding='utf-8'):
        self.fd = fd
        self.size = 0
        self.encoding = encoding  # specify encoding in constructor, with utf8 as default
    def __next__(self):
        line = next(self.fd)
        self.size += len(line)  # line is bytes here, so this counts bytes
        return line.decode(self.encoding)  # returns a decoded line (a true Python 3 string)
    def __iter__(self):
        return self

Your code then becomes:

import csv
import os

file_size = os.stat(path).st_size
with open(path, "rb") as infile:
    szrdr = SizedReader(infile)
    reader = csv.reader(szrdr)
    for row in reader:
        pos = szrdr.size  # gives position at end of current line
        print("At byte {} of {}".format(pos, file_size))

The good news here is that you keep all the power of the csv module, including newlines inside quoted fields...
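A small check of that claim, as a sketch: it repeats a compact copy of the wrapper above so that it runs on its own, and feeds it made-up in-memory data (an assumption for illustration) in which one record has a line break inside a quoted field.

```python
import csv
import io


class SizedReader:
    """Compact copy of the wrapper above, repeated so this snippet is standalone."""

    def __init__(self, fd, encoding="utf-8"):
        self.fd, self.size, self.encoding = fd, 0, encoding

    def __iter__(self):
        return self

    def __next__(self):
        line = next(self.fd)
        self.size += len(line)
        return line.decode(self.encoding)


# Hypothetical sample: the second record spans two physical lines.
data = b'id,comment\r\n1,"two\r\nlines"\r\n2,plain\r\n'
szrdr = SizedReader(io.BytesIO(data))
progress = [(row, szrdr.size) for row in csv.reader(szrdr)]

for row, size in progress:
    print("{} -> byte {} of {}".format(row, size, len(data)))
```

The record with the embedded line break comes back as a single row, and the counter advances past both physical lines before that row is reported.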

2017-09-25 15:09:15

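This works. You don't need to worry about the encoding, though: just take whatever you get, find its length, and return it. That way you don't change the decoding behaviour. Also note that you need a `def next(self): return self.__next__()` so that the same code works on both Python 2 and 3. – Adam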

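@Adam: The question is specifically about Python 3. If you don't decode what was read in binary mode, you get bytes rather than strings. The csv module behaves quite differently in Python 2 and Python 3, which is why I did not try to give compatible code. It is certainly possible, but it would be more complex. –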

Yes, but the question did not open the file in binary mode. – Adam
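One way to reconcile the two positions in this thread, sketched under assumptions: keep the file in text mode so that open() does the decoding, and have the wrapper only measure each line. `CountingLines` is a hypothetical name; the count is close to the true byte offset only if the file was opened with newline="" and the `encoding` argument matches the file's encoding.

```python
import csv


class CountingLines:
    """Pass text lines through unchanged while keeping a running byte count."""

    def __init__(self, lines, encoding="utf-8"):
        self.lines = iter(lines)
        self.encoding = encoding
        self.size = 0

    def __iter__(self):
        return self

    def __next__(self):
        line = next(self.lines)
        # Re-encode only to measure; the str handed to csv is left untouched.
        self.size += len(line.encode(self.encoding))
        return line
```

Usage mirrors the binary-mode version: wrap the open file, pass the wrapper to csv.reader, and read its `size` attribute inside the loop.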

If you are comfortable doing without the csv module, you can do something like this:

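import os

file_size = os.path.getsize('SampleCSV.csv')
pos = 0

# newline="" keeps "\r\n" endings intact so they are counted in full
with open('SampleCSV.csv', "r", newline="") as infile:
    for line in infile:
        # line already includes its line ending; len() counts characters,
        # so this is exact only for single-byte encodings
        pos += len(line)
        row = line.rstrip().split(',')
        print("At byte {} of {}".format(pos, file_size))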

But this may not work when a field itself contains a quoted comma.

Example: 1,"Hey, you..",22:04. Such cases could also be handled with a regular expression, though.
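To make the quoted-comma problem concrete, here is a small comparison of a plain split against csv.reader on that example line (csv.reader accepts any iterable of strings, so a one-element list is enough):

```python
import csv

line = '1,"Hey, you..",22:04'

naive = line.split(",")            # also splits at the comma inside the quotes
parsed = next(csv.reader([line]))  # follows the CSV quoting rules

print(naive)   # four pieces, with the quote characters still attached
print(parsed)  # three fields
```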

2017-09-25 13:52:50 Siddhesh
