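Reading every 4th line from 2 files simultaneously

I am working with large text files (10 MB gzipped). There are always two files that belong together, both with the same length and structure: each dataset consists of 4 lines.

I need to process the data from line 2 of each block, from both files simultaneously.

My question: what is the most time-efficient way to do this?

Right now I am doing this: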

import gzip
import itertools

def read_groupwise(iterable, n, fillvalue=None):
    # n references to the same iterator, so each tuple holds n consecutive items
    args = [iter(iterable)] * n
    return itertools.izip_longest(fillvalue=fillvalue, *args)

f1 = gzip.open(file1, "r")
f2 = gzip.open(file2, "r")
for (fline1, fline2, fline3, fline4), (rline1, rline2, rline3, rline4) in zip(read_groupwise(f1, 4), read_groupwise(f2, 4)):
    pass  # process fline2, rline2

But since I only need line 2 of each block, I guess there is probably a more efficient way to do this?

Any help is appreciated! Lastalda


2013-02-22 Lastalda

This can be done by building your own generator:

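def get_nth(iterable, n, after=1):
    # Yield every n-th item, starting with item number `after` (1-based).
    iterator = iter(iterable)
    if after > 1:
        consume(iterator, after - 1)  # skip the items before the first wanted one
    for item in iterator:  # ends cleanly when the input is exhausted
        yield item
        consume(iterator, n - 1)  # skip the rest of the current block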

with gzip.open(file1, "r") as f1, gzip.open(file2, "r") as f2:
    every = (4, 2)  # every 4th line, starting with line 2
    for line_f1, line_f2 in zip(get_nth(f1, *every), get_nth(f2, *every)):
        ...

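The generator advances to the first item to be yielded (in this case we want the second item, so we skip one item to place the iterator just before the second one), then yields one value, and then advances until it sits just before the next item to yield. It is a pretty simple way of accomplishing the task at hand.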

This uses consume() from the itertools recipes:

import collections
from itertools import islice

def consume(iterator, n):
    "Advance the iterator n-steps ahead. If n is None, consume entirely."
    # Use functions that consume iterators at C speed.
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)
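
To see how the two pieces fit together, here is a minimal, self-contained sketch for Python 3. It assumes a made-up list of 4-line records in place of a real file, and it uses a shortened consume() that only handles the case where n is given. The sample data and the name second_lines are illustrative only.

```python
from itertools import islice


def consume(iterator, n):
    """Advance the iterator n steps ahead (shortened itertools recipe)."""
    next(islice(iterator, n, n), None)


def get_nth(iterable, n, after=1):
    """Yield every n-th item, starting with the item at 1-based position `after`."""
    iterator = iter(iterable)
    consume(iterator, after - 1)   # skip the items before the first wanted one
    for item in iterator:
        yield item
        consume(iterator, n - 1)   # skip the remaining items of this block


# Two made-up 4-line records stand in for the lines of a file.
lines = ["@id1", "ACGT", "+", "IIII", "@id2", "TTGA", "+", "JJJJ"]
second_lines = list(get_nth(lines, 4, 2))
```

With n=4 and after=2, only the second line of each 4-line record is yielded; everything else is skipped inside islice without being handed to the loop body.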

One last point: I am not sure whether gzip.open() returns a context manager. If it does not, you will need to wrap it in contextlib.closing().
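
For reference, a small Python 3 sketch of the contextlib.closing() fallback mentioned above. As far as I know, GzipFile objects support the with statement directly from Python 2.7 and 3.1 onward, so the wrapper should only matter on older versions. The sample file and its contents are made up so the snippet can run on its own.

```python
import contextlib
import gzip
import os
import tempfile

# Hypothetical sample file, created only so the sketch is runnable.
path = os.path.join(tempfile.mkdtemp(), "sample.txt.gz")
with gzip.open(path, "wt") as out:
    out.write("@id1\nACGT\n+\nIIII\n")

# closing() turns any object that has a close() method into a context manager.
with contextlib.closing(gzip.open(path, "rt")) as handle:
    lines = [line.rstrip("\n") for line in handle]

# The file has been closed on leaving the with block.
is_closed = handle.closed
```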


2013-02-22 15:20:58

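I have tried this, but it is not faster than before. Thanks for the suggestion anyway :) Side question: what is the benefit of using "with open(file) as f: ..." compared to "f = open(file)"? – Lastalda 2013-02-25 09:56:00

Faster isn't the most important thing - readability is the more important concern. As for 'with', it closes the file when you leave the scope of the with block, and it does so as if inside a 'try: ... finally: ...' block, which means the file is closed even if there is an exception. It is more readable, gets rid of 'file.close()' clutter, and ensures the file is closed in all cases, which is generally a good idea. – 2013-02-25 14:20:25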
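
A rough Python 3 illustration of that point, using a throwaway temporary file and a simulated error (both are assumptions made for the sake of the example): the manual version needs an explicit try/finally, while the with block closes the file on its own even when an exception is raised inside it.

```python
import os
import tempfile

# Throwaway file so the sketch can run on its own.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w") as out:
    out.write("line 1\nline 2\n")

# Manual version: close() has to be guaranteed with try/finally.
f = open(path)
try:
    manual_first = f.readline()
finally:
    f.close()

# with version: the file is closed even though an exception is raised.
try:
    with open(path) as g:
        with_first = g.readline()
        raise ValueError("simulated failure while processing")
except ValueError:
    pass

closed_after_error = g.closed
```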

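I would suggest zipping the contents of both files right away with itertools.izip_longest, and then using itertools.islice to select every fourth element, starting from line 2: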

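>>> import gzip
>>> from itertools import islice, izip_longest
>>> def get_nth(iterable, n, after=1, fillvalue=""):
...     # zip the files line by line, skip `after` pairs, then take every n-th pair
...     return islice(izip_longest(*iterable, fillvalue=fillvalue), after, None, n)
...
>>> with gzip.open(file1, "r") as f1, gzip.open(file2, "r") as f2:
...     for line in get_nth([f1, f2], n=4):
...         print map(str.strip, line)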
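
The same idea as a self-contained Python 3 sketch, where izip_longest is named zip_longest. The helper write_sample, the file names, and the record contents are invented purely so the snippet can run; islice is called as islice(iterator, start, stop, step), so the number of pairs to skip goes in the start position and the block size in the step position.

```python
import gzip
import os
import tempfile
from itertools import islice, zip_longest


def get_nth(iterables, n, after=1, fillvalue=""):
    """Zip the iterables, skip `after` tuples, then yield every n-th tuple."""
    return islice(zip_longest(*iterables, fillvalue=fillvalue), after, None, n)


def write_sample(name, records):
    """Write made-up 4-line records to a gzip file and return its path."""
    path = os.path.join(tempfile.mkdtemp(), name)
    with gzip.open(path, "wt") as out:
        for ident, seq in records:
            out.write("@%s\n%s\n+\n%s\n" % (ident, seq, "I" * len(seq)))
    return path


file1 = write_sample("forward.gz", [("a", "ACGT"), ("b", "GGCC")])
file2 = write_sample("reverse.gz", [("a", "TTAA"), ("b", "CGCG")])

# "rt" opens the gzip files in text mode, so lines arrive as str, not bytes.
with gzip.open(file1, "rt") as f1, gzip.open(file2, "rt") as f2:
    pairs = [tuple(s.strip() for s in lines) for lines in get_nth([f1, f2], n=4)]
```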


2013-02-22 15:45:23 Abhijit

I have tried this, but it is not faster than before. Thanks for the suggestion anyway :) – Lastalda 2013-02-25 09:54:29

For a start, if you have the memory, try:

# index 1 is the second line of each 4-line block
ln1 = f1.readlines()[1::4]
ln2 = f2.readlines()[1::4]
for fline, rline in zip(ln1, ln2):
    ...

But only if you have the memory.
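
If memory is a concern, a streaming variant built on itertools.islice should select the same lines without holding both files in memory. The sketch below (Python 3) compares the two on made-up in-memory records; the lists forward and reverse are stand-ins for the open files, and whether this is actually faster on real gzip input is not something this example measures.

```python
from itertools import islice

# Made-up records; any iterable of lines (such as an open file) works the same way.
forward = ["@a", "ACGT", "+", "IIII", "@b", "GGCC", "+", "IIII"]
reverse = ["@a", "TTAA", "+", "IIII", "@b", "CGCG", "+", "IIII"]

# In-memory variant: slice the full list of lines (index 1 is line 2 of each block).
in_memory = list(zip(forward[1::4], reverse[1::4]))

# Streaming variant: islice walks each iterator without storing all lines.
streamed = list(zip(islice(iter(forward), 1, None, 4),
                    islice(iter(reverse), 1, None, 4)))
```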


2013-03-21 05:52:01 Paddy3118
