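How do I retrieve only the second instance of a string in a text file?

I have a large number of text files (> 1000), all in the same format. How can I retrieve only the second instance of a string in each file?

The part of the file I'm interested in looks like this: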

# event 9 
num:  1 
length:  0.000000 
otherstuff: 19.9 18.8 17.7 
length: 0.000000 176.123456 

# event 10 
num:  1 
length:  0.000000 
otherstuff: 1.1 2.2 3.3 
length: 0.000000 1201.123456

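I only need the second value on the second instance of a given variable, in this case length. Is there a pythonic way to do this (i.e. not sed)?

My code looks like this: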

with open(wave_cat, 'r') as catID:
    for i, cat_line in enumerate(catID):
        if not len(cat_line.strip()) == 0:
            line = cat_line.split()
            #replen = re.sub('length:','length0:','length:')
            if line[0] == '#' and line[1] == 'event':
                num = int(line[2])
            elif line[0] == 'length:':
                Length = float(line[2])

2015-11-07 scootie

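Is this the entire contents of one file? – beezz

No, each file has more than 10 events, but they are all in the same format. Edit: I've changed the file format above. – scootie

Use a counter: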

with open(wave_cat, 'r') as catID:
    ct = 0  # number of 'length:' lines seen since the last reset
    for i, cat_line in enumerate(catID):
        if not len(cat_line.strip()) == 0:
            line = cat_line.split()
            #replen = re.sub('length:','length0:','length:')
            if line[0] == '#' and line[1] == 'event':
                num = int(line[2])
            elif line[0] == 'length:':
                ct += 1
                if ct == 2:
                    # Second 'length:' line: take its second value
                    Length = float(line[2])
                    ct = 0
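
As a self-contained illustration of the counter idea, here is a minimal sketch that can be run against the sample from the question. The function name `iter_second_lengths` and the inline `sample` string are hypothetical, introduced only for this example, and unlike the snippet above it resets the counter at each `# event` header rather than after the second hit, on the assumption that every event should be handled independently.

```python
def iter_second_lengths(lines):
    """Yield (event_number, value) for the second 'length:' line of each event."""
    event_num = None
    length_count = 0
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue  # skip blank lines
        if fields[:2] == ['#', 'event']:
            event_num = int(fields[2])
            length_count = 0  # new event: start counting again
        elif fields[0] == 'length:':
            length_count += 1
            if length_count == 2:
                # Second 'length:' line: its second value is fields[2]
                yield event_num, float(fields[2])


# Sample data copied from the question
sample = """\
# event 9
num:  1
length:  0.000000
otherstuff: 19.9 18.8 17.7
length: 0.000000 176.123456

# event 10
num:  1
length:  0.000000
otherstuff: 1.1 2.2 3.3
length: 0.000000 1201.123456
"""

results = list(iter_second_lengths(sample.splitlines()))
print(results)
```

Because the function accepts any iterable of lines, an open file object can be passed in place of `sample.splitlines()`.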

2015-11-07 16:37:33

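That works, thanks! – scootie

You're on the right track. It will probably be faster to defer the split until you actually need it. Also, if you are scanning a lot of files and only want the second length entry, breaking out of the loop as soon as you have seen it will save a lot of time.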

length_seen = 0
elements = []
with open(wave_cat, 'r') as catID:
    for line in catID:
        line = line.strip()
        if not line:
            continue
        if line.startswith('# event'):
            element = {'num': int(line.split()[2])}
            elements.append(element)
            length_seen = 0
        elif line.startswith('length:'):
            length_seen += 1
            if length_seen == 2:
                element['length'] = float(line.split()[2])
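
The early-exit idea can be sketched as follows, assuming only the first qualifying value per file is wanted and that lines carry no leading whitespace. The helper name `first_second_length` and the throwaway demo file are hypothetical, added only to make the example runnable.

```python
import os
import tempfile


def first_second_length(path):
    """Return the second value on the first event's second 'length:' line, or None."""
    length_seen = 0
    with open(path) as fh:
        for line in fh:
            if line.startswith('# event'):
                length_seen = 0  # count 'length:' lines per event
            elif line.startswith('length:'):
                length_seen += 1
                if length_seen == 2:
                    # Split only now; returning stops reading the rest of the file
                    return float(line.split()[2])
    return None


# Demo on a temporary file holding part of the sample from the question
sample = """\
# event 9
num:  1
length:  0.000000
otherstuff: 19.9 18.8 17.7
length: 0.000000 176.123456
"""
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)
value = first_second_length(tmp.name)
os.remove(tmp.name)
print(value)
```

Returning from inside the loop plays the role of `break` here: the rest of the file is never read, which is where the saving comes from when there are many files.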

2015-11-07 16:40:06 chthonicdaemon

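That did speed things up, thanks for pointing it out! I also added length_seen = 0 before the break, because a single file contains multiple copies of the same text. – scootie

I've modified it to build a list of the file's elements, including num and length. – chthonicdaemon

If you can read the whole file into memory, just run a regex against the file contents: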

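import glob, re

# Captures the second value on the second 'length:' line of each event
pat = re.compile(r'^# event.*?^length:.*?^length:\s[\d.]+\s(\d+\.\d+)', re.S | re.M)
for fn in glob.glob('*.txt'):  # placeholder pattern; substitute your own list of files
    with open(fn) as f:
        try:
            nm = pat.findall(f.read())[1]  # index 1: the capture from the second event
        except IndexError:
            nm = ''
        print(nm)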

If the files are larger, use mmap:

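import glob, mmap, re

nth = 1  # zero-based index of the match to report
# Bytes pattern, because an mmap object exposes bytes rather than str
pat = re.compile(rb'^# event.*?^length:.*?^length:\s[\d.]+\s(\d+\.\d+)', re.S | re.M)
for fn in glob.glob('*.txt'):  # placeholder pattern; substitute your own list of files
    with open(fn, 'rb') as f:
        # Map the whole file read-only; the regex scans it without reading it all at once
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        for i, m in enumerate(pat.finditer(mm)):
            if i == nth:
                print(m.group(1).decode())
                break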
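
To see what the pattern captures, here is a small check run against the sample from the question rather than against real files. The `sample` string and the `values` name are introduced only for this illustration.

```python
import re

# Same pattern as in the answer: re.M lets ^ match at each line start,
# re.S lets .*? span across lines within an event.
pat = re.compile(r'^# event.*?^length:.*?^length:\s[\d.]+\s(\d+\.\d+)', re.S | re.M)

# Sample data copied from the question
sample = """\
# event 9
num:  1
length:  0.000000
otherstuff: 19.9 18.8 17.7
length: 0.000000 176.123456

# event 10
num:  1
length:  0.000000
otherstuff: 1.1 2.2 3.3
length: 0.000000 1201.123456
"""

# findall returns one captured string per event
values = [float(v) for v in pat.findall(sample)]
print(values)
```

Each match starts at a `# event` header and ends at the second `length:` line of that event, so indexing the result picks an event, not an occurrence of `length:` within one.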

2015-11-07 16:47:54 dawg
