2015-11-07 55 views
0

I have a large number of text files (> 1000), all in the same format. How can I retrieve only the second instance of a string in a text file?

The part of the file I am interested in looks like this:

# event 9 
num:  1 
length:  0.000000 
otherstuff: 19.9 18.8 17.7 
length: 0.000000 176.123456 

# event 10 
num:  1 
length:  0.000000 
otherstuff: 1.1 2.2 3.3 
length: 0.000000 1201.123456 

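I only need the second value on the second occurrence of a given variable, in this case length. Is there a Pythonic way to do this (i.e. not sed)?

My code looks like this: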

with open(wave_cat,'r') as catID:
    for i, cat_line in enumerate(catID):
        if not len(cat_line.strip()) == 0:
            line = cat_line.split()
            #replen = re.sub('length:','length0:','length:')
            if line[0] == '#' and line[1] == 'event':
                num = long(line[2])
            elif line[0] == 'length:':
                Length = float(line[2])
+0

Is this the entire content of one file? – beezz

+0

No, each file has more than 10 events, all in the same format. Edit: I have changed the file format above. – scootie

Answers

0

Use a counter:

with open(wave_cat, 'r') as catID:
    ct = 0  # number of 'length:' lines seen in the current event
    for cat_line in catID:
        if cat_line.strip():
            line = cat_line.split()
            if line[0] == '#' and line[1] == 'event':
                num = int(line[2])  # int replaces Python 2's long
                ct = 0  # start counting afresh for each event
            elif line[0] == 'length:':
                ct += 1
                if ct == 2:
                    # second 'length:' line: take its second value
                    Length = float(line[2])
+0

That worked, thanks! – scootie

0

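You are on the right track. It will probably be faster to defer the split until you actually need it. Also, if you are scanning a large number of files and only want the second length entry, breaking out of the loop as soon as you have seen it will save a lot of time.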

length_seen = 0
elements = []
with open(wave_cat, 'r') as catID:
    for line in catID:
        line = line.strip()
        if not line:
            continue
        if line.startswith('# event'):
            # new event: record its number and restart the length counter
            element = {'num': int(line.split()[2])}
            elements.append(element)
            length_seen = 0
        elif line.startswith('length:'):
            length_seen += 1
            if length_seen == 2:
                # split only the one line whose value is needed
                element['length'] = float(line.split()[2])
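
The listing above scans every event, so it never leaves the loop early. A minimal sketch of the early-exit variant described in this answer follows, assuming only the first event's value is wanted from each file. The function name `second_length` is hypothetical, and the sample lines are copied from the question.

```python
def second_length(lines):
    """Return the second value on the second 'length:' line, or None."""
    seen = 0
    for line in lines:
        line = line.strip()
        if line.startswith('length:'):
            seen += 1
            if seen == 2:
                # Split only now, on the one line we actually need,
                # and stop reading the rest of the input.
                return float(line.split()[2])
    return None


sample = """# event 9
num:  1
length:  0.000000
otherstuff: 19.9 18.8 17.7
length: 0.000000 176.123456
""".splitlines()

print(second_length(sample))  # 176.123456
```

Because the function accepts any iterable of lines, an open file object can be passed directly, for example `with open(wave_cat) as f: second_length(f)`, and reading stops at the return.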
+0

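This really did speed things up, thanks for pointing it out! I also added length_seen = 0 before the break, because a single file contains multiple copies of the same text. – scootie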

+0

I have modified it to build a list of the file's elements, including num and length. – chthonicdaemon

1

If you can read the whole file into memory, just run a regex against the file contents:

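import glob, re

# Captures the second value on the second 'length:' line of each event.
pat = re.compile(r'^# event.*?^length:.*?^length:\s[\d.]+\s(\d+\.\d+)', re.S | re.M)
for fn in glob.glob('*.txt'):  # placeholder pattern; substitute your own list of files
    with open(fn) as f:
        try:
            nm = pat.findall(f.read())[1]  # index 1 = second event in the file
        except IndexError:
            nm = ''
        print(nm)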

If the files are large, use mmap:

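import glob, mmap, re

nth = 1  # zero-based index of the match to report (1 = second event)
# Bytes pattern, because an mmap object yields bytes in Python 3.
pat = re.compile(rb'^# event.*?^length:.*?^length:\s[\d.]+\s(\d+\.\d+)', re.S | re.M)
for fn in glob.glob('*.txt'):  # placeholder pattern; substitute your own list of files
    with open(fn, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        for i, m in enumerate(pat.finditer(mm)):
            if i == nth:
                print(m.group(1).decode())
                break
        mm.close()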
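
To see what the pattern captures, here is a small sketch that runs it against the two sample events from the question, held in a string rather than read from a file. The variable names `sample` and `values` are introduced only for illustration.

```python
import re

pat = re.compile(r'^# event.*?^length:.*?^length:\s[\d.]+\s(\d+\.\d+)', re.S | re.M)

sample = """\
# event 9
num:  1
length:  0.000000
otherstuff: 19.9 18.8 17.7
length: 0.000000 176.123456

# event 10
num:  1
length:  0.000000
otherstuff: 1.1 2.2 3.3
length: 0.000000 1201.123456
"""

# One match per event; group 1 is the second value on the second 'length:' line.
values = [float(v) for v in pat.findall(sample)]
print(values)  # [176.123456, 1201.123456]
```

One caveat: because `re.S` lets `.*?` cross line boundaries, an event with only one `length:` line would cause the match to run on into the following event, so the pattern relies on every event having the same layout.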