python - Reading a file from a specific line of text

I am not talking about specific line numbers, because I am reading multiple files that have the same format but vary in length.
Say I have this text file:

Something here... 
... ... ... 
Start      #I want this block of text 
a b c d e f g 
h i j k l m n 
End      #until this line of the file 
something here... 
... ... ...

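I hope you see what I mean. I was thinking of iterating over the file, searching with a regular expression to find the line numbers of "Start" and "End", and then using linecache to read from the Start line to the End line. But how do I get the line numbers? What function can I use?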


2011-09-26 BPm

This question is very similar to this one: http://stackoverflow.com/questions/7098530/repeatedly-extract-a-line-between-two-delimiters-in-a-text-file-python – salomonvh

If you simply want the block of text between Start and End, you can do something simple like:

with open('test.txt') as input_data:
    # Skips text before the beginning of the interesting block:
    for line in input_data:
        if line.strip() == 'Start':  # Or whatever test is needed
            break
    # Reads text until the end of the block:
    for line in input_data:  # This keeps reading the file
        if line.strip() == 'End':
            break
        print(line)  # Line is extracted (or block_of_lines.append(line), etc.)

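In fact, you do not need to manipulate line numbers in order to read the data between the Start and End markers.

The logic ("read until...") is repeated in two blocks, but it is perfectly clear and efficient (other methods generally involve checking some state [before block/within block/end of block reached], which incurs a time penalty).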
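
The same two-loop idea can be wrapped in a reusable generator. The following is only a sketch and not part of the original answer: the name `lines_between` and its parameters are made up for illustration, and the sample text is a shortened version of the file in the question with the markers alone on their lines.

```python
import io


def lines_between(lines, start="Start", end="End"):
    """Yield the lines found strictly between the start and end markers."""
    # A single shared iterator, so the second loop resumes where the first stopped.
    lines = iter(lines)
    # Skip everything up to and including the start marker.
    for line in lines:
        if line.strip() == start:
            break
    # Yield lines until the end marker (or the end of the input) is reached.
    for line in lines:
        if line.strip() == end:
            break
        yield line


sample = io.StringIO(
    "Something here...\n"
    "Start\n"
    "a b c d e f g\n"
    "h i j k l m n\n"
    "End\n"
    "something here...\n"
)
block = list(lines_between(sample))
print(block)
```

An open file object can be passed in place of `sample`; because the function is a generator, the file is read only as far as the End marker.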


2011-09-26 18:29:28 EOL

This should be a start for you:

started = False
collected_lines = []
with open(path, "r") as fp:  # path: the file to read
    for i, line in enumerate(fp):
        if line.rstrip() == "Start":
            started = True
            print("started at line", i)  # counts from zero!
            continue
        if started and line.rstrip() == "End":
            print("end at line", i)
            break
        # process line: only keep lines that come after the Start marker
        if started:
            collected_lines.append(line.rstrip())

enumerate takes an iterable and yields (index, item) pairs, counting from zero by default. For example,

print(list(enumerate("a b c".split())))

prints

[(0, 'a'), (1, 'b'), (2, 'c')]
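
If the line numbers themselves are needed (for example, to hand them to linecache, which counts from one), `enumerate` accepts a `start` argument. The following is a small sketch under that assumption; the function name `find_marker_lines` and its parameters are hypothetical and do not come from the answer.

```python
def find_marker_lines(lines, start_marker="Start", end_marker="End"):
    """Return (start_line, end_line) as 1-based line numbers.

    A marker that is never found is reported as None.
    """
    start_no = end_no = None
    for number, line in enumerate(lines, start=1):  # count from one, not zero
        text = line.strip()
        if start_no is None:
            # Still looking for the start marker.
            if text == start_marker:
                start_no = number
        elif text == end_marker:
            # The first end marker after the start marker closes the block.
            end_no = number
            break
    return start_no, end_no


sample = ["Something here...\n", "Start\n", "a b c\n", "End\n", "something here...\n"]
print(find_marker_lines(sample))
```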

UPDATE:

The poster asked about using a regular expression to match lines such as "===" and "==":

import re
print(re.match("^=+$", "===") is not None)     # True
print(re.match("^=+$", "======") is not None)  # True
print(re.match("^=+$", "=") is not None)       # True
print(re.match("^=+$", "=abc") is not None)    # False
print(re.match("^=+$", "abc=") is not None)    # False
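
Such a pattern can replace the string comparison in the scanning loop. The sketch below assumes a hypothetical file layout in which the block of interest sits between the first two lines made up only of "=" characters; the names `DELIMITER` and `block_between_delimiters` are illustrative, not from the answer.

```python
import re

# A line made up only of one or more '=' characters.
DELIMITER = re.compile(r"^=+$")


def block_between_delimiters(lines):
    """Return the lines between the first two delimiter lines."""
    collected = []
    inside = False
    for line in lines:
        if DELIMITER.match(line.strip()):
            if inside:
                break  # second delimiter: the block is complete
            inside = True  # first delimiter: the block starts on the next line
            continue
        if inside:
            collected.append(line.rstrip("\n"))
    return collected


sample = ["header\n", "===\n", "a b c\n", "h i j\n", "======\n", "footer\n"]
print(block_between_delimiters(sample))
```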


2011-09-26 18:22:51 rocksportrocker

Here is something that will work:

block = ""
found = False

with open("test.txt") as data_file:  # the file is closed automatically
    for line in data_file:
        if found:
            block += line
            if line.strip() == "End":
                break
        elif line.strip() == "Start":
            found = True
            block = "Start\n"  # keep the marker on its own line


2011-09-26 18:23:48 orlp

That's a neat trick – BPm

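@BPm: This is an example of a "finite-state machine" (http://en.wikipedia.org/wiki/Finite_state_machine): the machine starts in the "block not yet found" state (found == False), then runs in the "inside block" state (found == True) and, in this case, stops when "End" is found. Such machines can be slightly inefficient (here, 'found' has to be checked for every line in the block), but state machines generally let you express the logic of more complex algorithms clearly. – EOL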

+1, because this is a good example of the perfectly valid state machine approach. – EOL
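
To illustrate how the state machine approach extends to more complex cases, here is a sketch that collects every Start/End block in the input rather than only the first one. It is not from the thread: the function name `iter_blocks` is hypothetical, and dropping a block that has no End marker is a design assumption.

```python
def iter_blocks(lines, start="Start", end="End"):
    """Yield each Start...End block as a list of lines (markers excluded)."""
    inside = False  # the machine's state: are we currently inside a block?
    block = []
    for line in lines:
        text = line.strip()
        if not inside:
            if text == start:
                # Transition: outside -> inside, begin a fresh block.
                inside = True
                block = []
        elif text == end:
            # Transition: inside -> outside, hand over the finished block.
            inside = False
            yield block
        else:
            block.append(line.rstrip("\n"))
    # A block that is still open at the end of the input is discarded.


sample = ["x\n", "Start\n", "a\n", "End\n", "y\n", "Start\n", "b\n", "c\n", "End\n"]
print(list(iter_blocks(sample)))
```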

You can do this quite easily with a regular expression. You can make it as robust as you need; below is a simple example.

>>> import re 
>>> START = "some" 
>>> END = "Hello" 
>>> test = "this is some\nsample text\nthat has the\nwords Hello World\n" 
>>> m = re.compile(r'%s.*?%s' % (START,END),re.S) 
>>> m.search(test).group(0) 
'some\nsample text\nthat has the\nwords Hello'


2011-09-26 20:23:02 pyInTheSky

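+1: Very good idea: this is compact and probably very efficient, since the 're' module is fast. That said, the START and END tags should be forced to be on a line by themselves in your regular expression ('^...$'). – EOL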

Thanks :) I don't think you can use ^ or $ when you use the re.S flag, since it includes newlines; I think you would need to say explicitly '%s\n.*?%s\n' – pyInTheSky

In this case, you can definitely use ^...$; just add the re.MULTILINE flag (http://docs.python.org/dev/library/re.html#module-contents). – EOL
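
A sketch of what the combination of both flags could look like: re.DOTALL (re.S) lets "." cross newlines, while re.MULTILINE makes ^ and $ match at the start and end of every line. The function name `extract_block` is made up for illustration, and the pattern assumes the markers have no leading or trailing whitespace on their lines.

```python
import re


def extract_block(text, start="Start", end="End"):
    """Return the text between a start line and an end line, or None if absent."""
    pattern = re.compile(
        # Each marker must fill a whole line; the group captures what lies between.
        r"^%s$\n(.*?)^%s$" % (re.escape(start), re.escape(end)),
        re.MULTILINE | re.DOTALL,
    )
    match = pattern.search(text)
    return match.group(1) if match else None


text = (
    "Something here...\n"
    "Start\n"
    "a b c d e f g\n"
    "h i j k l m n\n"
    "End\n"
    "something here...\n"
)
print(extract_block(text))
```

Because the markers are anchored, a line such as "Startup" or "The End" is not mistaken for a marker.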
