2016-01-23 103 views

Text extraction and line splitting with Python

I have the following code:

f = open('./dat.txt', 'r') 
array = [] 
for line in f: 
    # if "1\t\"Overall evaluation" in line: 
    # words = line.split("1\t\"Overall evaluation") 
    # print words[0] 
    number = int(line.split(':')[1].strip('"\n')) 
    print number 

This manages to grab the trailing int from each line of my data, which looks like this:

299 1 "Overall evaluation: 3 
Invite to interview: 3 
Strength or novelty of the idea (1): 4 
Strength or novelty of the idea (2): 3 
Strength or novelty of the idea (3): 3 
Use or provision of open data (1): 4 
Use or provision of open data (2): 3 
""Open by default"" (1): 2 
""Open by default"" (2): 3 
Value proposition and potential scale (1): 4 
Value proposition and potential scale (2): 2 
Market opportunity and timing (1): 4 
Market opportunity and timing (2): 4 
Triple bottom line impact (1): 4 
Triple bottom line impact (2): 2 
Triple bottom line impact (3): 2 
Knowledge and skills of the team (1): 3 
Knowledge and skills of the team (2): 4 
Capacity to realise the idea (1): 4 
Capacity to realise the idea (2): 3 
Capacity to realise the idea (3): 4 
Appropriateness of the budget to realise the idea: 3" 
299 2 "Overall evaluation: 3 
Invite to interview: 3 
Strength or novelty of the idea (1): 3 
Strength or novelty of the idea (2): 2 
Strength or novelty of the idea (3): 4 
Use or provision of open data (1): 4 
Use or provision of open data (2): 3 
""Open by default"" (1): 3 
""Open by default"" (2): 2 
Value proposition and potential scale (1): 4 
Value proposition and potential scale (2): 3 
Market opportunity and timing (1): 4 
Market opportunity and timing (2): 3 
Triple bottom line impact (1): 3 
Triple bottom line impact (2): 2 
Triple bottom line impact (3): 1 
Knowledge and skills of the team (1): 4 
Knowledge and skills of the team (2): 4 
Capacity to realise the idea (1): 4 
Capacity to realise the idea (2): 4 
Capacity to realise the idea (3): 4 
Appropriateness of the budget to realise the idea: 2" 

364 1 "Overall evaluation: 3 
Invite to interview: 3 
... 

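I also need to grab the "record identifier", which in the example above is 299 for the first two instances and then 364 for the next one.

If I delete the last few lines and use only the commented-out code above, like this: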

f = open('./dat.txt', 'r') 
array = [] 
for line in f: 
    if "1\t\"Overall evaluation" in line: 
        words = line.split("1\t\"Overall evaluation") 
        print words[0] 
    # number = int(line.split(':')[1].strip('"\n')) 
    # print number 

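then I can grab the record identifier.

But I'm having trouble putting the two together.

Ideally, what I want is something like the following: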

368 

=2+3+3+3+4+3+2+3+2+3+2+3+2+3+2+3+2+4+3+2+3+2 

=2+3+3+3+4+3+2+3+2+3+2+3+2+3+2+3+2+4+3+2+3+2 

and so on for all records.

How can I combine the two script components above to achieve this?

You look like an experienced user, so you should know that _that_ is not the way to process data in Python. Instead, I suggest you work with dictionaries.

Looks can be deceiving. What do you mean?

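I mean that the dat.txt file is not laid out in a way that makes it convenient for you to parse. You should try to get it (from wherever it comes from) properly structured, for example as a dictionary, so that the only thing you need to do is pass the key you want (the record identifier, as you call it).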

Answer

Regular expressions are the ticket. You can do it with two patterns. Something like this:

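import re 

with open('./dat.txt') as fin: 
    for line in fin: 
        # A record's first line looks like: 299<TAB>1<TAB>"Overall evaluation: 3 
        ma = re.match(r'^(\d+)\s+\d+\s+"Overall evaluation', line) 
        if ma: 
            print("record identifier %r" % ma.group(1)) 
            # no `continue` here: this line also carries a score 
        # The score ends the line, optionally followed by the closing quote. 
        ma = re.search(r':\s*(\d+)"?\s*$', line) 
        if ma: 
            print(ma.group(1)) 
            continue 
        print("unrecognized line: %s" % line) 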

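Note: the final print statement isn't part of what you asked for, but whenever I debug regular expressions I always add some kind of catch-all to help spot bad patterns. Once I get my patterns right, I remove the catch-all.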
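
To get from those two patterns to the output the question asks for (a record identifier followed by one `=a+b+...` line per review), the matches have to be accumulated rather than printed one by one. Below is a minimal sketch of one way to do that. It assumes the columns on a review's first line are separated by whitespace (the question's own code suggests tabs) and that every score is the integer after the last colon. The names `collect_scores` and `format_records` are made up for this illustration, and the sample lines are shortened from the data in the question.

```python
import re
from collections import OrderedDict

# First line of a review: record id, reviewer number, then the opening quote.
HEADER = re.compile(r'^(\d+)\s+\d+\s+"')
# A score is the integer after the last colon, optionally followed by a closing quote.
SCORE = re.compile(r':\s*(\d+)"?\s*$')


def collect_scores(lines):
    """Map each record identifier to a list of per-review score lists."""
    records = OrderedDict()
    current = None  # score list of the review currently being read
    for line in lines:
        header = HEADER.match(line)
        if header:
            current = []
            records.setdefault(header.group(1), []).append(current)
        # The header line carries a score too, so it is not skipped.
        score = SCORE.search(line)
        if score and current is not None:
            current.append(int(score.group(1)))
    return records


def format_records(records):
    """Render each record id followed by one '=a+b+...' line per review."""
    out = []
    for record_id, reviews in records.items():
        out.append(record_id)
        for scores in reviews:
            out.append('=' + '+'.join(str(s) for s in scores))
    return '\n'.join(out)


# Shortened sample in the shape of dat.txt (tab-separated columns assumed).
sample = [
    '299\t1\t"Overall evaluation: 3\n',
    'Invite to interview: 3\n',
    '""Open by default"" (1): 2\n',
    'Appropriateness of the budget to realise the idea: 3"\n',
    '299\t2\t"Overall evaluation: 3\n',
    'Invite to interview: 2"\n',
    '\n',
    '364\t1\t"Overall evaluation: 4"\n',
]
print(format_records(collect_scores(sample)))
```

For the sample this prints `299`, then `=3+3+2+3` and `=3+2`, then `364` and `=4`, each on its own line. To run it on the real file, pass the open file object instead of `sample`, for example `collect_scores(open('./dat.txt'))`. Keying the result by record identifier is also roughly what the comment about dictionaries is pointing at.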