Text extraction and line splitting with Python

I have the following code:

f = open('./dat.txt', 'r')
array = []
for line in f:
    # if "1\t\"Overall evaluation" in line:
    #     words = line.split("1\t\"Overall evaluation")
    #     print(words[0])
    number = int(line.split(':')[1].strip('"\n'))
    print(number)

This manages to grab the last int on each line of my data, which looks like this:

299 1 "Overall evaluation: 3 
Invite to interview: 3 
Strength or novelty of the idea (1): 4 
Strength or novelty of the idea (2): 3 
Strength or novelty of the idea (3): 3 
Use or provision of open data (1): 4 
Use or provision of open data (2): 3 
""Open by default"" (1): 2 
""Open by default"" (2): 3 
Value proposition and potential scale (1): 4 
Value proposition and potential scale (2): 2 
Market opportunity and timing (1): 4 
Market opportunity and timing (2): 4 
Triple bottom line impact (1): 4 
Triple bottom line impact (2): 2 
Triple bottom line impact (3): 2 
Knowledge and skills of the team (1): 3 
Knowledge and skills of the team (2): 4 
Capacity to realise the idea (1): 4 
Capacity to realise the idea (2): 3 
Capacity to realise the idea (3): 4 
Appropriateness of the budget to realise the idea: 3" 
299 2 "Overall evaluation: 3 
Invite to interview: 3 
Strength or novelty of the idea (1): 3 
Strength or novelty of the idea (2): 2 
Strength or novelty of the idea (3): 4 
Use or provision of open data (1): 4 
Use or provision of open data (2): 3 
""Open by default"" (1): 3 
""Open by default"" (2): 2 
Value proposition and potential scale (1): 4 
Value proposition and potential scale (2): 3 
Market opportunity and timing (1): 4 
Market opportunity and timing (2): 3 
Triple bottom line impact (1): 3 
Triple bottom line impact (2): 2 
Triple bottom line impact (3): 1 
Knowledge and skills of the team (1): 4 
Knowledge and skills of the team (2): 4 
Capacity to realise the idea (1): 4 
Capacity to realise the idea (2): 4 
Capacity to realise the idea (3): 4 
Appropriateness of the budget to realise the idea: 2" 

364 1 "Overall evaluation: 3 
Invite to interview: 3 
...

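I also need to grab the "record identifier", which in the example above is 299 for the first two instances and then 364 for the next one.

If I take the commented-out code above, remove the last few lines, and use only that, like this: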

f = open('./dat.txt', 'r')
array = []
for line in f:
    if "1\t\"Overall evaluation" in line:
        words = line.split("1\t\"Overall evaluation")
        print(words[0])
    # number = int(line.split(':')[1].strip('"\n'))
    # print(number)

it grabs the record identifier.

But I'm having trouble putting the two together.

Ideally, what I want is something like the following:

368 

=2+3+3+3+4+3+2+3+2+3+2+3+2+3+2+3+2+4+3+2+3+2 

=2+3+3+3+4+3+2+3+2+3+2+3+2+3+2+3+2+4+3+2+3+2

and so on for all the records.

How can I combine the two script components above to achieve this?

Source

2016-01-23 s.matthew.english

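You look like an experienced user who should know that _that_ is not the way to handle data in Python. Instead, I'd suggest you work with dictionaries.

Looks can be deceiving. What do you mean?

I mean that the dat.txt file is not laid out in a way that is easy for you to parse. You should try to get it (from wherever it comes from) properly structured, for example as a dictionary, so that the only thing you need to do is pass in the key you want (the record identifier, as you call it).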
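As a rough sketch of the dictionary idea in the comment above: if dat.txt really is tab-separated with the third column quoted (the question's "1\t\"Overall evaluation" split suggests so, but that is an assumption), the standard csv module can read each multi-line review as one field. The function name load_reviews and the sample data are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def load_reviews(stream):
    """Return {record_id: [{criterion: score, ...}, ...]}, one dict per review."""
    reviews = defaultdict(list)
    # Assumption: columns are tab-separated and the third column is a quoted,
    # multi-line field in which "" stands for a literal double quote.
    for row in csv.reader(stream, delimiter='\t'):
        if len(row) < 3:
            continue  # skip blank or malformed rows
        record_id, body = row[0], row[2]
        scores = {}
        for entry in body.splitlines():
            # Split on the last colon so a colon inside a criterion name survives.
            criterion, sep, value = entry.rpartition(':')
            if sep and value.strip().isdigit():
                scores[criterion.strip()] = int(value)
        reviews[record_id].append(scores)
    return dict(reviews)


# Hypothetical two-review sample in the assumed layout.
sample = (
    '299\t1\t"Overall evaluation: 3\n'
    '""Open by default"" (1): 2"\n'
    '299\t2\t"Overall evaluation: 4\n'
    '""Open by default"" (1): 3"\n'
)
reviews = load_reviews(io.StringIO(sample, newline=''))
print(reviews['299'][0]['"Open by default" (1)'])
```

For a real file, something like open('./dat.txt', newline='') could be passed in place of the StringIO object. If the columns turn out to be separated by spaces rather than tabs, this sketch would not apply as written.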

Regular expressions are the ticket. You can do it with two patterns. Something like this:

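import re

with open('./dat.txt') as fin:
    for line in fin:
        # First line of a record: identifier, reviewer number, then the quoted text.
        # \s+ accepts either a tab or a space between the columns.
        ma = re.match(r'^(\d+)\s+\d.+Overall evaluation', line)
        if ma:
            print("record identifier %r" % ma.group(1))
            # No continue here: this line also carries a score.
        # Trailing score; the last line of a record ends with a closing quote.
        ma = re.search(r':\s*(\d+)"?\s*$', line)
        if ma:
            print(ma.group(1))
            continue
        print("unrecognized line: %s" % line)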

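Note: the final print statement is not part of what you asked for, but whenever I debug regular expressions I always add some kind of catch-all to help spot bad patterns. Once my patterns are right, I remove the catch-all.

Source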
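To get from two patterns to the layout asked for in the question (each identifier once, then one "=a+b+..." line per review), the matches can be collected instead of printed straight away. The following is only a sketch: the names ID_RE, SCORE_RE and format_records are invented here, and the sample lines assume tab-separated columns.

```python
import re

ID_RE = re.compile(r'^(\d+)\s+\d+\s+"')       # e.g. 299<TAB>1<TAB>"Overall evaluation: 3
SCORE_RE = re.compile(r':\s*(\d+)\s*"?\s*$')  # trailing score, optional closing quote


def format_records(lines):
    """Return output lines: each new identifier once, then one '=a+b+...' line per review."""
    out = []
    last_id = None
    scores = []
    for line in lines:
        id_match = ID_RE.match(line)
        if id_match:
            # A new review starts: flush the scores of the previous one.
            if scores:
                out.append('=' + '+'.join(scores))
                scores = []
            if id_match.group(1) != last_id:
                last_id = id_match.group(1)
                out.append(last_id)
        # The identifier line also carries a score, so always check for one.
        score_match = SCORE_RE.search(line)
        if score_match:
            scores.append(score_match.group(1))
    if scores:
        out.append('=' + '+'.join(scores))
    return out


# Hypothetical shortened sample in the assumed layout.
sample = [
    '299\t1\t"Overall evaluation: 3\n',
    'Invite to interview: 3\n',
    'Appropriateness of the budget to realise the idea: 3"\n',
    '299\t2\t"Overall evaluation: 2\n',
    'Invite to interview: 4"\n',
    '\n',
    '364\t1\t"Overall evaluation: 1"\n',
]
print('\n'.join(format_records(sample)))
```

On this sample the function returns 299, =3+3+3, =2+4, 364, =1 as separate lines. With a real file, the open file object can be passed directly, since iterating over it yields lines.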

2016-01-23 17:44:52 user590028
