2017-09-13 126 views
0

Extract text from each line of a text file

I have a file that looks like this:

Breve, a writ; used more frequently in the plural brevia. 
Brevia magistralia, official writs framed by the clerks in 
chancery to meet new injuries, to which the old forms of action 
were inapplicable. Sea Trespass on the case. Brevia testata, 
short attested memoranda, originally introduced to obviate the 
uncertainty arisina; from parol feoffments, hence modern con- 
veyances have gradually arisen. 

I want to extract the words that appear before the first comma (,) on each line.

Expected output:

Breve 
Brevia magistralia 
chancery to meet new injuries 
were inapplicable. Sea Trespass on the case. Brevia testata 
short attested memoranda 
uncertainty arisina; from parol feoffments 

My code:

with open('test.txt', 'r') as file:
    for line in file:
        print(line[0:line.find(',')])

Output:

Breve 

Any help is appreciated.

+0

Demo: https://regex101.com/r/YhNuVd/2 – Gurman

+0

I get more output than that. – Goyo

+0

Decide whether an answer was helpful; you can then accept the best one: https://stackoverflow.com/help/someone-answers –

Answers

1

Why do you need a regex? str.split should be good enough.

with open('test.txt', 'r') as file:
    for line in file:
        text = line.split(',', 1)[0]  # maxsplit=1: stop after the first comma
        ...  # do something with text

But if you really do need a regex, you could use something like:

import re

for line in file:
    m = re.match('[^,]+', line)
    if m:
        text = m.group(0)

[^,]+ matches everything from the start of the line up to, but not including, the first comma (credits).
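To make the comparison concrete, here is a minimal sketch that runs both approaches on in-memory strings instead of test.txt. The helper names first_field_split and first_field_regex are made up for this illustration and are not part of the answer.

```python
import re

def first_field_split(line):
    # maxsplit=1: stop splitting after the first comma
    return line.split(',', 1)[0]

def first_field_regex(line):
    # [^,]+ consumes one or more non-comma characters from the start
    m = re.match(r'[^,]+', line)
    return m.group(0) if m else ''

sample = "Brevia magistralia, official writs framed by the clerks in"
print(first_field_split(sample))
print(first_field_regex(sample))
```

Both helpers return the whole line when it contains no comma, so a separate check such as ',' in line is still needed if such lines should be skipped.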

+1

No need for a lazy dot and a lookahead; re.match('[^,]+', line) will do. –

+0

@WiktorStribiżew Thanks for helping me improve my regex. :-) –

+0

@coldspeed I tried your approaches and neither gets past the first line. My output is 'Breve' – Ashksta

1

A re.findall() solution:

import re 
with open('test.txt', 'r') as f: 
    result = re.findall(r'^[^,]+(?=,)', f.read(), re.M) # extracting the needed words 
    print('\n'.join(result)) 

Output:

Breve 
Brevia magistralia 
chancery to meet new injuries 
were inapplicable. Sea Trespass on the case. Brevia testata 
short attested memoranda 
uncertainty arisina; from parol feoffments 
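One caveat with reading the whole file at once, sketched here on a made-up string rather than test.txt: the class [^,] also matches newline characters, so a comma-free line that is followed by a line with a comma gets merged into a single match. The sample file in the question has its only comma-free line at the very end, so this does not show up there. Excluding '\n' from the class is one possible way to keep each match on a single line; the variable names below are illustrative only.

```python
import re

text = "veyances have gradually arisen.\nBreve, a writ\nBrevia magistralia, official writs\n"

# [^,] also matches '\n', so the first match runs across the line break
loose = re.findall(r'^[^,]+(?=,)', text, re.M)

# excluding '\n' keeps every match on one line
strict = re.findall(r'^[^,\n]+(?=,)', text, re.M)

print(loose)
print(strict)
```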
0

I tested your code, and with the input from your question I got the correct output.

Output:

Breve 
Brevia magistralia 
chancery to meet new injuries 
were inapplicable. Sea Trespass on the case. Brevia testata 
short attested memoranda 
uncertainty arisina; from parol feoffments 
veyances have gradually arisen. 

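So make sure your input file itself is correct.

Perhaps your test file contains no newlines, i.e. the whole paragraph is written on a single line. In that case only the text before the first comma is printed, and nothing after it.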

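Note: the last line contains no comma, so line.find(',') returns -1 and line[0:-1] prints the whole line except its last character (normally the newline). This differs from your expected output.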
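The single-line explanation can be checked with a small sketch. It uses io.StringIO objects as stand-ins for test.txt, and the function name before_first_comma is hypothetical; the shortened sample strings are also made up for the demonstration.

```python
import io

def before_first_comma(handle):
    """Collect the text before the first comma on each line that has one."""
    fields = []
    for line in handle:
        index = line.find(',')
        if index >= 0:  # skip lines with no comma
            fields.append(line[:index])
    return fields

multi_line = io.StringIO("Breve, a writ; used\nBrevia magistralia, official writs\n")
single_line = io.StringIO("Breve, a writ; used Brevia magistralia, official writs")

print(before_first_comma(multi_line))   # one result per line
print(before_first_comma(single_line))  # a single result: the "file" is one line
```

If the real file behaves like single_line here, counting its lines (for example with len(open('test.txt').readlines())) should confirm it.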

+0

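@COLDSPEED I tried the code and it works; the file seems to be written as a single line –

+0

Please check the output @AhmedElkoussy, you are wrong – Sanket

+0

@Sanket Thanks for the feedback. Could you elaborate on what is wrong with the output? I got it from my PyCharm terminal –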

1

You are nearly there; just make this modification:

with open('test.txt', 'r') as fd:
    for line in fd:
        index = line.find(',')
        if index >= 0:  # skip lines that contain no comma
            print(line[0:index])

OUTPUT:

Breve 
Brevia magistralia 
chancery to meet new injuries 
were inapplicable. Sea Trespass on the case. Brevia testata 
short attested memoranda 
uncertainty arisina; from parol feoffments

+0

Modify your logic @Ashksta – Sanket

1

As an additional answer, you can also use re.search here:

import re
with open('test.txt', 'r') as file:
    for line in file:
        # print(line)
        result = re.search(r'^[^,]+(?=,)', line)
        if result:
            text = result.group(0)
            print(text)

Output:

Breve 
Brevia magistralia 
chancery to meet new injuries 
were inapplicable. Sea Trespass on the case. Brevia testata 
short attested memoranda 
uncertainty arisina; from parol feoffments