2016-12-02 83 views

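Regex combined with a list of numbers written as words

I want to extract information about injured people from several news articles. The problem is that news language conveys this information in different ways, because the count can be written in digits or in words.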

For example:

`Security forces had *wounded two* gunmen inside the museum but that two or three accomplices might still be at large.` 

`The suicide bomber has wounded *four men* last night.` 

`*Dozens* were wounded in a terrorist attack.` 

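I noticed that, most of the time, the numbers from 1 to 10 are written as words rather than digits. I would like to know how to extract them without producing convoluted code, for instance by simply listing the words for 1-10 in the regex.

Should I use a list? How would it be included in the pattern?

This is the pattern I have used so far to extract the number of wounded people when it is written in digits: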

import re

# Read the whole article file into one string.
with open("News") as text_open:
    text_read = text_open.read()
pattern = r"wounded (\d+)|(\d+) were wounded|(\d+) injured|(\d+) people were wounded|wounding (\d+)|wounding at least (\d+)"
result = re.findall(pattern, text_read)
print(result)
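
One thing worth knowing about the pattern above: because it has six capture groups, `re.findall` returns one six-element tuple per match, with empty strings for the groups that did not take part. The following is a minimal sketch of collapsing each tuple to the single digit string that matched. The helper name `wounded_counts` and the sample sentence are made up for illustration.

```python
import re

# Same alternatives as the pattern above, one capture group each.
pattern = (r"wounded (\d+)|(\d+) were wounded|(\d+) injured"
           r"|(\d+) people were wounded|wounding (\d+)|wounding at least (\d+)")

def wounded_counts(text):
    """Collapse each findall tuple to the one group that actually matched."""
    # Every alternative contains exactly one (\d+), so each tuple has
    # exactly one non-empty entry.
    return [next(g for g in groups if g) for groups in re.findall(pattern, text)]

# Made-up sample text, not taken from a real article.
sample = "The blast wounded 4 men. Later, 12 were wounded and 3 injured."
print(wounded_counts(sample))  # ['4', '12', '3']
```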

Answer


Try this:

import re

# Either a word followed by "were", or a word of 3+ characters right after "wounded"/"injured".
regex = r"\w+\s(?=were)|(?<=wounded|injured)\s\w{3,}"

test_str = ("`Security forces had wounded two gunmen inside the museum but that two or three accomplices might still be at large.`\n\n"
    "`The suicide bomber has wounded four men last night.`\n\n"
    "`Dozens were wounded in a terrorist attack.`")

matches = re.finditer(regex, test_str)

for match in matches:
    # Each match carries one adjacent space, so strip it before printing.
    print(match.group().strip())

Output:

two 
four 
Dozens 

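`\w+\s(?=were)`: `?=` is a lookahead that requires `were` to come next, and `\w+` captures the word in front of it.

`|`: or

`(?<=wounded|injured)\s\w{3,}`: `?<=` is a lookbehind that requires `wounded` or `injured` immediately before the match, and `{3,}` means the word must be 3 characters or longer. That is only there to avoid capturing short words such as `in`; every number word has a minimum length of 3, so it can be used here. Python's `re` only accepts fixed-width lookbehinds, and this one works because `wounded` and `injured` are both seven letters long.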
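
On the question of whether a list can be used: one possible approach, different from the lookaround pattern above, is to join a Python list of number words into an alternation and place it wherever `\d+` would go. The sketch below assumes the word list and the trigger phrases shown in it; `NUMBER_WORDS`, `PATTERN` and `extract_counts` are hypothetical names, and the phrases are modelled on the pattern in the question rather than on any exhaustive list.

```python
import re

# Assumed word list; extend it as needed (e.g. "eleven", "hundreds").
NUMBER_WORDS = ["one", "two", "three", "four", "five", "six",
                "seven", "eight", "nine", "ten", "dozens"]

# A single alternative that matches either digits or any word in the list.
NUMBER = r"(?:\d+|" + "|".join(map(re.escape, NUMBER_WORDS)) + r")"

# Hypothetical trigger phrases: the count may follow or precede the verb.
PATTERN = re.compile(
    rf"\b(?:wounded|wounding|injured)\s+(?:at\s+least\s+)?(?P<after>{NUMBER})\b"
    rf"|\b(?P<before>{NUMBER})\s+(?:people\s+)?(?:were\s+)?(?:wounded|injured)\b",
    re.IGNORECASE,
)

def extract_counts(text):
    """Return the digit or word counts found next to wounded/injured."""
    return [m.group("after") or m.group("before") for m in PATTERN.finditer(text)]

print(extract_counts("The suicide bomber has wounded four men last night."))
print(extract_counts("Dozens were wounded in a terrorist attack."))
```

Unlike the `{3,}` length filter, this only accepts words that are actually in the list, so a phrase such as `wounded the` is not reported. The trade-off is that any number word missing from the list is silently skipped.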