How to preserve repeated punctuation marks when parsing a string in Python?

I need to process a small amount of text (that is, strings in Python).

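I want to remove certain punctuation marks (such as '.', ',', ':', ';') but keep the punctuation that conveys emotion (such as '...', '?', '??', '???', '!', '!!', '!!!').

I also want to remove uninformative words such as 'a', 'an', 'the'. The biggest challenge so far, though, is how to parse "I've" or "we've" so that I end up with "I have" and "we have". The apostrophe is what makes this difficult for me.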

What is the best/simplest way to do this in Python?

For example:

"I've got an A mark!!! Such a relief... I should've partied more."

The result I want:

['I', 'have', 'got', 'A', 'mark', '!!!', 'Such', 'relief', '...', 

'I', 'should', 'have', 'partied', 'more']

Source

2016-02-12 Oleksandra

What have you tried so far to do this? –

Yes! I have tried several regular expressions, but each one achieves one goal or another, never all of them. – Oleksandra

Then post them and explain what goes wrong; maybe someone can help fix them. –

This can get complicated, depending on how many rules you want to apply.

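You can use \b in a regular expression to match the start or end of a word. With that you can also isolate punctuation and then check whether it is a single character from a list, such as [.;:].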
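To see what \b does on its own, here is a minimal sketch (the sample string is made up for illustration). Putting a space at every word boundary detaches runs of punctuation from the neighbouring words, so a plain whitespace split then returns them as separate tokens:

```python
import re

sample = "mark!!! Such a relief..."  # hypothetical sample text

# \b matches the empty position between a word character and a non-word character
spaced = re.sub(r'\b', ' ', sample)

# str.split() with no argument splits on runs of whitespace and drops empty strings
tokens = spaced.split()
print(tokens)  # ['mark', '!!!', 'Such', 'a', 'relief', '...']
```

Note that '!!!' and '...' stay together as one token each, because there is no word boundary inside a run of punctuation.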

These ideas are used in this code:

import re 

def tokenise(txt): 
    # Expand "'ve" 
    txt = re.sub(r"(?i)(\w)'ve\b", r'\1 have', txt) 
    # Separate punctuation from words 
    txt = re.sub(r'\b', ' ', txt) 
    # Remove isolated, single-character punctuation, 
    # and articles (a, an, the) 
    txt = re.sub(r'(^|\s)([.;:]|[Aa]n|a|[Tt]he)($|\s)', r'\1\3', txt)  
    # Split into non-empty strings 
    return [word for word in re.split(r'\s+', txt) if word]

# Example use 
txt = "I've got an A mark!!! Such a relief... I should've partied more." 
words = tokenise(txt) 
print(','.join(words))

Output:

I,have,got,A,mark,!!!,Such,relief,...,I,should,have,partied,more

See it run on eval.in
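As a possible variation on the same idea, the tokens to keep can be described directly with re.findall instead of being separated with \b first. This is only a sketch under the rules stated in the question; the names tokenise_findall, TOKEN and ARTICLES are made up for illustration, and the pattern assumes that only runs of '!' or '?' and ellipses of three or more dots should be kept:

```python
import re

# Assumed rule set, taken from the question: drop the articles, keep runs of
# '!' or '?' and the ellipsis, skip every other punctuation character.
ARTICLES = {'a', 'an', 'An', 'the', 'The'}  # "A" is left out so the grade "A" survives
TOKEN = re.compile(r"\w+|[!?]+|\.{3,}")     # words, runs of ! or ?, ellipses

def tokenise_findall(txt):
    # Expand the "'ve" contraction first, as in the code above
    txt = re.sub(r"(?i)(\w)'ve\b", r"\1 have", txt)
    # Characters the pattern does not match (single . , : ; and so on) are skipped
    return [tok for tok in TOKEN.findall(txt) if tok not in ARTICLES]

print(tokenise_findall("I've got an A mark!!! Such a relief... I should've partied more."))
```

Because unmatched characters are simply skipped, this version also drops commas, which the question lists among the punctuation to remove. Other contractions are not handled: "don't" would come out as 'don' and 't', so further expansion rules would be needed for those.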

Source

2016-02-12 20:43:00 trincot
