2016-04-25 47 views
0

I am looking for a way to go through a sentence and work out whether each apostrophe marks a quotation or a contraction, so that I can strip the punctuation from the string and then normalize all the words.

My test sentence is: don't frazzel the horses. 'she said wow'.

In my efforts so far I have split the sentence into word and non-word parts, tokenizing it like this:

contractionEndings = ["d", "l", "ll", "m", "re", "s", "t", "ve"] 

sentence = "don't frazzel the horses. 'she said wow'.".split(/(\w+)|(\W+)/i).reject! { |word| word.empty? } 

This returns ["don", "'", "t", " ", "frazzel", " ", "the", " ", "horses", ". '", "she", " ", "said", " ", "wow", "'."]

Next I want to iterate over the sentence looking for apostrophes '. When one is found, compare the following element to see whether it is contained in the contractionEndings array. If it is, I want to join the prefix, the apostrophe ', and the suffix into a single index; otherwise delete the apostrophe.

In this example don, ', and t would be joined into don't as a single index, but . ' and '. would be removed.
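The join-or-delete pass described above can be sketched like this (my own sketch; the variable names are illustrative and not from the question):

```ruby
contraction_endings = %w[d l ll m re s t ve]

tokens = "don't frazzel the horses. 'she said wow'."
         .split(/(\w+)|(\W+)/).reject(&:empty?)

joined = []
i = 0
while i < tokens.size
  if tokens[i] =~ /\A\w+\z/ && tokens[i + 1] == "'" &&
     contraction_endings.include?(tokens[i + 2])
    joined << "#{tokens[i]}'#{tokens[i + 2]}"  # rejoin "don" + "'" + "t"
    i += 3
  else
    joined << tokens[i].delete("'")            # drop stray apostrophes
    i += 1
  end
end
joined
#=> ["don't", " ", "frazzel", " ", "the", " ", "horses", ". ", "she", " ", "said", " ", "wow", "."]
```

The remaining punctuation tokens survive this pass, ready for the regex cleanup step.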

After that I can run a regex to remove the remaining punctuation from the sentence, so I can pass it to my stemmer to normalize the input.

The final output I am after is don't frazzel the horses she said wow, where all punctuation except the apostrophes in contractions has been removed.

If anyone has any suggestions to make this work, or a better idea of how to approach the problem, I would like to hear it.

Overall, I want to remove all punctuation from the sentence except for contractions.
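As a sketch of an alternative (my own, not from the question), the whole goal can be reached in a single pass that keeps an apostrophe only when it sits between word characters; note it shares the possessive caveat, so "Chris'" would lose its apostrophe:

```ruby
sentence = "don't frazzel the horses. 'she said wow'."

# Remove apostrophes not flanked by word characters on both sides,
# then strip every remaining non-word, non-space character.
cleaned = sentence
          .gsub(/(?<!\w)'|'(?!\w)/, '')  # quotes at word edges
          .gsub(/[^\w\s']/, '')          # all other punctuation
          .squeeze(' ')
          .strip
#=> "don't frazzel the horses she said wow"
```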

Thanks

+0

What result do you want to get in the end? – Ilya

+0

@Ilya don't frazzel the horses she said wow –

+2

Why the rush to select an answer? Why not wait at least until those working on answers have had a chance to post? –

Answers

1

As I mentioned in a comment, I think trying to list all possible contraction endings is futile. In fact, some contractions, such as "couldn't've", contain more than one apostrophe.

Another option is to match single quotes. My first thought was to remove the character "'" when it is at the beginning of the sentence or follows a whitespace character, or when it precedes a whitespace character or is at the end of the sentence. Unfortunately, that approach is thwarted by possessives of words ending in "s": "Chris' cat has fleas". Worse still, how would we interpret "Where's Chris' 'car'?" or "'Twas the 'night before Christmas'."?

Here is a way to remove single quotes that appear at the start or end of a word (those being the problematic values), while leaving apostrophes inside words intact.

r = /
    (?<=\A|\s)  # match the beginning of the string or a whitespace char
                # in a positive lookbehind
    \'          # match a single quote
    |           # or
    \'          # match a single quote
    (?=\s|\z)   # match a whitespace char or the end of the string
                # in a positive lookahead
    /x          # free-spacing regex definition mode

"don't frazzel the horses. 'she said wow'".gsub(r,'') 
    #=> "don't frazzel the horses. she said wow" 

I think the best solution would be for English to adopt distinct symbols for apostrophes and single quotes.

0

Normally the apostrophe stays with the contraction after tokenization.

Try a proper NLP tokenizer, e.g. nltk in Python:

>>> from nltk import word_tokenize 
>>> word_tokenize("don't frazzel the horses") 
['do', "n't", 'frazzel', 'the', 'horses'] 

For multiple sentences:

>>> from string import punctuation 
>>> from nltk import sent_tokenize, word_tokenize 
>>> text = "don't frazzel the horses. 'she said wow'." 
>>> sents = sent_tokenize(text) 
>>> sents 
["don't frazzel the horses.", "'she said wow'."] 
>>> [word for word in word_tokenize(sents[0]) if word not in punctuation] 
['do', "n't", 'frazzel', 'the', 'horses'] 
>>> [word for word in word_tokenize(sents[1]) if word not in punctuation] 
["'she", 'said', 'wow'] 

To flatten the tokenized sentences into a single list after word_tokenize:

>>> from itertools import chain 
>>> sents 
["don't frazzel the horses.", "'she said wow'."] 
>>> [word_tokenize(sent) for sent in sents] 
[['do', "n't", 'frazzel', 'the', 'horses', '.'], ["'she", 'said', 'wow', "'", '.']] 
>>> list(chain(*[word_tokenize(sent) for sent in sents])) 
['do', "n't", 'frazzel', 'the', 'horses', '.', "'she", 'said', 'wow', "'", '.'] 
>>> [word for word in list(chain(*[word_tokenize(sent) for sent in sents])) if word not in punctuation] 
['do', "n't", 'frazzel', 'the', 'horses', "'she", 'said', 'wow'] 

Note that the single quote stays attached in 'she. Sadly, the simple task of tokenization still has its weaknesses amid all of today's hype around sophisticated (deep) machine-learning methods =(

It makes the same mistakes even with text using formal grammar:

>>> text = "Don't frazzel the horses. 'She said wow'." 
>>> sents = sent_tokenize(text) 
>>> sents 
["Don't frazzel the horses.", "'She said wow'."] 
>>> [word_tokenize(sent) for sent in sents] 
[['Do', "n't", 'frazzel', 'the', 'horses', '.'], ["'She", 'said', 'wow', "'", '.']] 
1

You can use the Pragmatic Tokenizer gem. It can detect English contractions:

s = "don't frazzel the horses. 'she said wow'." 
PragmaticTokenizer::Tokenizer.new(punctuation: :none).tokenize(s) 
=> ["don't", "frazzel", "the", "horses", "she", "said", "wow"] 

s = "'Twas the 'night before Christmas'." 
PragmaticTokenizer::Tokenizer.new(punctuation: :none).tokenize(s) 
=> ["'twas", "the", "night", "before", "christmas"] 

s = "He couldn’t’ve been right." 
PragmaticTokenizer::Tokenizer.new(punctuation: :none).tokenize(s) 
=> ["he", "couldn’t’ve", "been", "right"] 
+0

PS - Pragmatic Tokenizer also has an [expand contractions](https://github.com/diasks2/pragmatic_tokenizer#expand_contractions) option. – diasks2