2017-03-07

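Filtering duplicate substrings in a hash in Ruby

I'm writing a Rails application that fetches RSS feeds from news pages, applies part-of-speech tagging to the headlines, and extracts the noun phrases from the headlines together with the number of times each one occurs. I need to filter out noun phrases that are part of other noun phrases, and I'm using this code to do so: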

filtered_noun_phrases = sorted_noun_phrases.select{|a| 
    sorted_noun_phrases.keys.any?{|b| b != a and a.index(b) } }.to_h 

So something like this:

{"troops retake main government office"=>2, 
"retake main government office"=>2, "main government office"=>2} 

should become just:

{"troops retake main government office"=>2} 

However, a sorted hash of noun phrases such as this:

{"troops retake main government office"=>2, "chinese students fighting racism"=>2, 
"retake main government office"=>2, "mosul retake government base"=>2, 
"toddler killer shot dead"=>2, "students fighting racism"=>2, 
"retake government base"=>2, "main government office"=>2, 
"white house tourists"=>2, "horn at french zoo"=>2, "government office"=>2, 
"cia hacking tools"=>2, "killer shot dead"=>2, "government base"=>2, 
"boko haram teen"=>2, "horn chainsawed"=>2, "fighting racism"=>2, 
"silver surfers"=>2, "house tourists"=>2, "natural causes"=>2, 
"george michael"=>2, "instagram fame"=>2, "hacking tools"=>2, 
"iraqi forces"=>2, "mosul battle"=>2, "own wedding"=>2, "french zoo"=>2, 
"haram teen"=>2, "hacked tvs"=>2, "shot dead"=>2} 

is instead only partially filtered:

{"troops retake main government office"=>2, "chinese students fighting racism"=>2, 
"retake main government office"=>2, "mosul retake government base"=>2, 
"toddler killer shot dead"=>2, "students fighting racism"=>2, 
"retake government base"=>2, "main government office"=>2, 
"white house tourists"=>2, "horn at french zoo"=>2, 
"cia hacking tools"=>2, "killer shot dead"=>2, 
"boko haram teen"=>2} 

So how can I filter duplicate substrings out of a hash in a way that actually works?


Maybe this: filtered_noun_phrases = sorted_noun_phrases.reject{|a| sorted_noun_phrases.keys.any?{|b| b != a and b.index(a) } }.to_h – trueunlessfalse


Thanks! In hindsight this seems like a silly question, but I tried that earlier and it removed the longer phrases and left the substrings... –


Perhaps worth mentioning that I didn't only change select to reject, but also a.index(b) to b.index(a). – trueunlessfalse

Answers


What you are currently doing is selecting every phrase for which some other phrase exists that is a substring of it.

For "troops retake main government office" this is true, because we find "retake main government office" inside it.

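However, for "retake main government office" we still find "main government office" inside it, so it is kept as well rather than being filtered out.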
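The behaviour described above can be reproduced on a small sample. This is only a sketch: the `phrases` hash is a hypothetical subset of the data in the question, and `include?` stands in for the `index` check.

```ruby
# Hypothetical sample, using a few phrases from the question.
phrases = {
  "troops retake main government office" => 2,
  "retake main government office" => 2,
  "main government office" => 2,
  "government office" => 2
}

# Original approach: keep a phrase if it CONTAINS some other phrase.
kept_by_select = phrases.select do |phrase, _count|
  phrases.keys.any? { |other| other != phrase && phrase.include?(other) }
end

# Every phrase that has a shorter phrase inside it survives, so the
# intermediate phrases are kept and only the shortest one is dropped.
puts kept_by_select.keys
```

With this sample only "government office" is removed, which matches the partial filtering seen in the question.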

Instead, do for example:

filtered_noun_phrases = sorted_noun_phrases.reject{|a| sorted_noun_phrases.keys.any?{|b| b != a and b.index(a) } }.to_h 

This way you reject every phrase that is contained in some other phrase.
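As a minimal sketch of the same idea wrapped in a method (the name `drop_contained_phrases` and the sample hash are made up for illustration, and `include?` is used in place of `index`):

```ruby
# Hypothetical helper: removes every phrase that occurs inside another phrase.
def drop_contained_phrases(phrases)
  keys = phrases.keys
  phrases.reject do |phrase, _count|
    # Reject the phrase if some OTHER phrase contains it as a substring.
    keys.any? { |other| other != phrase && other.include?(phrase) }
  end
end

# Hypothetical sample, using a few phrases from the question.
sample = {
  "troops retake main government office" => 2,
  "retake main government office" => 2,
  "main government office" => 2,
  "cia hacking tools" => 2,
  "government office" => 2,
  "hacking tools" => 2
}

filtered = drop_contained_phrases(sample)
puts filtered.inspect
```

Note that this is a plain substring test, so a phrase would also be dropped if it happened to match part of a word in a longer phrase; whether that matters depends on the data.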


Thanks, I've accepted this answer! –

filtered_noun_phrases = sorted_noun_phrases.reject{|a| sorted_noun_phrases.keys.any?{|b| b != a and b.index(a) } }.to_h 

- trueunlessfalse


Ah, thank you! I just added an answer above with a bit of explanation. Cheers. – trueunlessfalse