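Filter duplicate substrings in a hash in Ruby
I'm writing a Rails application that fetches RSS feeds from news pages, applies part-of-speech tagging to the headlines, and extracts the noun phrases from the headlines along with the number of times each one occurs. I need to filter out noun phrases that are part of other noun phrases, and I'm using this code to do so: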
filtered_noun_phrases = sorted_noun_phrases.select{|a|
sorted_noun_phrases.keys.any?{|b| b != a and a.index(b) } }.to_h
So something like this:
{"troops retake main government office"=>2,
"retake main government office"=>2, "main government office"=>2}
should become just:
{"troops retake main government office"=>2}
However, a sorted hash of noun phrases such as this one:
{"troops retake main government office"=>2, "chinese students fighting racism"=>2,
"retake main government office"=>2, "mosul retake government base"=>2,
"toddler killer shot dead"=>2, "students fighting racism"=>2,
"retake government base"=>2, "main government office"=>2,
"white house tourists"=>2, "horn at french zoo"=>2, "government office"=>2,
"cia hacking tools"=>2, "killer shot dead"=>2, "government base"=>2,
"boko haram teen"=>2, "horn chainsawed"=>2, "fighting racism"=>2,
"silver surfers"=>2, "house tourists"=>2, "natural causes"=>2,
"george michael"=>2, "instagram fame"=>2, "hacking tools"=>2,
"iraqi forces"=>2, "mosul battle"=>2, "own wedding"=>2, "french zoo"=>2,
"haram teen"=>2, "hacked tvs"=>2, "shot dead"=>2}
is instead only partially filtered:
{"troops retake main government office"=>2, "chinese students fighting racism"=>2,
"retake main government office"=>2, "mosul retake government base"=>2,
"toddler killer shot dead"=>2, "students fighting racism"=>2,
"retake government base"=>2, "main government office"=>2,
"white house tourists"=>2, "horn at french zoo"=>2,
"cia hacking tools"=>2, "killer shot dead"=>2,
"boko haram teen"=>2}
So, how can I filter duplicate substrings out of a hash in a way that actually works?
Maybe this: filtered_noun_phrases = sorted_noun_phrases.reject { |a| sorted_noun_phrases.keys.any? { |b| b != a and b.index(a) } }.to_h – trueunlessfalse
Thanks! In hindsight this seems like a silly question, but I tried that earlier and it removed the longer phrases and left the substrings... –
It may be worth mentioning that I didn't just change select to reject, I also changed a.index(b) to b.index(a). – trueunlessfalse
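For reference, the approach suggested in the comments can be sketched as a small standalone script. This is only a minimal sketch: the helper name drop_contained_phrases and the shortened sample hash are made up for illustration, and String#include? is used in place of index purely for readability. The original select version keeps a phrase when it contains some other phrase, which is the opposite of what is wanted. Here a phrase is rejected when some other key contains it.

```ruby
# Hypothetical helper (name is an assumption, not from the original post).
# Drops every phrase that appears as a substring of another, different phrase.
def drop_contained_phrases(phrases)
  keys = phrases.keys
  phrases.reject do |phrase, _count|
    # Reject this phrase if any *other* key contains it.
    keys.any? { |other| other != phrase && other.include?(phrase) }
  end
end

# Shortened sample data based on the hash in the question.
noun_phrases = {
  "troops retake main government office" => 2,
  "retake main government office" => 2,
  "main government office" => 2,
  "government office" => 2,
  "shot dead" => 2
}

filtered = drop_contained_phrases(noun_phrases)
p filtered
```

With this sample only "troops retake main government office" and "shot dead" should remain. Note that this is a plain substring check, so a phrase like "house" would also be dropped if another key contained "houses". If only whole-word matches should count, the comparison would need to be adjusted, for example by comparing arrays of words.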