2013-10-30 56 views

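I want to display a message to the user based on a file's contents. So far I have read the file and split it into an array of words: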

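# The block form of File.open closes the file handle automatically.
file1_contents = File.open("spam1.txt", "rb") { |f| f.read }
# Split the contents on whitespace into an array of words.
file1 = file1_contents.split(' ')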

I can count the frequency of each word using a hash, and sort the words by number of occurrences:

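freqs1 = Hash.new(0)   # missing keys default to 0
file1.each { |word| freqs1[word] += 1 }
# sort_by returns an array of [word, count] pairs, ascending by count
freqs1 = freqs1.sort_by { |_word, count| count }
freqs1.reverse!        # highest counts first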

I can also output the results to the user like this:

freqs1.each { |word, freq| puts "#{word} #{freq}" }
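
On Ruby 2.7 or later, the counting and sorting steps above could also be sketched with Enumerable#tally. This is only an alternative spelling of the same idea; the sample word list is made up for illustration and stands in for the words read from spam1.txt.

```ruby
# Hypothetical sample input standing in for the words read from spam1.txt.
words = %w[money fund money transfer money fund]

# Enumerable#tally (Ruby 2.7+) builds the word => count hash in one call.
freqs = words.tally

# Negating the count sorts in descending order without a separate reverse!.
sorted = freqs.sort_by { |_word, count| -count }

sorted.each { |word, count| puts "#{word} #{count}" }
```

As with the original code, the result of sort_by is an array of [word, count] pairs rather than a hash.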

I want to display a message to the user if the array file1 or the hash freqs1 contains certain words several times.

I had a (bad) idea: iterate over the freqs1 hash and display the appropriate message to the user:

freqs1.each{|word,freq| 
    if ((word == ('business' || 'fund' || 'funds' || 'account' ||'transfer' || 'money')) && freq > 2) || (word == 'Iraq' && freq > 1) then 
     puts "File 1 is suspected as spam mail - suspicious word frequency" 
    else 
     puts "File 1 does not appear to be spam email" 
    end 
} 

However, this is silly of me, because it acts on every element in the hash and so prints a message for each word.
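
Separately from the per-element output, the condition itself likely does not test what was intended. A small sketch of why, using a shortened word list for illustration: a parenthesised chain of || returns its first truthy operand, and every Ruby string is truthy.

```ruby
# The || chain evaluates to its first truthy operand, i.e. just 'business'.
candidates = ('business' || 'fund' || 'funds')
puts candidates.inspect   # "business"

# So the comparison never matches any of the other words.
word = 'fund'
puts word == ('business' || 'fund' || 'funds')   # false

# Array#include? checks membership in the whole list instead.
puts %w[business fund funds].include?(word)      # true
```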

If words such as business, fund, funds, account, etc. appear more than twice, how can I display a particular message to the user?

Thanks for any help.

Answers

1

If you only want to improve that last statement, try this (untested, but it should work):

bad_words = %w{business fund funds account transfer money} 
is_spam = freqs1.any? do |word, freq| 
    (freq > 2 && bad_words.include?(word)) || (word == 'Iraq' && freq > 1) 
end 

if is_spam 
    puts "File 1 is suspected as spam mail - suspicious word frequency" 
else 
    puts "File 1 does not appear to be spam email" 
end 

Enumerable#any? will do most of the work for you, and extracting the list of bad words also helps readability.
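
To see the behaviour in isolation, the same check can be wrapped in a small method and run against sample data. The method name spam? and the sample hashes below are hypothetical and exist only for this sketch. Because the block takes two parameters, it works both on a hash and on the array of [word, count] pairs that sort_by produces.

```ruby
# Hypothetical helper; the name and default word list are illustrative only.
def spam?(freqs, bad_words = %w[business fund funds account transfer money])
  # any? stops at the first pair for which the block returns true.
  freqs.any? do |word, freq|
    (freq > 2 && bad_words.include?(word)) || (word == 'Iraq' && freq > 1)
  end
end

# Made-up frequency tables for demonstration.
spammy = { 'hello' => 5, 'money' => 3 }
clean  = { 'hello' => 5, 'money' => 2, 'Iraq' => 1 }

puts spam?(spammy)          # true
puts spam?(clean)           # false
puts spam?([['Iraq', 2]])   # true, array of pairs also works
```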

1

I would do something like this:

word_filter = [ 
{count: 2, words: ['business','fund','funds','account','transfer','money']}, 
{count: 1, words: ['iraq']} 
] 

alert  = "File 1 is suspected as spam mail - suspicious word frequency" 
calm_message = "File 1 does not appear to be spam email" 

grouped_words = file1.group_by{|x|x}.map{|x,array|[x,array.size]} 

appears_to_be_spam = grouped_words.any? { |word, count|
    word_filter.any? do |filter|
        filter[:words].include?(word.downcase) && count > filter[:count]
    end
}

puts appears_to_be_spam ? alert : calm_message 
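
One possible variation on this rule table, offered as a sketch rather than as part of the answer: it downcases words before counting (so "Iraq" and "IRAQ" share one count, unlike the code above, which groups case-sensitively and only downcases when matching) and collects the words that tripped a rule. The sample input and the names counts and flagged are assumptions made for illustration.

```ruby
# Each rule pairs a threshold with the words it applies to.
word_filter = [
  { count: 2, words: %w[business fund funds account transfer money] },
  { count: 1, words: %w[iraq] }
]

# Hypothetical sample input; note the mixed case of "Iraq" and "IRAQ".
words = %w[Dear friend Iraq money IRAQ transfer money]

# Count case-insensitively so both spellings are tallied together.
counts = Hash.new(0)
words.each { |w| counts[w.downcase] += 1 }

# Keep the words whose count exceeds the threshold of a rule that lists them.
flagged = counts.select do |word, count|
  word_filter.any? { |rule| rule[:words].include?(word) && count > rule[:count] }
end.keys

if flagged.empty?
  puts "File 1 does not appear to be spam email"
else
  puts "File 1 is suspected as spam mail - suspicious words: #{flagged.join(', ')}"
end
```

With this sample, only "iraq" is flagged: it appears twice against a threshold of one, while "money" appears twice against a threshold of two.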

Thanks, this works. @Nick Veys' answer came earlier, so I had to accept his, but I like this approach. – Tom