2013-03-20 48 views
0

Here is my code for counting word frequencies:

word_arr= ["I", "received", "this", "in", "email", "and", "found", "it", "a", "good", "read", "to", "share......", "Yes,", "Dr", "M.", "Bakri", "Musa", "seems", "to", "know", "what", "is", "happening", "in", "Malaysia.", "Some", "of", "you", "may", "know.", "He", "is", "a", "Malay", "extra horny", "horny nor", "nor their", "their babes", "babes are", "are extra", "extra SEXY..", "SEXY.. .", ". .", ". .It's", ".It's because", "because their", "their CONDOMS", "CONDOMS are", "are Made", "Made In", "In China........;)", "China........;) &&"] 

arr_stop_kwd=["a","and"] 

frequencies = Hash.new(0) 
word_arr.each do |word| 
  key = word.downcase 
  # Skip stop words and any token containing "&&" 
  next if arr_stop_kwd.include?(key) || word.include?('&&') 
  frequencies[key] += 1 
end 

With 100K words of data this takes 9.03 seconds, which is too much time. Is there any other way I can compute this?

Thanks in advance.

Answers

2

Take a look at the Facets gem.

You can do something like this using its frequency method:

require 'facets' 
frequencies = (word_arr-arr_stop_kwd).frequency 

Note that the stop words can be subtracted from word_arr directly. See the Array documentation.
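One caveat: Array subtraction compares strings exactly, so `word_arr - arr_stop_kwd` leaves "A" or "And" in place and does not skip the tokens containing "&&" that the question's loop filters out. If that behavior matters, or if Facets is not available, a plain-Ruby version might look like the sketch below. The helper name `word_frequencies` and the sample data are made up for illustration, and the `Set` only pays off when the stop list is much longer than two entries.

```ruby
require 'set'

# Hypothetical helper (not from the original post): counts words
# case-insensitively, skipping stop words and tokens containing "&&".
def word_frequencies(words, stop_words)
  stops = Set.new(stop_words.map(&:downcase)) # constant-time membership checks
  counts = Hash.new(0)
  words.each do |word|
    next if word.include?('&&')   # same filter as the question's loop
    key = word.downcase
    next if stops.include?(key)
    counts[key] += 1
  end
  counts
end

# Illustrative sample, not the poster's data
sample = ["I", "a", "A", "and", "read", "Read", "China........;) &&"]
freq = word_frequencies(sample, ["a", "and"])
# freq is {"i"=>1, "read"=>2}
```

Downcasing each word once and reusing the result as the hash key also avoids the repeated `downcase` and string interpolation in the original loop.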

+0

Sir, I am using Ruby 1.8.7. When I require 'facets' I get a "stack level too deep" error. How can I solve this? – 2013-03-20 11:06:49

+0

You need to install the gem. Try running 'gem install facets', or add 'facets' to your Gemfile if you are using Bundler. – 2013-03-20 11:20:15

+0

I have already installed it. – 2013-03-20 11:28:19