2017-07-14 52 views
0

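Ruby: searching for matches in a large JSON file

I have a fairly large JSON file containing short lines from a screenplay. I'm trying to match a list of keywords against the lines in the JSON file so that I can pull the matching lines out.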

The JSON file is structured like this:

[ 
"Yeah, well I wasn't looking for a long term relationship. I was on TV. ", 
"Ok, yeah, you guys got to put a negative spin on everything. ", 
"No no I'm not ready, things are starting to happen. ", 
"Ok, it's forgotten. ", 
"Yeah, ok. ", 
"Hey hey, whoa come on give me a hug... " 
] 

(plus many more... 2444 lines in total)

So far I have this, but it isn't producing any matches.

# screenplay is read in from a json file 
@screenplay_lines = JSON.parse(@jsonfile.read) 
@text_to_find = ["relationship","negative","hug"] 

@matching_results = [] 
@screenplay_lines.each do |line|
    if line.match(Regexp.union(@text_to_find))
        @matching_results << line
    end
end

puts "found #{@matching_results.length} matches..." 
puts @matching_results 

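I'm not getting any hits, so I'm not sure what isn't working. I also suspect this is a very expensive way to process a large amount of data. Any ideas? Thanks.

Answers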

1

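Yes, regular expression matching is slower than simply checking whether a line contains a given string. But it also depends on the number of keywords, the length of the lines, and so on. So the best approach is to run at least a micro-benchmark.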

lines = [ 
"Yeah, well I wasn't looking for a long term relationship. I was on TV. ", 
"Ok, yeah, you guys got to put a negative spin on everything. ", 
"No no I'm not ready, things are starting to happen. ", 
"Ok, it's forgotten. ", 
"Yeah, ok. ", 
"Hey hey, whoa come on give me a hug... " 
] 
keywords = ["relationship","negative","hug"] 


def find1(lines, keywords) 
    regexp = Regexp.union(keywords) 

    lines.select { |line| regexp.match(line) } 
end 


def find2(lines, keywords) 
    lines.select { |line| keywords.any? { |keyword| line.include?(keyword) } } 
end 

def find3(lines, keywords) 
    regexp = Regexp.union(keywords) 

    lines.select { |line| regexp.match?(line) } 
end 

require 'benchmark/ips' 

Benchmark.ips do |x| 
    x.compare! 
    x.report('match') { find1(lines, keywords) } 
    x.report('include?') { find2(lines, keywords) } 
    x.report('match?') { find3(lines, keywords) } 
end 

With this setup, the include? variant is much faster:

Comparison: 
      include?: 288083.4 i/s 
       match?: 91505.7 i/s - 3.15x slower 
       match: 65866.7 i/s - 4.37x slower 

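Notes:

  • I moved the creation of the regular expression out of the loop. It doesn't need to be created for every line. Creating a regexp is an expensive operation (your variant runs at about 1/5 the speed of the version with the regexp built outside the loop).
  • match? is only available in Ruby 2.4+ and is faster because it doesn't allocate a match result: it returns only true or false and does not set $~ (no side effects).

I wouldn't worry about performance for 2500 lines of text. If it's fast enough, stop searching for a better solution.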
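
To make the two notes above concrete, here is a minimal sketch. It assumes Ruby 2.4+, and the constant and method names (KEYWORD_PATTERN, SAMPLE_LINE, last_match_after_match, last_match_after_match_p) are made up for illustration and do not come from the original post. The pattern is built once and reused, and the two helpers show that match fills in $~ while match? leaves it alone.

```ruby
# Built once, outside any loop, and reused for every line.
KEYWORD_PATTERN = Regexp.union(%w[relationship negative hug])

SAMPLE_LINE = "Hey hey, whoa come on give me a hug... "

# match allocates a MatchData object and stores it in $~.
def last_match_after_match(pattern, line)
  pattern.match(line)
  $~
end

# match? only answers true/false; $~ stays nil in this fresh method scope.
def last_match_after_match_p(pattern, line)
  pattern.match?(line)
  $~
end

p KEYWORD_PATTERN                                          # /relationship|negative|hug/
p last_match_after_match(KEYWORD_PATTERN, SAMPLE_LINE)     # #<MatchData "hug">
p last_match_after_match_p(KEYWORD_PATTERN, SAMPLE_LINE)   # nil
```

If you only need a yes/no answer per line, as in a select block, the MatchData object built by match is never used, which is why match? can skip that work.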
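
For completeness, here is a rough end-to-end sketch of the include? variant applied to parsed JSON. The helper name matching_lines and the inline SAMPLE_JSON are assumptions for illustration only; in your case the text would come from your own file instead.

```ruby
require 'json'

# Hypothetical helper (not from the original post): parse a JSON array of
# strings and keep the lines that contain at least one of the keywords.
def matching_lines(json_text, keywords)
  lines = JSON.parse(json_text)
  lines.select { |line| keywords.any? { |keyword| line.include?(keyword) } }
end

# Small inline sample so the sketch runs on its own; a real script would
# read the JSON text from the screenplay file instead.
SAMPLE_JSON = <<~TEXT
  [
    "Yeah, well I wasn't looking for a long term relationship. I was on TV. ",
    "Ok, it's forgotten. ",
    "Hey hey, whoa come on give me a hug... "
  ]
TEXT

results = matching_lines(SAMPLE_JSON, %w[relationship negative hug])
puts "found #{results.length} matches..."
puts results
```

Note that include? is case-sensitive, so "Hug" would not match the keyword "hug" in this sketch.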

+0

Thank you. Great insight. The find1() method works fine – matski

0

Here is one possible solution; try this:

json_expressions

+0

Thanks, this looks interesting, but I'd like to see whether a solution with less code is possible before I resort to a third-party gem. – matski