Walking through a string to guess a name from an email address, based on a dictionary of names?

Say I have a dictionary of names (a huge CSV file). I want to guess a name from an email address that has no obvious parseable delimiters (., -, _). I want to do something like this:

dict = ["sam", "joe", "john", "parker", "jane", "smith", "doe"]
word = "johnsmith"
x = 0
y = word.length - 1
name_array = []
for i in x..y
  match_me = word[x..i]
  dict.each do |name|
    if match_me == name
      name_array << name
    end
  end
end

name_array
# => ["john"]

Not bad, but what I want is "john smith" or ["john", "smith"].

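In other words, I walk recursively through the word (i.e., the unparsed email string, "[email protected]") until I find a match in the dictionary. I know: this is very inefficient. If there is an easier way to do this, I'm all ears!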

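If there is no better way, then please show me how to fix the example above, because it has two major flaws: (1) how do I set the length of the loop (see the problem of finding "i" below), and (2) how do I increment "x" in the example above so that I can walk through all possible character combinations of an arbitrary string?

The problem of finding the length of the loop, "i":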

for an arbitrary word, how can we derive "i" given the pattern below? 

    for a (i = 1) 
    a 

    for ab (i = 3) 
    a 
    ab 
    b 

    for abc (i = 6) 
    a 
    ab 
    abc 
    b 
    bc 
    c 

    for abcd (i = 10) 
    a 
    ab 
    abc 
    abcd 
    b 
    bc 
    bcd 
    c 
    cd 
    d 

    for abcde (i = 15) 
    a 
    ab 
    abc 
    abcd 
    abcde 
    b 
    bc 
    bcd 
    bcde 
    c 
    cd 
    cde 
    d 
    de 
    e

Source

2011-11-25 MorningHacker

Further research shows that "i" can be derived from the sequence of triangular numbers: a(n) = C(n+1, 2) = n(n+1)/2 = 0 + 1 + 2 + ... + n. http://oeis.org/search?q=1%2C+3%2C+6%2C+10%2C+15&language=english&go=Search – MorningHacker
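The pattern listed in the question can be sketched with two nested index ranges: the outer one advances the start position (the "x" from the question) and the inner one advances the end position. This is only a minimal sketch; the helper name `all_substrings` is made up for illustration and does not appear in the original post.

```ruby
# Enumerate every contiguous substring of a word.
# `all_substrings` is a hypothetical helper name, not from the original post.
def all_substrings(word)
  last = word.length - 1
  (0..last).flat_map do |start|
    # For a fixed start index, extend the end index one character at a time.
    (start..last).map { |stop| word[start..stop] }
  end
end

subs = all_substrings("abc")
# subs == ["a", "ab", "abc", "b", "bc", "c"]

n = "abcde".length
count = all_substrings("abcde").length
# count == n * (n + 1) / 2, which is 15 for a five-letter word
```

The count grows quadratically with the word length, which is why checking every substring against a large dictionary gets expensive.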

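I dare to suggest a brute-force solution that is not very elegant but is still useful in case

you have a large number of items (building a regexp can be a pain)
the string to analyze is not limited to two components
you want to get all splittings of a string
you only want complete analyses of the string, that is, from ^ to $.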

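Because of my poor English, I could not come up with a long personal name that can be split in more than one way, so let's analyze a phrase instead: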

word = "godisnowhere"

The dictionary:

@dict = [ "god", "is", "now", "here", "nowhere", "no", "where" ] 

@lengths = @dict.collect {|w| w.length }.uniq.sort

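The @lengths array adds a slight optimization to the algorithm: we use it to prune subwords whose lengths do not occur in the dictionary, without actually performing a dictionary lookup. The array is sorted, which is another optimization, because the loop can stop as soon as a length exceeds the remaining word.

The main part of the solution is a recursive function that finds the initial subwords in a given word and then restarts on the tail subword.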

def find_head_substring(word)
  # boundary condition:
  # remaining subword is shorter than the shortest word in @dict
  return [] if word.length < @lengths[0]

  splittings = []

  @lengths.each do |len|
    break if len > word.length

    head = word[0, len]

    if @dict.include?(head)
      tail = word[len..-1]

      if tail.length == 0
        splittings << head
      else
        tails = find_head_substring(tail)
        unless tails.empty?
          tails.collect! { |t| "#{head} #{t}" }
          splittings.concat tails
        end
      end
    end
  end

  return splittings
end

Now let's see how it works:

find_head_substring(word) 
=>["god is no where", "god is now here", "god is nowhere"]

I have not tested it extensively, so I apologize in advance :)

Source

2011-11-25 16:20:31

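I like the direction this is going, but this approach has trouble with "johnjsmith" when "j" is not in the dictionary. @the Tin Man's approach seems to ignore the "j" and find the other matches within the string. – MorningHacker

Although... it looks like I can add all the single letters of the alphabet to @dict. In that case, your approach returns "john j smith". Very nice! – MorningHacker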
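The idea from this comment can be sketched as follows. Once every single letter is in the dictionary, each word also splits into individual letters, so the sketch keeps the splitting with the fewest parts. The helper name `segmentations` and the "fewest parts" rule are assumptions made for illustration; they are not part of the answer above.

```ruby
require "set"

# Return every way to split `word` into dictionary entries, as arrays of parts.
# `segmentations` is a hypothetical name; it uses the same head/tail recursion
# as find_head_substring, but takes the dictionary as an argument.
def segmentations(word, names)
  return [[]] if word.empty? # one way to split nothing: no parts at all

  (1..word.length).flat_map do |len|
    head = word[0, len]
    next [] unless names.include?(head)

    segmentations(word[len..-1], names).map { |rest| [head] + rest }
  end
end

names = Set.new(%w[sam joe john parker jane smith doe])
with_initials = names | ("a".."z").to_a # add every single letter

# Assumption: the splitting with the fewest parts is the most plausible one.
best = segmentations("johnjsmith", with_initials).min_by(&:length)
best.join(" ") # => "john j smith"
```

Without the single letters, `segmentations("johnjsmith", names)` is empty, which is the difficulty the first comment describes.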

I'm not sure what you are doing with "i", but isn't it as simple as this:

dict.each do |first|
  dict.each do |last|
    puts first, last if first + last == word
  end
end
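The nested loop compares every pair of dictionary entries, which gets slow for a huge dictionary. A possible alternative, sketched here under the assumption that the address consists of exactly two names, is to walk the cut points of the word and look both halves up in a Set. The helper name `split_in_two` is hypothetical.

```ruby
require "set"

# Split a word into exactly two dictionary names by trying each cut point.
# `split_in_two` is a hypothetical helper name used only for this sketch.
def split_in_two(word, names)
  (1...word.length).each do |cut|
    head = word[0, cut]
    tail = word[cut..-1]
    return [head, tail] if names.include?(head) && names.include?(tail)
  end
  nil # no cut point produced two dictionary names
end

names = Set.new(%w[sam joe john parker jane smith doe])
split_in_two("johnsmith", names)  # => ["john", "smith"]
split_in_two("johnjsmith", names) # => nil
```

The work here depends on the length of the word rather than on the size of the dictionary, since a Set lookup does not scan the whole list.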

Source

2011-11-25 08:41:24 pguardiario

r = /^(#{Regexp.union(dict)})(#{Regexp.union(dict)})$/ 
word.match(r) 
=> #<MatchData "johnsmith" 1:"john" 2:"smith">

The regular expression may take a while to build, but matching is very fast.

Source

2011-11-25 08:56:27 Reactormonk

I like it, but I think you want the ^ $ anchors – pguardiario

What are the ^ $ anchors for? – MorningHacker

The start/end of the string – Reactormonk
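One detail worth knowing in Ruby specifically: `^` and `$` match at the start and end of each line, while `\A` and `\z` anchor to the whole string. For a single-line email local part the two behave the same, but the small sketch below (the multi-line input is a made-up example) shows where they differ.

```ruby
dict = %w[sam joe john parker jane smith doe]
names = Regexp.union(dict)

line_anchored   = /^(#{names})(#{names})$/
string_anchored = /\A(#{names})(#{names})\z/

# Made-up input with a second line, to show the difference.
input = "johnsmith\nnot-a-name"

line_anchored.match?(input)   # => true, "johnsmith" fills the first line
string_anchored.match?(input) # => false, the whole string is not two names

string_anchored.match("johnsmith").captures # => ["john", "smith"]
```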

This one collects all occurrences, not necessarily exactly two:

pattern = Regexp.union(dict) 
matches = [] 
while match = word.match(pattern) 
    matches << match.to_s # Or just leave off to_s to keep the match itself 
    word = match.post_match 
end 
matches
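For comparison, `String#scan` with the same pattern should give the same list as the loop, without reassigning `word`. A short sketch, reusing the dictionary from the question:

```ruby
dict = %w[sam joe john parker jane smith doe]
pattern = Regexp.union(dict)

"johnsmith".scan(pattern)  # => ["john", "smith"]
"johnjsmith".scan(pattern) # => ["john", "smith"], the stray "j" is skipped
```

Like the loop, this skips over characters that match nothing, so it does not tell you whether the whole string was accounted for.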

Source

2011-11-25 09:05:55 phs

If you just want the entries in your dictionary that hit:

dict.select{ |r| word[/#{r}/] } 
=> ["john", "smith"]

You run the risk of too many confusing sub-hits, though, so you might want to sort your dictionary so that longer names come first:

dict.sort_by{ |w| -w.size }.select{ |r| word[/#{r}/] } 
=> ["smith", "john"]

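You will still run into situations where a longer name contains a shorter name as a substring and you get multiple hits, so you will need to figure out a way to weed those out. You could keep one array of first names and another of last names and take the first result returned when scanning each, but given the diversity of first and last names, that does not guarantee 100% accuracy and will still collect some bad results.

There is no really good solution to this kind of problem without giving the code further hints about the person's name. Perhaps scanning the body of the message for a salutation or valediction would help.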
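One possible way to weed out the sub-hits, sketched under the assumption that a hit contained in a longer hit is never wanted, is to reject every hit that is a substring of another hit. The helper name `longest_hits` and the sample names "johnson" and "son" are made up for illustration.

```ruby
# Keep only hits that are not contained in a longer hit.
# `longest_hits` is a hypothetical helper name used only for this sketch.
def longest_hits(word, names)
  hits = names.select { |name| word.include?(name) }
  hits.reject do |name|
    hits.any? { |other| other.length > name.length && other.include?(name) }
  end
end

# Made-up sample dictionary with overlapping names.
names = %w[john johnson son smith]
longest_hits("johnsonsmith", names) # => ["johnson", "smith"]
```

This is still a heuristic: it drops a short name even when that name also occurs on its own elsewhere in the string.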

Source

2011-11-25 16:58:43
