2011-10-06 73 views
5

我试图为给定的单词生成所有可能的音节组合。识别什么是音节的过程在这里是不相关的,但它是产生所有组合的问题。我认为这可能是以我认为的几行的递归方式进行递归(尽管其他方式都可以),但我无法正常工作。谁能帮忙?从字符串中生成子串的组合

// how to test a syllable, just for the purpose of this example 
    bool IsSyllable(string possibleSyllable) 
    { 
     return Regex.IsMatch(possibleSyllable, "^(mis|und|un|der|er|stand)$"); 
    } 

    List<string> BreakIntoSyllables(string word) 
    { 
     // the code here is what I'm trying to write 
     // if 'word' is "misunderstand" , I'd like this to return 
     // => {"mis","und","er","stand"},{ "mis","un","der","stand"} 
     // and for any other combinations to be not included 
    } 
+0

您应该使用Trie为您的音节编目。或者你可以使用naiver解决方案:-) – xanatos

回答

4

开始尝试用这样的:

var word = "misunderstand"; 

Func<string, bool> isSyllable = 
    t => Regex.IsMatch(t, "^(mis|und|un|der|er|stand)$"); 

var query = 
    from i in Enumerable.Range(0, word.Length) 
    from l in Enumerable.Range(1, word.Length - i) 
    let part = word.Substring(i, l) 
    where isSyllable(part) 
    select part; 

这将返回:

misunderstand-results

这是否有助于开头至少?


编辑:我想远一点关于这一问题的与这对夫妻的查询上来:

Func<string, IEnumerable<string[]>> splitter = null; 
splitter = 
    t => 
     from n in Enumerable.Range(1, t.Length - 1) 
     let s = t.Substring(0, n) 
     let e = t.Substring(n) 
     from g in (new [] { new [] { e } }).Concat(splitter(e)) 
     select (new [] { s }).Concat(g).ToArray(); 

var query = 
    from split in (new [] { new [] { word } }).Concat(splitter(word)) 
    where split.All(part => isSyllable(part)) 
    select split; 

现在query回报这样的:

misunderstanding-results2

让我知道如果那是现在钉了它。

+0

谢谢,这看起来很有希望,但它目前给我一个编译错误 - “'System.Collections.Generic.IEnumerable '不包含'StartWith'的定义,也没有扩展方法'从...开始'” 。 – mikel

+0

@mikel - 啊,对不起。我使用了Reactive Framework扩展方法。当我回到我的电脑时,我会解决它。 – Enigmativity

+0

@mikel - 我已将查询更改为使用标准运算符。 – Enigmativity

3

通常这种类型的问题是使用Tries解决。我将基于我的实施Trie How to create a trie in c#(但请注意,我已将其重写)。

var trie = new Trie(new[] { "un", "que", "stio", "na", "ble", "qu", "es", "ti", "onable", "o", "nable" }); 
//var trie = new Trie(new[] { "u", "n", "q", "u", "e", "s", "t", "i", "o", "n", "a", "b", "l", "e", "un", "qu", "es", "ti", "on", "ab", "le", "nq", "ue", "st", "io", "na", "bl", "unq", "ues", "tio", "nab", "nqu", "est", "ion", "abl", "que", "stio", "nab" }); 

var word = "unquestionable"; 

var parts = new List<List<string>>(); 

Split(word, 0, trie, trie.Root, new List<string>(), parts); 

// 

public static void Split(string word, int index, Trie trie, TrieNode node, List<string> currentParts, List<List<string>> parts) 
{ 
    // Found a syllable. We have to split: one way we take that syllable and continue from it (and it's done in this if). 
    // Another way we ignore this possible syllable and we continue searching for a longer word (done after the if) 
    if (node.IsTerminal) 
    { 
     // Add the syllable to the current list of syllables 
     currentParts.Add(node.Word); 

     // "covered" the word with syllables 
     if (index == word.Length) 
     { 
      // Here we make a copy of the parts of the word. This because the currentParts list is a "working" list and is modified every time. 
      parts.Add(new List<string>(currentParts)); 
     } 
     else 
     { 
      // There are remaining letters in the word. We restart the scan for more syllables, restarting from the root. 
      Split(word, index, trie, trie.Root, currentParts, parts); 
     } 

     // Remove the syllable from the current list of syllables 
     currentParts.RemoveAt(currentParts.Count - 1); 
    } 

    // We have covered all the word with letters. No more work to do in this subiteration 
    if (index == word.Length) 
    { 
     return; 
    } 

    // Here we try to find the edge corresponding to the current character 

    TrieNode nextNode; 

    if (!node.Edges.TryGetValue(word[index], out nextNode)) 
    { 
     return; 
    } 

    Split(word, index + 1, trie, nextNode, currentParts, parts); 
} 

public class Trie 
{ 
    public readonly TrieNode Root = new TrieNode(); 

    public Trie() 
    { 
    } 

    public Trie(IEnumerable<string> words) 
    { 
     this.AddRange(words); 
    } 

    public void Add(string word) 
    { 
     var currentNode = this.Root; 

     foreach (char ch in word) 
     { 
      TrieNode nextNode; 

      if (!currentNode.Edges.TryGetValue(ch, out nextNode)) 
      { 
       nextNode = new TrieNode(); 
       currentNode.Edges[ch] = nextNode; 
      } 

      currentNode = nextNode; 
     } 

     currentNode.Word = word; 
    } 

    public void AddRange(IEnumerable<string> words) 
    { 
     foreach (var word in words) 
     { 
      this.Add(word); 
     } 
    } 
} 

public class TrieNode 
{ 
    public readonly Dictionary<char, TrieNode> Edges = new Dictionary<char, TrieNode>(); 
    public string Word { get; set; } 

    public bool IsTerminal 
    { 
     get 
     { 
      return this.Word != null; 
     } 
    } 
} 

word是你有兴趣,parts将包含可能的音节的名单列表(它可能会更正确,使之成为List<string[]>字符串,但它是很容易做到这一点。而不是parts.Add(new List<string>(currentParts));parts.Add(currentParts.ToArray());和更改所有的List<List<string>>List<string[]>

我会添加Enigmativity响应的变体全髋关节置换是theretically快于他的,因为它摒弃错误的音节立即代替滤波后后他们。如果你喜欢它,你应该给他+1,因为没有他的想法,这种差异t是不可能的。但请注意,它仍然是一个黑客。 “正确”的解决方案是使用特里(S):-)

Func<string, bool> isSyllable = t => Regex.IsMatch(t, "^(un|que|stio|na|ble|qu|es|ti|onable|o|nable)$"); 

Func<string, IEnumerable<string[]>> splitter = null; 
splitter = 
    t => 
     (
     from n in Enumerable.Range(1, t.Length - 1) 
     let s = t.Substring(0, n) 
     where isSyllable(s) 
     let e = t.Substring(n) 
     let f = splitter(e) 
     from g in f 
     select (new[] { s }).Concat(g).ToArray() 
     ) 
     .Concat(isSyllable(t) ? new[] { new string[] { t } } : new string[0][]); 

var parts = splitter(word).ToList(); 

的解释:

 from n in Enumerable.Range(1, t.Length - 1) 
     let s = t.Substring(0, n) 
     where isSyllable(s) 

我们计算一个字的所有可能的音节,从长度为1至的长度字 - 1并检查它是否是音节。我们直接删除了非音节。整个单词作为一个音节将在稍后检查。

 let e = t.Substring(n) 
     let f = splitter(e) 

我们搜索的字符串

 from g in f 
     select (new[] { s }).Concat(g).ToArray() 

的剩余部分的音节和我们链“当前”音节找到音节。请注意,我们正在创建许多无用的数组。如果我们接受IEnumerable<IEnumerable<string>>作为结果,我们可以拿走ToArray

(我们可以多排一起改写,删除许多let,像

 from g in splitter(t.Substring(n)) 
     select (new[] { s }).Concat(g).ToArray() 

但我们不会为清楚起见做)

我们串接“当前”的音节与音节发现。

 .Concat(isSyllable(t) ? new[] { new string[] { t } } : new string[0][]); 

在这里,我们可以重建查询,以便于一点不使用此Concat和创建空数组,但是这将是一个有点复杂(我们可以把整个lambda函数作为isSyllable(t) ? new[] { new string[] { t } }.Concat(oldLambdaFunction) : oldLambdaFunction

在最后,如果整个单词是一个音节,我们将整个单词作为一个音节添加。否则,我们Concat一个空数组(所以没有Concat

0

你可能会遇到扩展这个问题,老实说,我不确定你的数据集有多大,但是基于一个简单的“这是一个音节吗?”的解决方案。您需要为每个单词调用大约0(n * n)的“音节检测”例程,其中n =单词中的字符数(如果这没有意义,这意味着对于大型数据集它可能会很慢!) 。这不会考虑检测算法的可扩展性,当您添加更多音节时,检测算法的可扩展性也可能会变慢。 。


我知道你已经说过你识别什么是音节或不是相关的过程,但是让我们说你可以改变它使它更像自动完成,即通过音节的开头并让它告诉你所有从这一点来说,可能的音节将更具可扩展性。如果性能失控,请用trie替换它。