2014-09-27 51 views
2

如何计算O(n)中后续子字符串中的匹配字符数。子串是通过从一开始就一次删除一个字符而形成的。如何计算O(n)中后续子字符串中的匹配字符数

例如:定的字符串是否ababcabab,预期结果是8

  • Substr1:babcabab计数:0

  • Substr2:abcabab计数:2个作为第一两个字符与输入的原字符串相匹配,第3个字符不匹配,所以检查匹配是否停止

  • 子3:bcabab计数:0

  • SubStr4:cabab计数:0

  • SubStr5:abab计数:4

  • SubStr6:bab计数:0

  • Substr7:ab计数:2

  • SubStr8: b计数:0

结果预期:2 + 4 + 2 = 8

回答

0

使用一个for循环(在这个例子中的java):

String s = "ababcabab"; 
int count = 0; 
    int count = 0; 
    for(int i = 1; i < s.length(); i++){ // for loop for all substrings [EDIT]: starts w/ 1 instead of 0. Thanks to vincent 
     String sub = s.substring(i); 
     for(int j = 0; j < sub.length() && sub.toCharArray()[j] == s.toCharArray()[j]; j++) /note that for & while loops in java are very similar. stops when substring doesn't match anymore **OR** substring's end is reached 
     { 
      count++; // increases count for every matching char in substring in a row 
     } 
    } 
    System.out.println("The count is: " + count); 
+4

这看起来很'O(n^2)'给我...... – 2014-09-27 14:32:35

0

我们可以通过绘制一些逻辑结论在O(n)的解决这个问题:由于所有的比赛都是一样的,也就是说,它们与字符串本身相匹配;从字符串索引i开始的任何匹配将包含在i(或长度允许的部分)之前开始的所有匹配。此外,任何长度大于其起始索引的匹配都将包含字符串起始部分到匹配起始部分的重复。我们只需要记录完整的匹配,我们可以在一次遍历的字符串中找到匹配,而不需要退后一步,并推导出其余的结果。

实例(非从零开始):

"aaaaaa": 
Starting on index 2, we have a match length 5. This match necessarily includes 
a match of length 4 starting on index 3 (since index 3 is index 2 for the 
substring that starts on index 2). Continuing the same logic, we add 3 + 2 + 1 
for a total of 15, without needing to scan and compare more than Substr2. 

"aabaabaa": 
Starting on index 2, we have a match length 1. 
Starting on index 4, we have a match length 5. This match necessarily includes 
a match of length 1 starting on index 5 (since index 5 is index 2 for the 
substring that starts on index 4). It also necessarily includes a match of 
length (5 - 3) starting on index 7 (since index 7 is index 4 for the substring 
that starts on index 4), and this match implies another match of length 1, 
starting on index 8. Altogether 1 + 5 + 1 + (5 - 3) + 1 = 10. Again, the scan 
was O(n). 

"aabaabaabaabaa": 
Starting on index 2, we have a match length 1. 
Starting on index 4, we have a match length 11. 
1 + 11 + 1 + (11 - 3) + 1 + (8 - 3) + 1 + (5 - 3) + 1 = 31. 

"aabaaab": 
Starting on index 2, we have a match length 1. 
For repeated patterns in the beginning of the string, we can use a formula 
rather than multiple scans, so a string like "aabaaaaaaaaaab" would have the 
same complexity as the one above, (number of times the pattern repeats - number 
of times the pattern repeats in the beginning of the string) * total length of 
repeated pattern at the start of the string. We identify a pattern if the 
length of the first match is a multiple of its starting index. Identifying 
this pattern and using the formula also prevents erroneously missing the 
correct match to record (e.g., without it, we would have identified 'aa' and 
'a' at the end as matches and missed the 'aab'). 
So starting on index 4, we have (3 - 2) * 2 = 2 
Starting on index 5, we have a match length 3. 
1 + 2 + 3 + 1 = 7 

"ababcabab": 
Starting on index 3, we have a match length 2. 
Starting on index 6, we have a match length 4. 
2 + 4 + 2 = 8 
4

可以创建在O(n)的后缀阵列(和LCP阵列)与Ukkonen's algorithm,则变得微不足道与为O另一个通(找到它n),总结原始字符串的LCP值:

LCP SA suffix 
    0 9 . 
    0 7 ab. 
> 2 5 abab. 
> 4 0 ababcabab. 
> 2 2 abcabab. 
    0 8 b. 
    1 6 bab. 
    3 1 babcabab. 
    1 3 bcabab. 
    0 4 cabab. 
    0 0 ababcabab.