2010-10-13 154 views

Schinke Latin stemming algorithm in PHP

This website offers the "Schinke Latin stemming algorithm" for download, for use in the Snowball stemming system.

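I want to use this algorithm, but I don't want to use Snowball.

The good news: that page has some pseudocode which can be converted into a PHP function. This is what I have tried: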

<?php 
function stemLatin($word) { 
    // output = array(NOUN-BASED STEM, VERB-BASED STEM) 
    // DEFINE CLASSES BEGIN 
    $queWords = array('atque', 'quoque', 'neque', 'itaque', 'absque', 'apsque', 'abusque', 'adaeque', 'adusque', 'denique', 'deque', 'susque', 'oblique', 'peraeque', 'plenisque', 'quandoque', 'quisque', 'quaeque', 'cuiusque', 'cuique', 'quemque', 'quamque', 'quaque', 'quique', 'quorumque', 'quarumque', 'quibusque', 'quosque', 'quasque', 'quotusquisque', 'quousque', 'ubique', 'undique', 'usque', 'uterque', 'utique', 'utroque', 'utribique', 'torque', 'coque', 'concoque', 'contorque', 'detorque', 'decoque', 'excoque', 'extorque', 'obtorque', 'optorque', 'retorque', 'recoque', 'attorque', 'incoque', 'intorque', 'praetorque'); 
    $suffixesA = array('ibus, 'ius, 'ae, 'am, 'as, 'em', 'es', ia', 'is', 'nt', 'os', 'ud', 'um', 'us', 'a', 'e', 'i', 'o', 'u'); 
    $suffixesB = array('iuntur', 'beris', 'erunt', 'untur', 'iunt', 'mini', 'ntur', 'stis', 'bor', 'ero', 'mur', 'mus', 'ris', 'sti', 'tis', 'tur', 'unt', 'bo', 'ns', 'nt', 'ri', 'm', 'r', 's', 't'); 
    // DEFINE CLASSES END 
    $word = strtolower(trim($word)); // make string lowercase + remove white spaces before and behind 
    $word = str_replace('j', 'i', $word); // replace all <j> by <i> 
    $word = str_replace('v', 'u', $word); // replace all <v> by <u> 
    if (substr($word, -3) == 'que') { // if word ends with -que 
     if (in_array($word, $queWords)) { // if word is a queWord 
      return array($word, $word); // output queWord as both noun-based and verb-based stem 
     } 
     else { 
      $word = substr($word, 0, -3); // remove the -que 
     } 
    } 
    foreach ($suffixesA as $suffixA) { // remove suffixes for noun-based forms (list A) 
     if (substr($word, -strlen($suffixA)) == $suffixA) { // if the word ends with that suffix 
      $word = substr($word, 0, -strlen($suffixA)); // remove the suffix 
      break; // remove only one suffix 
     } 
    } 
    if (strlen($word) >= 2) { $nounBased = $word; } else { $nounBased = ''; } // add only if word contains two or more characters 
    foreach ($suffixesB as $suffixB) { // remove suffixes for verb-based forms (list B) 
     if (substr($word, -strlen($suffixA)) == $suffixA) { // if the word ends with that suffix 
      switch ($suffixB) { 
       case 'iuntur', 'erunt', 'untur', 'iunt', 'unt': $word = substr($word, 0, -strlen($suffixB)).'i'; break; // replace suffix by <i> 
       case 'beris', 'bor', 'bo': $word = substr($word, 0, -strlen($suffixB)).'bi'; break; // replace suffix by <bi> 
       case 'ero': $word = substr($word, 0, -strlen($suffixB)).'eri'; break; // replace suffix by <eri> 
       default: $word = substr($word, 0, -strlen($suffixB)); break; // remove the suffix 
      } 
      break; // remove only one suffix 
     } 
    } 
    if (strlen($word) >= 2) { $verbBased = $word; } else { $verbBased = ''; } // add only if word contains two or more characters 
    return array($nounBased, $verbBased); 
} 
?> 

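My questions:

1) Does this code work correctly? Does it follow the rules of the algorithm?

2) How could the code be improved (performance)?

Thank you very much in advance!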

Answers


No, your function does not work; it contains syntax errors. For example, you have unclosed quotes, and you use the wrong switch syntax.
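For reference on the switch problem: PHP does not accept a comma-separated list of values in a single case label. Several values share one branch by stacking case labels. The following is only a minimal sketch; the helper name verbSuffixReplacement is hypothetical and not part of the original code.

```php
<?php
// Hypothetical helper (an assumption for illustration): returns the string
// that replaces a verb suffix, or '' when the suffix is simply removed.
function verbSuffixReplacement($suffix) {
    switch ($suffix) {
        // Stacked case labels fall through to the same branch.
        case 'iuntur':
        case 'erunt':
        case 'untur':
        case 'iunt':
        case 'unt':
            return 'i';
        case 'beris':
        case 'bor':
        case 'bo':
            return 'bi';
        case 'ero':
            return 'eri';
        default:
            return '';
    }
}

var_dump(verbSuffixReplacement('bor')); // string(2) "bi"
```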

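Here is my rewrite of the function. Since the pseudo-algorithm on that page is not precise, I had to do some interpretation. I used the examples mentioned in that article as a guide.

I also made some optimizations. The first is that I declared the word and suffix arrays as static. All calls to the function therefore share the same arrays, which should be good for performance ;)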
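The behavior of static can be seen in a small sketch. The function name callCounter is hypothetical; it only illustrates that a static local variable is initialized once and keeps its value across calls, which is the same mechanism that lets static arrays be shared between calls.

```php
<?php
// Hypothetical demo function: the static local is initialized on the
// first call only and retains its value between calls.
function callCounter() {
    static $calls = 0;
    $calls++;
    return $calls;
}

$first  = callCounter();
$second = callCounter();
var_dump($second === $first + 1); // bool(true)
```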

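I also restructured the arrays so that they can be used more efficiently. I changed the $queWords array so that it can be used for a fast hash-table lookup instead of the slow in_array. I also stored the lengths of the suffixes in the arrays, so they don't need to be computed at runtime (which really is slow). I may have made further minor optimizations.

I don't know how much faster this code is, but it should be much faster. It also now works on the examples provided.

Here is the code: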
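As a rough illustration of these two changes (a sketch with shortened arrays, not the full lists), storing the words as array keys lets isset() do the membership check, and storing suffix => length pairs lets the loop read the length straight from the array:

```php
<?php
// Words as keys: membership is a hash lookup via isset().
$queWords = array('atque' => 1, 'quoque' => 1, 'neque' => 1);

// The same words as a plain list: in_array() scans the values one by one.
$queList = array('atque', 'quoque', 'neque');

var_dump(isset($queWords['neque']));      // bool(true)
var_dump(in_array('neque', $queList));    // bool(true)
var_dump(isset($queWords['populusque'])); // bool(false)

// Suffix => length pairs: the length comes from the array, not from strlen().
$suffixes = array('ibus' => 4, 'us' => 2);
foreach ($suffixes as $suffix => $length) {
    if (substr('civibus', -$length) == $suffix) {
        echo substr('civibus', 0, -$length), "\n"; // prints "civ"
        break;
    }
}
```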

<?php 
    function stemLatin($word) { 
     static $queWords = array(
      'atque'   => 1, 
      'quoque'  => 1, 
      'neque'   => 1, 
      'itaque'  => 1, 
      'absque'  => 1, 
      'apsque'  => 1, 
      'abusque'  => 1, 
      'adaeque'  => 1, 
      'adusque'  => 1, 
      'denique'  => 1, 
      'deque'   => 1, 
      'susque'  => 1, 
      'oblique'  => 1, 
      'peraeque'  => 1, 
      'plenisque'  => 1, 
      'quandoque'  => 1, 
      'quisque'  => 1, 
      'quaeque'  => 1, 
      'cuiusque'  => 1, 
      'cuique'  => 1, 
      'quemque'  => 1, 
      'quamque'  => 1, 
      'quaque'  => 1, 
      'quique'  => 1, 
      'quorumque'  => 1, 
      'quarumque'  => 1, 
      'quibusque'  => 1, 
      'quosque'  => 1, 
      'quasque'  => 1, 
      'quotusquisque' => 1, 
      'quousque'  => 1, 
      'ubique'  => 1, 
      'undique'  => 1, 
      'usque'   => 1, 
      'uterque'  => 1, 
      'utique'  => 1, 
      'utroque'  => 1, 
      'utribique'  => 1, 
      'torque'  => 1, 
      'coque'   => 1, 
      'concoque'  => 1, 
      'contorque'  => 1, 
      'detorque'  => 1, 
      'decoque'  => 1, 
      'excoque'  => 1, 
      'extorque'  => 1, 
      'obtorque'  => 1, 
      'optorque'  => 1, 
      'retorque'  => 1, 
      'recoque'  => 1, 
      'attorque'  => 1, 
      'incoque'  => 1, 
      'intorque'  => 1, 
      'praetorque' => 1, 
     ); 
     static $suffixesNoun = array(
      'ibus' => 4, 
      'ius' => 3, 
      'ae' => 2, 
      'am' => 2, 
      'as' => 2, 
      'em' => 2, 
      'es' => 2, 
      'ia' => 2, 
      'is' => 2, 
      'nt' => 2, 
      'os' => 2, 
      'ud' => 2, 
      'um' => 2, 
      'us' => 2, 
      'a' => 1, 
      'e' => 1, 
      'i' => 1, 
      'o' => 1, 
      'u' => 1, 
     ); 
     static $suffixesVerb = array(
      'iuntur' => 6, 
      'beris' => 5, 
      'erunt' => 5, 
      'untur' => 5, 
      'iunt' => 4, 
      'mini' => 4, 
      'ntur' => 4, 
      'stis' => 4, 
      'bor' => 3, 
      'ero' => 3, 
      'mur' => 3, 
      'mus' => 3, 
      'ris' => 3, 
      'sti' => 3, 
      'tis' => 3, 
      'tur' => 3, 
      'unt' => 3, 
      'bo'  => 2, 
      'ns'  => 2, 
      'nt'  => 2, 
      'ri'  => 2, 
      'm'  => 1, 
      'r'  => 1, 
      's'  => 1, 
      't'  => 1, 
     ); 

     $word = strtr(strtolower(trim($word)), 'jv', 'iu'); // trim, lowercase and j => i, v => u 

     if (substr($word, -3) == 'que') { 
      if (isset($queWords[$word])) { 
       return array($word, $word); 
      } 
      $word = substr($word, 0, -3); 
     } 

     // default both stems to the normalized word (used when no suffix matches) 
     $stems = array($word, $word); 

     foreach ($suffixesNoun as $suffix => $length) { 
      if (substr($word, -$length) == $suffix) { 
       $tmp = substr($word, 0, -$length); 

       if (isset($tmp[1])) 
        $stems[0] = $tmp; 
       break; 
      } 
     } 

     foreach ($suffixesVerb as $suffix => $length) { 
      if (substr($word, -$length) == $suffix) { 
       switch ($suffix) { 
        case 'iuntur': 
        case 'erunt': 
        case 'untur': 
        case 'iunt': 
        case 'unt': 
         $tmp = substr_replace($word, 'i', -$length, $length); 
        break; 
        case 'beris': 
        case 'bor': 
        case 'bo': 
         $tmp = substr_replace($word, 'bi', -$length, $length); 
        break; 
        case 'ero': 
         $tmp = substr_replace($word, 'eri', -$length, $length); 
        break; 
        default: 
         $tmp = substr($word, 0, -$length); 
       } 

       if (isset($tmp[1])) 
        $stems[1] = $tmp; 
       break; 
      } 
     } 

     return $stems; 
    } 

    var_dump(stemLatin('aquila')); 
    var_dump(stemLatin('portat')); 
    var_dump(stemLatin('portis')); 

Thank you very much, this works perfectly! :) – caw 2010-10-16 19:53:07


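As far as I can tell, this follows the algorithm described in the link and should work correctly. (Apart from the syntax errors in your array definition, where some apostrophes are missing.)

Performance-wise, it doesn't look like there is much to gain here, but a few things come to mind.

If this function is called many times during script execution, you might gain something by defining those arrays outside the function. I don't think PHP is smart enough to cache those arrays between calls to the function.

You could also combine the two str_replace calls into one: $word = str_replace(array('j','v'), array('i','u'), $word);, or, since you are replacing single characters with single characters, you could use $word = strtr($word,'jv','iu'); but I don't think this will make much of a difference in practice. You would have to try it to be sure.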
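A quick sketch showing that the two forms produce the same result; the sample word is chosen here only for illustration:

```php
<?php
$word = 'juvenis';

// Array form of str_replace(): each search string is replaced by the
// entry at the same index in the replacement array.
$a = str_replace(array('j', 'v'), array('i', 'u'), $word);

// strtr() with two strings of equal length maps characters one-to-one.
$b = strtr($word, 'jv', 'iu');

var_dump($a); // string(7) "iuuenis"
var_dump($b); // string(7) "iuuenis"
```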


Thank you for your suggestions, they all seem correct and helpful :) – caw 2010-10-16 19:52:44