PHP: Truncating HTML and UTF-8

I need to truncate a string to a given length while ignoring HTML tags. I found a suitable function here.

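I made one small change to it: I added output buffering with ob_start(), so that the function returns the truncated HTML instead of printing it.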
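For anyone unfamiliar with output buffering, here is a minimal sketch of the idea on its own. The helper name capture_output is hypothetical and is not part of the function I am using. It relies on ob_get_clean(), which as far as I know does the same job as calling ob_get_contents() followed by ob_end_clean().

```php
<?php
// Hypothetical helper (illustration only): runs a callback and returns
// everything it printed instead of sending it to the browser.
function capture_output(callable $printer): string
{
    ob_start();                      // from here on, print/echo output is buffered
    $printer();                      // the callback prints as usual
    return (string) ob_get_clean();  // fetch the buffer contents and discard the buffer
}

$captured = capture_output(function () {
    print('<b>Koks nors tekstas</b>');
});
```

After this runs, nothing has been sent to the page and $captured holds the printed markup, which is what lets the truncate function return a string.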

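The problem is with UTF-8. If the last character of the truncated string is one of the Lithuanian letters [ą, č, ę, ė, į, š, ų, ū, ž], I get the replacement character U+FFFD at the end of the string.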
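The symptom can be reproduced outside the function. This is only a sketch and it assumes the cause is that strlen() and substr() count bytes rather than characters. The mb_* calls need the mbstring extension. The word is written with byte escapes so that the file encoding does not matter.

```php
<?php
// "Lietuviškas" with "š" spelled out as its two UTF-8 bytes (0xC5 0xA1).
$word = "Lietuvi\xC5\xA1kas";

// strlen() counts bytes, mb_strlen() counts characters.
$byteLength = strlen($word);             // 12
$charLength = mb_strlen($word, 'UTF-8'); // 11

// Cutting at byte 8 keeps "Lietuvi" plus only the first byte of "š".
$byteCut = substr($word, 0, 8);

// Cutting at character 8 keeps the whole "š".
$charCut = mb_substr($word, 0, 8, 'UTF-8');

// The byte-based cut is no longer valid UTF-8; the character-based cut is.
$byteCutIsValid = mb_check_encoding($byteCut, 'UTF-8'); // false
$charCutIsValid = mb_check_encoding($charCut, 'UTF-8'); // true
```

A browser that receives the byte-based cut in a UTF-8 page cannot decode the dangling byte and shows U+FFFD in its place.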

Here is my code. You can copy and paste it and try it yourself:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 
<html xmlns="http://www.w3.org/1999/xhtml"> 
<head> 
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> 
<title>String truncate</title> 
</head> 

<?php 

    $html = '<b>Koks nors tekstas</b>. <p>Lietuviškas žodis.</p>'; 

    $html = html_truncate(27, $html); 

    echo $html; 

    /* Truncate HTML, close opened tags 
    * 
    * @param int, maxlength of the string 
    * @param string, html  
    * @return $html 
    */ 
    function html_truncate($maxLength, $html){ 

     $printedLength = 0; 
     $position = 0; 
     $tags = array(); 

     ob_start(); 

     while ($printedLength < $maxLength && preg_match('{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}', $html, $match, PREG_OFFSET_CAPTURE, $position)){ 

      list($tag, $tagPosition) = $match[0]; 

      // Print text leading up to the tag. 
      $str = substr($html, $position, $tagPosition - $position); 
      if ($printedLength + strlen($str) > $maxLength){ 
       print(substr($str, 0, $maxLength - $printedLength)); 
       $printedLength = $maxLength; 
       break; 
      } 

      print($str); 
      $printedLength += strlen($str); 

      if ($tag[0] == '&'){ 
       // Handle the entity. 
       print($tag); 
       $printedLength++; 
      } 
      else{ 
       // Handle the tag. 
       $tagName = $match[1][0]; 
       if ($tag[1] == '/'){ 
        // This is a closing tag. 

        $openingTag = array_pop($tags); 
        assert($openingTag == $tagName); // check that tags are properly nested. 

        print($tag); 
       } 
       else if ($tag[strlen($tag) - 2] == '/'){ 
        // Self-closing tag. 
        print($tag); 
       } 
       else{ 
        // Opening tag. 
        print($tag); 
        $tags[] = $tagName; 
       } 
      } 

      // Continue after the tag. 
      $position = $tagPosition + strlen($tag); 
     } 

     // Print any remaining text. 
     if ($printedLength < $maxLength && $position < strlen($html)) 
      print(substr($html, $position, $maxLength - $printedLength)); 

     // Close any open tags. 
     while (!empty($tags)) 
      printf('</%s>', array_pop($tags)); 


     $bufferOuput = ob_get_contents(); 

     ob_end_clean();   

     $html = $bufferOuput; 

     return $html; 

    } 

?> 

<body> 
</body> 
</html>

The result of this function is:

Koks nors tekstas. Lietuvi�

Any ideas why this function messes up UTF-8?

2011-11-22 Bounce

Possible duplicate of http://stackoverflow.com/questions/6288875/utf-8-compatible-truncate-function – user