2010-07-08 95 views
1

Optimizing a PHP function that trims a string

I wrote this PHP function, which takes any text/HTML string and trims it.

For example:

gen_string("Hello, how are you today?",10); 

Returns: Hello, how...

The problem appears when the limit falls on the position of a special character such as á, ñ, etc.

In this case:

gen_string("Helló my friend",5); 

Returns: Hell�...

Any ideas on how to solve this? This is the current function:

# string: advanced substr 
function gen_string($string,$min,$clean=false) { 
$text = trim(strip_tags($string)); 
if(strlen($text)>$min) { 
    $blank = strpos($text,' '); 
    if($blank) { 
    # limit plus last word 
    $extra = strpos(substr($text,$min),' '); 
    $max = $min+$extra; 
    $r = substr($text,0,$max); 
    if(strlen($text)>=$max && !$clean) $r=trim($r,'.').'...'; 
    } else { 
    # if there are no spaces 
    $r = substr($text,0,$min).'...'; 
    } 
} else { 
    # if original length is lower than limit 
    $r = $text; 
} 
return trim($r); 
} 

Thanks!

+0

You need to use the mbstring functions, specifically `mb_substr()` http://php.net/mb_substr and `mb_strpos()` http://php.net/mb_strpos – 2010-07-08 18:00:17

+0

Weird... Call to undefined function mb_strimwidth() - and I do have PHP 5 – andufo 2010-07-08 18:12:56

Answers

-1

For your return statement, you could try:

return htmlspecialchars(trim($r)); 

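Edit: I tried your code exactly as you provided it, and it ran fine for me without having to use htmlspecialchars(). That may be because the charset is set to UTF-8 in the <head> of the page where the code runs. So your options might be to set the page's encoding like this: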

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> 

or to use htmlspecialchars() as above.

+0

Nope, didn't work. – andufo 2010-07-08 18:13:19

+0

That � comes from half a Unicode character, and htmlspecialchars() won't help here. – hop 2010-07-08 18:15:59

+0

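I'm stumped. Your original code works perfectly fine for me on both my localhost and my production environment. I can only guess at things outside of what you originally posted that might be causing the problem. – greenie 2010-07-08 18:29:09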
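hop's point can be checked directly. The sketch below assumes PHP 5.4 or later (for `ENT_SUBSTITUTE`) and writes "ó" as its two UTF-8 bytes so the result does not depend on how the file is saved. It is only an illustration of why escaping the output cannot repair a string that was already cut in the middle of a character.

```php
<?php
// "Helló my friend" cut at byte 5: "Hell" plus the first byte of "ó".
$broken = substr("Hell\xC3\xB3 my friend", 0, 5);

// With these flags, an invalid UTF-8 sequence makes htmlspecialchars()
// return an empty string, so the broken byte is not repaired.
$escaped = htmlspecialchars($broken, ENT_QUOTES, 'UTF-8');

// ENT_SUBSTITUTE swaps the broken byte for U+FFFD (the "�" glyph) instead.
$substituted = htmlspecialchars($broken, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

var_dump($escaped === '');        // bool(true)
echo bin2hex($substituted), "\n"; // 48656c6cefbfbd
```

Either way the "ó" is gone, so the cut itself has to happen on a character boundary.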

4

You should use the multibyte string functions to handle Unicode characters correctly.

For example, you could try mb_strimwidth to truncate the string to a specified length.
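A minimal sketch of that call, assuming the mbstring extension is loaded and the input is UTF-8 (the sample string and the width of 8 are only illustrative):

```php
<?php
$text = "Hell\xC3\xB3 my friend"; // "Helló my friend" as UTF-8 bytes

if (function_exists('mb_strimwidth')) {
    // The width of 8 includes the 3-character trim marker,
    // so 5 characters of the original text are kept.
    $short = mb_strimwidth($text, 0, 8, '...', 'UTF-8');
    echo $short, "\n"; // Helló...
} else {
    echo "mbstring is not loaded\n";
}
```

Note that mb_strimwidth cuts at the requested width and does not look for word boundaries, so it is not a drop-in replacement for the "limit plus last word" behaviour in the question.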

+0

Seems like a good option, but I get a "Call to undefined function mb_strimwidth()" error message - I have PHP 5 – andufo 2010-07-08 18:15:38

+0

@andufo: You may want to look at this related question about enabling the multibyte string functions: http://stackoverflow.com/questions/2294393/mb-substr-cant-be-used-on-php-5-2-6 – 2010-07-08 18:23:32

+0

You could also make use of iconv http://php.net/iconv – 2010-07-09 00:28:52
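Putting the two comments above together, one way to cope with a server where mbstring may be missing is to test for it and fall back to iconv. This is only a sketch: `char_substr` is a made-up helper name, and UTF-8 input is assumed.

```php
<?php
// Hypothetical helper: use whichever character-aware substr is available.
function char_substr($text, $start, $length) {
    if (function_exists('mb_substr')) {
        return mb_substr($text, $start, $length, 'UTF-8');
    }
    if (function_exists('iconv_substr')) {
        return iconv_substr($text, $start, $length, 'UTF-8');
    }
    // Neither extension is loaded: byte-based substr() is all that is left,
    // and it can still split a multibyte character.
    return substr($text, $start, $length);
}

// Reports whether the mb_* functions exist on this installation.
var_dump(extension_loaded('mbstring'));

echo char_substr("Hell\xC3\xB3 my friend", 0, 5), "\n";
```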

0

Apart from the multibyte issue, maybe you could write it shorter:

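function gen_string($str, $limit) { 
    // Nothing to cut if the string already fits (byte length, not multibyte-safe) 
    if (strlen($str) <= $limit) 
     return $str; 
    // Negative offset: search backwards for the last space at or before the limit 
    $offset = -(strlen($str) - $limit); 
    $pos = strrpos($str, ' ', $offset); 
    // No space before the limit: fall back to a hard cut 
    if ($pos === false) $pos = $limit; 
    return substr($str, 0, $pos).'...'; 
} 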

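This caps the length of the string: instead of cutting after the first word that goes past the limit, it makes sure the text before the ellipsis never exceeds the limit.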
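The same "never exceed the limit" idea can be made character-aware without mbstring by letting PCRE count characters. This is a sketch under assumptions, not a tested replacement: `gen_string_utf8` is a made-up name, the input is assumed to be valid UTF-8, and `$limit` is assumed to be at least 1.

```php
<?php
// Hypothetical variant: word-safe truncation that counts characters, not bytes.
function gen_string_utf8($str, $limit) {
    $str = trim($str);
    // With /u, "." matches one whole character, so this counts characters.
    if (preg_match_all('/./su', $str) <= $limit) {
        return $str;
    }
    // Longest prefix of at most $limit characters that is followed by whitespace.
    if (preg_match('/^.{1,' . (int) $limit . '}(?=\s)/su', $str, $m)) {
        return rtrim($m[0]) . '...';
    }
    // No whitespace within the limit: hard cut on a character boundary.
    preg_match('/^.{0,' . (int) $limit . '}/su', $str, $m);
    return $m[0] . '...';
}

echo gen_string_utf8("Hello, how are you today?", 10), "\n"; // Hello, how...
echo gen_string_utf8("Hell\xC3\xB3 my friend", 5), "\n";     // Helló...
```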

+0

This doesn't fix the broken-character problem =( – andufo 2010-07-08 18:15:06

+0

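Hmm, I really don't know what could be causing your problem, because I use the code above in production without any issues, and we use űúőó characters :) I also just checked it on localhost and it's fine. Are you sure everything is OK in your server settings? – galambalazs 2010-07-08 18:20:23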

1

You could also take a different approach and make use of the UTF-8 support in the PCRE regex extension (assuming your strings are UTF-8!).

function gen_string($string, $length) 
{ 
    $str = trim(strip_tags($string)); 
    $strlen = strlen(utf8_decode($str)); 
    // String is less than limit 
    if ($strlen <= $length) return $str; 
    // Shorten string, preserving whole "words" (non-whitespace) 
    preg_match('/^.{'.($length-1).'}\S*/su', $str, $match); 
    // Append ellipsis if needed (bytes length is OK to check) 
    if (strlen($match[0]) !== strlen($str)) $match[0] .= '...'; 
    return $match[0]; 
} 
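
To see what the `/u` modifier changes in a pattern like the one above, here is a small comparison. The sample string is only illustrative, and "ó" is written as its two UTF-8 bytes so the result does not depend on the file encoding.

```php
<?php
$text = "Hell\xC3\xB3 my friend"; // "Helló my friend" as UTF-8 bytes

// With /u, "." matches one whole UTF-8 character.
preg_match('/^.{5}/su', $text, $withU);

// Without /u, "." matches one byte, so the match ends inside "ó".
preg_match('/^.{5}/s', $text, $withoutU);

echo bin2hex($withU[0]), "\n";    // 48656c6cc3b3 ("Helló", 6 bytes)
echo bin2hex($withoutU[0]), "\n"; // 48656c6cc3   ("Hell" plus half of "ó")
```
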
0

strlen() can't be used on UTF-8 strings, because it also counts the continuation bytes, which shouldn't be counted.
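A quick illustration of that difference, using the string from the question (with "ó" written as its two UTF-8 bytes so the result does not depend on the file encoding):

```php
<?php
$text = "Hell\xC3\xB3 my friend"; // "Helló my friend" as UTF-8 bytes

// strlen() counts bytes: "ó" takes two bytes, so this is 16, not 15.
$bytes = strlen($text);

// Dropping continuation bytes (0x80-0xBF) leaves one byte per character.
$chars = strlen(preg_replace('/[\x80-\xBF]/', '', $text));

// Cutting at byte 5 keeps only the first half of "ó".
$cut = substr($text, 0, 5);

echo $bytes, " bytes, ", $chars, " characters\n"; // 16 bytes, 15 characters
echo bin2hex($cut), "\n";                         // 48656c6cc3
```

The trailing `c3` is a lead byte with no continuation byte after it, which is what a browser renders as "�".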

You can try the following code:

define('PREG_CLASS_UNICODE_WORD_BOUNDARY', 
    '\x{0}-\x{2F}\x{3A}-\x{40}\x{5B}-\x{60}\x{7B}-\x{A9}\x{AB}-\x{B1}\x{B4}' . 
    '\x{B6}-\x{B8}\x{BB}\x{BF}\x{D7}\x{F7}\x{2C2}-\x{2C5}\x{2D2}-\x{2DF}' . 
    '\x{2E5}-\x{2EB}\x{2ED}\x{2EF}-\x{2FF}\x{375}\x{37E}-\x{385}\x{387}\x{3F6}' . 
    '\x{482}\x{55A}-\x{55F}\x{589}-\x{58A}\x{5BE}\x{5C0}\x{5C3}\x{5C6}' . 
    '\x{5F3}-\x{60F}\x{61B}-\x{61F}\x{66A}-\x{66D}\x{6D4}\x{6DD}\x{6E9}' . 
    '\x{6FD}-\x{6FE}\x{700}-\x{70F}\x{7F6}-\x{7F9}\x{830}-\x{83E}' . 
    '\x{964}-\x{965}\x{970}\x{9F2}-\x{9F3}\x{9FA}-\x{9FB}\x{AF1}\x{B70}' . 
    '\x{BF3}-\x{BFA}\x{C7F}\x{CF1}-\x{CF2}\x{D79}\x{DF4}\x{E3F}\x{E4F}' . 
    '\x{E5A}-\x{E5B}\x{F01}-\x{F17}\x{F1A}-\x{F1F}\x{F34}\x{F36}\x{F38}' . 
    '\x{F3A}-\x{F3D}\x{F85}\x{FBE}-\x{FC5}\x{FC7}-\x{FD8}\x{104A}-\x{104F}' . 
    '\x{109E}-\x{109F}\x{10FB}\x{1360}-\x{1368}\x{1390}-\x{1399}\x{1400}' . 
    '\x{166D}-\x{166E}\x{1680}\x{169B}-\x{169C}\x{16EB}-\x{16ED}' . 
    '\x{1735}-\x{1736}\x{17B4}-\x{17B5}\x{17D4}-\x{17D6}\x{17D8}-\x{17DB}' . 
    '\x{1800}-\x{180A}\x{180E}\x{1940}-\x{1945}\x{19DE}-\x{19FF}' . 
    '\x{1A1E}-\x{1A1F}\x{1AA0}-\x{1AA6}\x{1AA8}-\x{1AAD}\x{1B5A}-\x{1B6A}' . 
    '\x{1B74}-\x{1B7C}\x{1C3B}-\x{1C3F}\x{1C7E}-\x{1C7F}\x{1CD3}\x{1FBD}' . 
    '\x{1FBF}-\x{1FC1}\x{1FCD}-\x{1FCF}\x{1FDD}-\x{1FDF}\x{1FED}-\x{1FEF}' . 
    '\x{1FFD}-\x{206F}\x{207A}-\x{207E}\x{208A}-\x{208E}\x{20A0}-\x{20B8}' . 
    '\x{2100}-\x{2101}\x{2103}-\x{2106}\x{2108}-\x{2109}\x{2114}' . 
    '\x{2116}-\x{2118}\x{211E}-\x{2123}\x{2125}\x{2127}\x{2129}\x{212E}' . 
    '\x{213A}-\x{213B}\x{2140}-\x{2144}\x{214A}-\x{214D}\x{214F}' . 
    '\x{2190}-\x{244A}\x{249C}-\x{24E9}\x{2500}-\x{2775}\x{2794}-\x{2B59}' . 
    '\x{2CE5}-\x{2CEA}\x{2CF9}-\x{2CFC}\x{2CFE}-\x{2CFF}\x{2E00}-\x{2E2E}' . 
    '\x{2E30}-\x{3004}\x{3008}-\x{3020}\x{3030}\x{3036}-\x{3037}' . 
    '\x{303D}-\x{303F}\x{309B}-\x{309C}\x{30A0}\x{30FB}\x{3190}-\x{3191}' . 
    '\x{3196}-\x{319F}\x{31C0}-\x{31E3}\x{3200}-\x{321E}\x{322A}-\x{3250}' . 
    '\x{3260}-\x{327F}\x{328A}-\x{32B0}\x{32C0}-\x{33FF}\x{4DC0}-\x{4DFF}' . 
    '\x{A490}-\x{A4C6}\x{A4FE}-\x{A4FF}\x{A60D}-\x{A60F}\x{A673}\x{A67E}' . 
    '\x{A6F2}-\x{A716}\x{A720}-\x{A721}\x{A789}-\x{A78A}\x{A828}-\x{A82B}' . 
    '\x{A836}-\x{A839}\x{A874}-\x{A877}\x{A8CE}-\x{A8CF}\x{A8F8}-\x{A8FA}' . 
    '\x{A92E}-\x{A92F}\x{A95F}\x{A9C1}-\x{A9CD}\x{A9DE}-\x{A9DF}' . 
    '\x{AA5C}-\x{AA5F}\x{AA77}-\x{AA79}\x{AADE}-\x{AADF}\x{ABEB}' . 
    '\x{D800}-\x{F8FF}\x{FB29}\x{FD3E}-\x{FD3F}\x{FDFC}-\x{FDFD}' . 
    '\x{FE10}-\x{FE19}\x{FE30}-\x{FE6B}\x{FEFF}-\x{FF0F}\x{FF1A}-\x{FF20}' . 
    '\x{FF3B}-\x{FF40}\x{FF5B}-\x{FF65}\x{FFE0}-\x{FFFD}'); 

function utf8_strlen($text) { 
    if (function_exists('mb_strlen')) { 
    return mb_strlen($text); 
    } 

    // Do not count UTF-8 continuation bytes. 
    return strlen(preg_replace("/[\x80-\xBF]/", '', $text)); 
} 

function utf8_truncate($string, $max_length, $wordsafe = FALSE, $add_ellipsis = FALSE, $min_wordsafe_length = 1) { 
    $ellipsis = ''; 
    $max_length = max($max_length, 0); 
    $min_wordsafe_length = max($min_wordsafe_length, 0); 

    if (utf8_strlen($string) <= $max_length) { 
    // No truncation needed, so don't add ellipsis, just return. 
    return $string; 
    } 

    if ($add_ellipsis) { 
    // Truncate ellipsis in case $max_length is small. 
    $ellipsis = utf8_substr('...', 0, $max_length); 
    $max_length -= utf8_strlen($ellipsis); 
    $max_length = max($max_length, 0); 
    } 

    if ($max_length <= $min_wordsafe_length) { 
    // Do not attempt word-safe if lengths are bad. 
    $wordsafe = FALSE; 
    } 

    if ($wordsafe) { 
    $matches = array(); 
    // Find the last word boundary, if there is one within $min_wordsafe_length 
    // to $max_length characters. preg_match() is always greedy, so it will 
    // find the longest string possible. 
    $found = preg_match('/^(.{' . $min_wordsafe_length . ',' . $max_length . '})[' . PREG_CLASS_UNICODE_WORD_BOUNDARY . ']/u', $string, $matches); 
    if ($found) { 
     $string = $matches[1]; 
    } 
    else { 
     $string = utf8_substr($string, 0, $max_length); 
    } 
    } 
    else { 
    $string = utf8_substr($string, 0, $max_length); 
    } 

    if ($add_ellipsis) { 
    $string .= $ellipsis; 
    } 

    return $string; 
} 

function utf8_substr($text, $start, $length = NULL) { 
    if (function_exists('mb_substr')) { 
    return $length === NULL ? mb_substr($text, $start) : mb_substr($text, $start, $length); 
    } 
    else { 
    $strlen = strlen($text); 
    // Find the starting byte offset. 
    $bytes = 0; 
    if ($start > 0) { 
     // Count all the continuation bytes from the start until we have found 
     // $start characters or the end of the string. 
     $bytes = -1; 
     $chars = -1; 
     while ($bytes < $strlen - 1 && $chars < $start) { 
     $bytes++; 
     $c = ord($text[$bytes]); 
     if ($c < 0x80 || $c >= 0xC0) { 
      $chars++; 
     } 
     } 
    } 
    elseif ($start < 0) { 
     // Count all the continuation bytes from the end until we have found 
     // abs($start) characters. 
     $start = abs($start); 
     $bytes = $strlen; 
     $chars = 0; 
     while ($bytes > 0 && $chars < $start) { 
     $bytes--; 
     $c = ord($text[$bytes]); 
     if ($c < 0x80 || $c >= 0xC0) { 
      $chars++; 
     } 
     } 
    } 
    $istart = $bytes; 

    // Find the ending byte offset. 
    if ($length === NULL) { 
     $iend = $strlen; 
    } 
    elseif ($length > 0) { 
     // Count all the continuation bytes from the starting index until we have 
     // found $length characters or reached the end of the string, then 
     // backtrace one byte. 
     $iend = $istart - 1; 
     $chars = -1; 
     $last_real = FALSE; 
     while ($iend < $strlen - 1 && $chars < $length) { 
     $iend++; 
     $c = ord($text[$iend]); 
     $last_real = FALSE; 
     if ($c < 0x80 || $c >= 0xC0) { 
      $chars++; 
      $last_real = TRUE; 
     } 
     } 
     // Backtrace one byte if the last character we found was a real character 
     // and we don't need it. 
     if ($last_real && $chars >= $length) { 
     $iend--; 
     } 
    } 
    elseif ($length < 0) { 
     // Count all the continuation bytes from the end until we have found 
     // abs($length) characters, then backtrace one byte. 
     $length = abs($length); 
     $iend = $strlen; 
     $chars = 0; 
     while ($iend > 0 && $chars < $length) { 
     $iend--; 
     $c = ord($text[$iend]); 
     if ($c < 0x80 || $c >= 0xC0) { 
      $chars++; 
     } 
     } 
     // Backtrace one byte if we are not at the beginning of the string. 
     if ($iend > 0) { 
     $iend--; 
     } 
    } 
    else { 
     // $length == 0, return an empty string. 
     return ''; 
    } 

    return substr($text, $istart, max(0, $iend - $istart + 1)); 
    } 
}