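Trim unicode whitespace in PHP 5.2
How can I trim a string(6) " page", where the first whitespace is a 0xc2a0 non-breaking space? I have tried trim() and preg_match('/^\s*(.*)\s*$/u', $key, $m);.
Another question: how can I reliably copy these characters? They seem to get converted to a "normal" space, which makes this hard to debug.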
preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u','',$str);
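One thing worth knowing about the one-liner above: with the u modifier, preg_replace() returns NULL if the subject is not valid UTF-8 (for example an ISO-8859-1 string with a bare 0xA0 byte). Here is a minimal sketch of a wrapper that guards against that. The name safe_unicode_trim is made up for illustration, and returning the input untouched on failure is just one possible policy.

```php
<?php
// Hypothetical helper (not from the answer above): trims Unicode separators (\pZ)
// and "other" characters (\pC) from both ends of a UTF-8 string.
function safe_unicode_trim($str)
{
    $trimmed = preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u', '', $str);
    // With /u, preg_replace() returns NULL when the subject is not valid UTF-8.
    // Assumption: in that case we simply hand back the original string.
    return $trimmed === null ? $str : $trimmed;
}

$utf8  = "\xC2\xA0page"; // U+00A0 encoded as UTF-8, followed by "page"
$latin = "\xA0page";     // the same character as a single ISO-8859-1 byte

var_dump(trim($utf8) === $utf8);                // trim() leaves the NBSP bytes alone
var_dump(safe_unicode_trim($utf8));             // the NBSP is gone
var_dump(safe_unicode_trim($latin) === $latin); // invalid UTF-8, returned untouched
```

If the fallback fires often, the input is probably not UTF-8 in the first place and should be converted before trimming.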
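Unfortunately I found the answer myself just before I saw yours. But you forgot the 'u' modifier, and 's' makes no sense here. Working code: 'preg_replace('/^\p{Z}+|\p{Z}+$/u','',$key)'. – Znarkus 2010-11-12 17:59:30
Updated to 'preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u''. – Znarkus 2010-11-12 18:11:14
@Znarkus: Does PHP not support, or at least not encourage, non-insane mode for regexes via '/x' or '(?x)'? – tchrist 2010-11-13 13:57:18
Maybe something from the multibyte string function set? http://php.net/manual/en/function.mb-ereg.php I don't see an mb_trim, but there is a set of MB-safe regex functions.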
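preg_* with the u option (**not** U) handles Unicode when the input is presented as UTF-8, where the non-breaking space shows up as Â followed by a space. – 2010-11-12 17:04:28
PCRE Unicode properties can be used to achieve this.
Here is the code I played with, which seems to do what you want: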
<?php
function unicode_trim ($str) {
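// Strip leading or trailing runs of separators (\pZ) and "other" characters (\pC).
// Using alternation means the string is still trimmed when only one end has whitespace.
return preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u', '', $str);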
}
$key = chr(0xc2) . chr(0xa0) . '#page#' . chr(0xc2) . chr(0xa0);
var_dump(unicode_trim($key));
Result:
[~]> php e.php
string(6) "#page#"
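Explanation:
\p{xx} matches a character with the xx property; \P{xx} matches a character without the xx property.
If xx is a single letter, the {} can be dropped, e.g. \p{Z} is the same as \pZ.
Z stands for all separators, C stands for all "other" characters (for example control characters).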
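To make the notation concrete, here is a small check of \p, \P and the brace-less short form against a non-breaking space and a plain letter. The variable names are only for illustration.

```php
<?php
$nbsp   = "\xC2\xA0"; // U+00A0 NO-BREAK SPACE, general category Zs (a separator)
$letter = "a";        // general category Ll, neither Z nor C

var_dump(preg_match('/\A\p{Z}\z/u', $nbsp));   // long form: has the Z property
var_dump(preg_match('/\A\pZ\z/u', $nbsp));     // short form, same meaning
var_dump(preg_match('/\A\PZ\z/u', $nbsp));     // upper-case P negates the property
var_dump(preg_match('/\A\PZ\z/u', $letter));   // "a" is not a separator
var_dump(preg_match('/\A[\pZ\pC]\z/u', $letter)); // nor a separator-or-other character
```

The u modifier matters here: without it the subject is treated as single bytes, so the two-byte NBSP sequence cannot match a single \pZ.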
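That would remove any non-letter characters (#!%123 ...) – 2010-11-12 17:06:02
You're right, I have updated the regex now – 2010-11-12 17:34:04
Sorry, but this doesn't trim trailing whitespace – Znarkus 2010-11-12 17:56:50
This page may help: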
http://nadeausoftware.com/articles/2007/9/php_tip_how_strip_punctuation_characters_web_page
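Awesome, thanks! So this should also remove those dreaded Apple logo characters? Excellent :-) – Znarkus 2010-11-12 18:10:42
The existing solutions mention only \pZ characters. However, there are 6 Unicode whitespace characters that fall outside that property: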
% unichars '\p{WhiteSpace}' '\PZ'
-- 9 0009 CHARACTER TABULATION
-- 10 000A LINE FEED (LF)
-- 11 000B LINE TABULATION
-- 12 000C FORM FEED (FF)
-- 13 000D CARRIAGE RETURN (CR)
-- 133 0085 NEXT LINE (NEL)
All six of these are of type \pC, specifically of type \p{Cc}. However, there are also 59 non-whitespace code points that are \p{Cc} as well:
% unichars '\P{WhiteSpace}' '\p{Cc}' | wc -l
59
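The same observation can be checked from PHP. This is a rough sketch that tests the six code points listed above against \pZ and \p{Cc}. The array layout is my own, and the characters are written as UTF-8 byte escapes so nothing gets lost in copy and paste.

```php
<?php
// The six White_Space code points that are not \pZ, as UTF-8 byte strings.
$chars = array(
    'U+0009' => "\x09",     // CHARACTER TABULATION
    'U+000A' => "\x0A",     // LINE FEED (LF)
    'U+000B' => "\x0B",     // LINE TABULATION
    'U+000C' => "\x0C",     // FORM FEED (FF)
    'U+000D' => "\x0D",     // CARRIAGE RETURN (CR)
    'U+0085' => "\xC2\x85", // NEXT LINE (NEL)
);

foreach ($chars as $name => $ch) {
    // \A and \z anchor at the absolute start and end of the subject.
    $isZ  = preg_match('/\A\pZ\z/u', $ch);
    $isCc = preg_match('/\A\p{Cc}\z/u', $ch);
    printf("%s  \\pZ=%d  \\p{Cc}=%d\n", $name, $isZ, $isCc);
}
```

This is why a trim built on \pZ alone leaves tabs and newlines in place, while [\pZ\pC] catches them.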
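The simple version of my own test for whether something is a printing character or not is simply [\pZ\pC]: anything that matches is treated as non-printing. For example, that is what unichars uses.
A more careful test would consider whether the character should take up 0, 1, or 2 print columns. That means considering whether it is a combining mark (property \pM) and whether it has the halfwidth or fullwidth property. For example: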
% uniprops ff5e ffeb
U+FF5E ‹~› \N{ FULLWIDTH TILDE }:
\pS \p{Sm}
All Any Assigned InHalfwidthAndFullwidthForms Changes_When_NFKC_Casefolded
CWKCF Common Zyyy Sm S Gr_Base Grapheme_Base Graph GrBase Math
Math_Symbol Print Symbol
U+FFEB ‹→› \N{ HALFWIDTH RIGHTWARDS ARROW }:
\pS \p{Sm}
All Any Assigned InHalfwidthAndFullwidthForms Changes_When_NFKC_Casefolded
CWKCF Common Zyyy Sm S Gr_Base Grapheme_Base Graph GrBase Math
Math_Symbol Print Symbol
For these, you will need to use the non-binary East Asian Width property. These all apply:
% uniprops -l | grep -i width
Block:Halfwidth_And_Fullwidth_Forms
InHalfwidthAndFullwidthForms
East_Asian_Width:A
East_Asian_Width=Ambiguous
East_Asian_Width:Ambiguous
East_Asian_Width:F
East_Asian_Width=Fullwidth
East_Asian_Width:Fullwidth
East_Asian_Width:H
East_Asian_Width=Halfwidth
East_Asian_Width:Halfwidth
East_Asian_Width=Neutral
East_Asian_Width:Na
East_Asian_Width=Narrow
East_Asian_Width:Narrow
East_Asian_Width:Neutral
East_Asian_Width:W
East_Asian_Width=Wide
East_Asian_Width:Wide
Those have abbreviations like \p{Ea=F} and \p{Ea=H}. There are a bunch of these:
% uninames '(FULL|HALF)WIDTH' | wc -l
454
Of course, you can't go by the names of these things, only by their properties:
% unichars '[\p{Ea=F}\p{Ea=H}]' | wc -l
227
% unichars '[\p{Ea=F}\p{Ea=H}\p{Ea=Na}]' | wc -l
338
% unichars '[\p{Ea=F}\p{Ea=H}\p{Ea=Na}\pM]' | wc -l
1488
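As far as I know, PHP's PCRE does not accept \p{Ea=F} style East Asian Width properties, so the 0/1/2 column idea has to be approximated differently there. Below is a rough sketch that assumes the mbstring extension is loaded: it drops combining marks (\pM, zero columns) and lets mb_strwidth() count wide characters as two columns. The function name display_width is hypothetical, and the result is only as accurate as mb_strwidth()'s own width table.

```php
<?php
// Hypothetical helper: approximate number of print columns for a UTF-8 string.
// Assumes the mbstring extension is available.
function display_width($str)
{
    // Combining marks (\pM) attach to the preceding character: 0 columns.
    $base = preg_replace('/\pM+/u', '', $str);
    if ($base === null) {
        // Not valid UTF-8; fall back to the byte count (an arbitrary choice).
        return strlen($str);
    }
    // mb_strwidth() counts fullwidth/wide characters as 2 columns, most others as 1.
    return mb_strwidth($base, 'UTF-8');
}

var_dump(display_width("n\xCC\x83"));    // "n" + U+0303 COMBINING TILDE
var_dump(display_width("\xEF\xBD\x9E")); // U+FF5E FULLWIDTH TILDE
var_dump(display_width("page"));         // plain ASCII
```

Ambiguous-width characters (Ea=A) are not handled specially here; how wide they render depends on the terminal or font.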
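To show you just how many of these property things there really are, here is a full property dump of three different characters, run against Unicode 5.2: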
% uniprops -ga NEL "COMBINING TILDE" ff5e
U+0085 ‹U+0085› \N{ NEXT LINE (NEL) }:
\s \v \R \pC \p{Cc}
All Any Assigned InLatin1 C Other Cc Cntrl Common Zyyy Control Pat_WS Pattern_White_Space PatWS Space SpacePerl VertSpace
White_Space WSpace
Age:1.1 Bidi_Class:B Bidi_Class=Paragraph_Separator Bidi_Class:Paragraph_Separator Bc=B Block:Latin_1
Block=Latin_1_Supplement Block:Latin_1_Supplement Blk=Latin1 General_Category=Other Canonical_Combining_Class:0
Canonical_Combining_Class=Not_Reordered Canonical_Combining_Class:Not_Reordered Ccc=NR Canonical_Combining_Class:NR
General_Category=Control Script=Common Decomposition_Type:None Dt=None East_Asian_Width=Neutral East_Asian_Width:Neutral
General_Category:C General_Category:Cc General_Category:Cntrl General_Category:Control Gc=Cc General_Category:Other Gc=C
Grapheme_Cluster_Break:CN Grapheme_Cluster_Break=Control Grapheme_Cluster_Break:Control GCB=CN Hangul_Syllable_Type:NA
Hangul_Syllable_Type=Not_Applicable Hangul_Syllable_Type:Not_Applicable Hst=NA Joining_Group:No_Joining_Group
Jg=NoJoiningGroup Joining_Type:Non_Joining Jt=U Joining_Type:U Joining_Type=Non_Joining Line_Break:Next_Line Lb=NL
Line_Break:NL Line_Break=Next_Line Numeric_Type:None Nt=None Numeric_Value:NaN Nv=NaN Present_In:1.1 Age=1.1 In=1.1
Present_In:2.0 In=2.0 Present_In:2.1 In=2.1 Present_In:3.0 In=3.0 Present_In:3.1 In=3.1 Present_In:3.2 In=3.2
Present_In:4.0 In=4.0 Present_In:4.1 In=4.1 Present_In:5.0 In=5.0 Present_In:5.1 In=5.1 Present_In:5.2 In=5.2
Script:Common Sc=Zyyy Script:Zyyy Sentence_Break:SE Sentence_Break=Sep Sentence_Break:Sep SB=SE Word_Break:Newline WB=NL
Word_Break:NL Word_Break=Newline
U+0303 ‹̃› \N{ COMBINING TILDE }:
\w \pM \p{Mn}
All Any Assigned InCombiningDiacriticalMarks Case_Ignorable CI Dia Diacritic M Mn Gr_Ext Grapheme_Extend Graph GrExt
ID_Continue IDC Inherited Zinh Mark Nonspacing_Mark Print Qaai Word XID_Continue XIDC
Age:1.1 Bidi_Class:Nonspacing_Mark Bc=NSM Bidi_Class:NSM Bidi_Class=Nonspacing_Mark Block:Combining_Diacritical_Marks
Canonical_Combining_Class:230 Canonical_Combining_Class=Above Canonical_Combining_Class:A
Canonical_Combining_Class:Above Ccc=A Decomposition_Type:None Dt=None East_Asian_Width:A East_Asian_Width=Ambiguous
East_Asian_Width:Ambiguous Ea=A General_Category:M General_Category=Mark General_Category:Mark Gc=M General_Category:Mn
General_Category=Nonspacing_Mark General_Category:Nonspacing_Mark Gc=Mn Grapheme_Cluster_Break:EX
Grapheme_Cluster_Break=Extend Grapheme_Cluster_Break:Extend GCB=EX Hangul_Syllable_Type:NA
Hangul_Syllable_Type=Not_Applicable Hangul_Syllable_Type:Not_Applicable Hst=NA Script=Inherited
Joining_Group:No_Joining_Group Jg=NoJoiningGroup Joining_Type:T Joining_Type=Transparent Joining_Type:Transparent Jt=T
Line_Break:CM Line_Break=Combining_Mark Line_Break:Combining_Mark Lb=CM NFC_Quick_Check:M NFC_Quick_Check=Maybe
NFC_Quick_Check:Maybe NFCQC=M NFKC_Quick_Check:M NFKC_Quick_Check=Maybe NFKC_Quick_Check:Maybe NFKCQC=M
Numeric_Type:None Nt=None Numeric_Value:NaN Nv=NaN Present_In:1.1 Age=1.1 In=1.1 Present_In:2.0 In=2.0 Present_In:2.1
In=2.1 Present_In:3.0 In=3.0 Present_In:3.1 In=3.1 Present_In:3.2 In=3.2 Present_In:4.0 In=4.0 Present_In:4.1 In=4.1
Present_In:5.0 In=5.0 Present_In:5.1 In=5.1 Present_In:5.2 In=5.2 Script:Inherited Sc=Zinh Script:Qaai Script:Zinh
Sentence_Break:EX Sentence_Break=Extend Sentence_Break:Extend SB=EX Word_Break:Extend WB=Extend
U+FF5E ‹~› \N{ FULLWIDTH TILDE }:
\pS \p{Sm}
All Any Assigned InHalfwidthAndFullwidthForms Changes_When_NFKC_Casefolded CWKCF Common Zyyy Sm S Gr_Base Grapheme_Base
Graph GrBase Math Math_Symbol Print Symbol
Age:1.1 Bidi_Class:ON Bidi_Class=Other_Neutral Bidi_Class:Other_Neutral Bc=ON Block:Halfwidth_And_Fullwidth_Forms
Canonical_Combining_Class:0 Canonical_Combining_Class=Not_Reordered Canonical_Combining_Class:Not_Reordered Ccc=NR
Canonical_Combining_Class:NR Script=Common Decomposition_Type:Non_Canon Decomposition_Type=Non_Canonical
Decomposition_Type:Non_Canonical Dt=NonCanon Decomposition_Type:Wide Dt=Wide East_Asian_Width:F
East_Asian_Width=Fullwidth East_Asian_Width:Fullwidth Ea=F General_Category:Math_Symbol Gc=Sm General_Category:S
General_Category=Symbol General_Category:Sm General_Category=Math_Symbol General_Category:Symbol Gc=S
Grapheme_Cluster_Break:Other GCB=XX Grapheme_Cluster_Break:XX Grapheme_Cluster_Break=Other Hangul_Syllable_Type:NA
Hangul_Syllable_Type=Not_Applicable Hangul_Syllable_Type:Not_Applicable Hst=NA Joining_Group:No_Joining_Group
Jg=NoJoiningGroup Joining_Type:Non_Joining Jt=U Joining_Type:U Joining_Type=Non_Joining Line_Break:ID
Line_Break=Ideographic Line_Break:Ideographic Lb=ID Numeric_Type:None Nt=None Numeric_Value:NaN Nv=NaN Present_In:1.1
Age=1.1 In=1.1 Present_In:2.0 In=2.0 Present_In:2.1 In=2.1 Present_In:3.0 In=3.0 Present_In:3.1 In=3.1 Present_In:3.2
In=3.2 Present_In:4.0 In=4.0 Present_In:4.1 In=4.1 Present_In:5.0 In=5.0 Present_In:5.1 In=5.1 Present_In:5.2 In=5.2
Script:Common Sc=Zyyy Script:Zyyy Sentence_Break:Other SB=XX Sentence_Break:XX Sentence_Break=Other Word_Break:Other
WB=XX Word_Break:XX Word_Break=Other
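Pretty neat, eh?
If you have read this far and are wondering where to get the three Unicode utilities shown above (uniprops, unichars, and uninames), please send me mail, because the link is not working at the moment.
There are also some things in the Other/Format category: Mongolian vowel separator, zero width (space/joiner/non-joiner), word joiner, and zero width no-break space. Whether they should be stripped is the subject of much debate. – 2015-08-17 20:39:16
But here is my only solution, because sometimes there are UTF-8 spaces: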
$stringg = preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u','',$stringg);
$stringg = preg_replace('/\s+/u', '', $stringg);
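To see what each of the two calls does on its own, here is a small demonstration on a made-up input with non-breaking spaces at both ends and an ordinary space in the middle. The intermediate variable names are mine.

```php
<?php
// Sample input: NBSP + space, "foo bar", space + NBSP.
$stringg = "\xC2\xA0 foo bar \xC2\xA0";

// Step 1: strip leading/trailing separators (\pZ) and control/other characters (\pC).
$step1 = preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u', '', $stringg);

// Step 2: delete every remaining whitespace run, including the ones inside the string.
$step2 = preg_replace('/\s+/u', '', $step1);

var_dump($step1); // only the ends are trimmed
var_dump($step2); // the interior space is gone too
```

If the goal is only to trim, the first call is enough; the second call changes the content of the string.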
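What does it do? Trim Unicode whitespace, then remove all whitespace in the string? – Znarkus 2014-03-25 06:33:32
What do you mean by "copy"? Copy what from where? From a web page into an IDE/editor? – 2010-11-12 17:05:26
@antil From the browser, I'd say. I have a similar problem with Chrome, Firefox, etc., just in ISO-8859-1 – 2011-01-08 16:44:36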
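On the debugging side of the question: rather than copying the characters around, it may be easier to look at the raw bytes, which survive any copy and paste. A minimal sketch follows; show_bytes is a hypothetical helper, and the sample string is built from byte escapes so the non-breaking space cannot be silently converted.

```php
<?php
$key = "\xC2\xA0page"; // what var_dump() reports as string(6) " page"

// Hex of the raw bytes: the NBSP shows up as c2a0 instead of a plain 20.
echo bin2hex($key), "\n";

// Hypothetical helper: escape every control or non-ASCII byte as \xNN so the
// string can be pasted into a double-quoted PHP literal without losing anything.
function show_bytes($str)
{
    $out = '';
    for ($i = 0, $n = strlen($str); $i < $n; $i++) {
        $byte = ord($str[$i]);
        $out .= ($byte < 0x20 || $byte > 0x7E) ? sprintf('\\x%02X', $byte) : $str[$i];
    }
    return $out;
}

echo show_bytes($key), "\n";
```

Writing test strings with "\xC2\xA0" or chr(0xc2) . chr(0xa0), as in the answer above, avoids depending on what the browser or editor does with the character.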