UTF-8-correct regex for CamelCase (WikiWord) in Perl

There is a question here about a CamelCase regex. Combined with tchrist's post, I am wondering what the correct UTF-8 CamelCase regex would be.

Starting with brian d foy's regex:

/ 
    \b   # start at word boundary 
    [A-Z]  # start with upper 
    [a-zA-Z]* # followed by any alpha 

    (?: # non-capturing grouping for alternation precedence 
     [a-z][a-zA-Z]*[A-Z] # next bit is lower, any zero or more, ending with upper 
      |      # or 
     [A-Z][a-zA-Z]*[a-z] # next bit is upper, any zero or more, ending with lower 
    ) 

    [a-zA-Z]* # anything that's left 
    \b   # end at word 
/x

and modifying it to:

/ 
    \b   # start at word boundary 
    \p{Uppercase_Letter}  # start with upper 
    \p{Alphabetic}*   # followed by any alpha 

    (?: # non-capturing grouping for alternation precedence 
     \p{Lowercase_Letter}[a-zA-Z]*\p{Uppercase_Letter} ### next bit is lower, any zero or more, ending with upper 
      |     # or 
     \p{Uppercase_Letter}[a-zA-Z]*\p{Lowercase_Letter} ### next bit is upper, any zero or more, ending with lower 
    ) 

    \p{Alphabetic}*   # anything that's left 
    \b   # end at word 
/x

The lines I have a problem with are marked with '###'.

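Also, how should the regex be modified if digits and the underscore are treated as equivalent to lowercase letters, so that W2X3 is a valid CamelCase word?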

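Update (after ysth's comments):

In what follows, "any" means "an uppercase letter, a lowercase letter, a digit, or an underscore".

The regex should match CamelWord and CaW:

starts with an uppercase letter
optionally followed by any
a lowercase letter, digit, or underscore
optionally followed by any
an uppercase letter
optionally followed by any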
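
Read literally, the six steps above might translate into a pattern like the following. This is only a sketch of one possible reading, not an accepted answer: the names $any, $camel_spec and is_camel are hypothetical, and "letter" is assumed to mean any code point with the Upper or Lower property, with "digit" taken as \p{Nd}.

```perl
use strict;
use warnings;

# "any" as defined above: uppercase, lowercase, digit, or underscore.
my $any = qr/[\p{Upper}\p{Lower}\p{Nd}_]/;

# Hypothetical pattern following the list step by step.
my $camel_spec = qr{
    \b
    \p{Upper}            # starts with an uppercase character
    $any*                # optional any
    [\p{Lower}\p{Nd}_]   # a lowercase character, digit, or underscore
    $any*                # optional any
    \p{Upper}            # an uppercase character
    $any*                # optional any
    \b
}x;

# True when the whole string is a CamelCase word under this reading.
sub is_camel {
    my ($word) = @_;
    return $word =~ m{\A (?: $camel_spec ) \z}x ? 1 : 0;
}

for my $word (qw(CamelWord CaW W2X3 Camel camelCase ABC)) {
    printf "%-10s %s\n", $word, is_camel($word) ? "match" : "no match";
}
```

Under this reading W2X3 matches because the digit 2 fills the "lowercase, digit, or underscore" slot, while Camel and ABC do not match.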

Please do not mark this as a duplicate, because it is not one. The original question (and its answers) only considered ASCII.

2011-06-12 jm666

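As an aside, you've started with a really odd regex; I think it is no different from the simpler '/\b[A-Z]+[a-z][A-Za-z]*\b/' (a "word" consisting only of letters, beginning with an uppercase letter and including at least one lowercase letter). (Update: I was wrong; the original regex requires at least three letters.) – ysth 2011-06-12 16:25:14

In any case, please don't start from an ASCII regex; start by defining as precisely as possible what you want to match. – ysth 2011-06-12 16:29:01

Updated the question with (I hope) a precise enough definition. – jm666 2011-06-12 17:02:57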

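I really can't tell what you are trying to do, but this should be closer to what your original intent seems to have been. I still can't tell exactly what you mean, though.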

m{ 
    \b 
    \p{Upper}  # start with uppercase code point (NOT LETTER) 

    \w*   # optional ident chars 

    # note that upper and lower are not related to letters 
    (?: \p{Lower} \w* \p{Upper} 
     | \p{Upper} \w* \p{Lower} 
    ) 

    \w* 

    \b 
}x
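
As a quick check, the pattern above can be run against a handful of sample words. This is a minimal sketch: the variable $camel, the helper matches_whole and the sample words are illustrative choices, and the pattern is anchored with \A and \z so that each word is tested as a whole.

```perl
use strict;
use warnings;

# The pattern from the answer, stored in a reusable qr object.
my $camel = qr{
    \b
    \p{Upper}                       # uppercase code point
    \w*                             # optional identifier characters
    (?: \p{Lower} \w* \p{Upper}
      | \p{Upper} \w* \p{Lower}
    )
    \w*
    \b
}x;

# Hypothetical helper: true when the entire string matches.
sub matches_whole {
    my ($word) = @_;
    return $word =~ m{\A (?: $camel ) \z}x ? 1 : 0;
}

for my $word (qw(CamelCase CaW ABc Camel ABC W2X3)) {
    printf "%-10s %s\n", $word, matches_whole($word) ? "match" : "no match";
}
```

Note that W2X3 does not match here: \w lets digits appear inside the word, but the alternation still requires at least one \p{Lower} code point.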

Never use [a-z]. In fact, don't use \p{Lowercase_Letter} or \p{Ll} either, because those are not the same as the more desirable and more correct \p{Lowercase} and \p{Lower}.
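
The difference can be seen on a few individual code points. The sketch below is illustrative only: the sample code points are my own picks (characters that carry the Lowercase or Uppercase property without being in general category Ll or Lu), and the helper names are hypothetical.

```perl
use strict;
use warnings;

# Helpers testing a single code point against each property.
sub is_lower { return chr($_[0]) =~ /\A\p{Lower}\z/ ? 1 : 0 }
sub is_ll    { return chr($_[0]) =~ /\A\p{Ll}\z/    ? 1 : 0 }
sub is_upper { return chr($_[0]) =~ /\A\p{Upper}\z/ ? 1 : 0 }
sub is_lu    { return chr($_[0]) =~ /\A\p{Lu}\z/    ? 1 : 0 }

my @samples = (
    [ 0x0061, 'LATIN SMALL LETTER A (GC=Ll)' ],
    [ 0x2170, 'SMALL ROMAN NUMERAL ONE (GC=Nl)' ],
    [ 0x24D0, 'CIRCLED LATIN SMALL LETTER A (GC=So)' ],
    [ 0x02B0, 'MODIFIER LETTER SMALL H (GC=Lm)' ],
    [ 0x0345, 'COMBINING GREEK YPOGEGRAMMENI (GC=Mn)' ],
    [ 0x2160, 'ROMAN NUMERAL ONE (GC=Nl)' ],
    [ 0x24B6, 'CIRCLED LATIN CAPITAL LETTER A (GC=So)' ],
);

for my $sample (@samples) {
    my ($cp, $name) = @$sample;
    printf "U+%04X Lower=%d Ll=%d Upper=%d Lu=%d  %s\n",
        $cp, is_lower($cp), is_ll($cp), is_upper($cp), is_lu($cp), $name;
}
```

A pattern written with \p{Ll} and \p{Lu} would silently skip every sample except the plain ASCII letter.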

And remember that \w is really just

[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Letter_Number}\p{Connector_Punctuation}]
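
To see what that character class means in practice, the following sketch checks \w against one sample from several of those categories. The helper name is_word_char and the chosen code points are illustrative assumptions, not part of the original answer.

```perl
use strict;
use warnings;

# True when the single-character string matches \w.
sub is_word_char {
    my ($char) = @_;
    return $char =~ /\A\w\z/ ? 1 : 0;
}

my @cases = (
    [ "a",        'letter (Alphabetic)' ],
    [ "\x{301}",  'COMBINING ACUTE ACCENT (Mark)' ],
    [ "7",        'digit (Decimal_Number)' ],
    [ "\x{2167}", 'ROMAN NUMERAL EIGHT (Letter_Number)' ],
    [ "_",        'LOW LINE (Connector_Punctuation)' ],
    [ "\x{203F}", 'UNDERTIE (Connector_Punctuation)' ],
    [ "-",        'HYPHEN-MINUS (not a word character)' ],
);

for my $case (@cases) {
    my ($char, $label) = @$case;
    printf "U+%04X %-9s %s\n",
        ord($char), is_word_char($char) ? "word" : "non-word", $label;
}
```

So digits and the underscore already count as "identifier characters" wherever the pattern uses \w, but they are not \p{Lower}.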

2011-06-12 18:19:13 tchrist

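Why are 'Lowercase' and 'Lower' preferable (i.e. what do they include that 'Ll' doesn't)? And what is the difference between 'Lowercase' and 'Lower', if any? – ikegami 2011-06-12 21:32:11

@ikegami: 'Lowercase' and 'Lower' are the same, and are the union of 'GC=Lowercase_Letter' and 'Other_Lowercase=True'. There are 201 code points that are either 'Lower' *but not* 'GC=Ll', or else 'Upper' *but not* 'GC=Lu'. These include 'GC=Mn', 'GC=Lm', 'GC=Nl', and 'GC=So' code points. ***Sorry, I really thought this was all common knowledge by now!*** Run unichars -gs '/(?=\P{Ll})\p{Lower}/x || /(?=\P{Lu})\p{Upper}/x' | ucsort --upper-before-lower | cat -n | less to see what I mean. Those programs are in my [unicode toolchest](http://training.perl.com/scripts/). – tchrist 2011-06-12 23:36:07

@tchrist - the link to the unicode toolchest is dead (at least for now). Are there any alternatives? – jm666 2014-05-15 15:09:36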
