2012-03-20 111 views
1

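Why does Unicode have several reserved character codes?
Consider two languages in Unicode: Kannada and Tamil. Both languages are ancient, and I don't think there is any chance of new characters being added for them.
Edit: Then why did they waste some character codes by making them reserved?
Why didn't they put the reserved character codes at the end of each language's character set?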

+0

I understand that you're curious, but do you have another reason for asking? – 2012-03-20 16:20:05

+0

Please clarify: do you mean to ask why there are unassigned slots in these blocks? – tchrist 2012-03-20 16:21:35

+2

@Oded I think you misunderstood his question, because your question is a *non sequitur*. I'm not sure it's even apropos. – tchrist 2012-03-20 16:22:35

Answers

3

This has to do with how the Unicode Consortium lays out its blocks, scripts, and code points. For example, Block=Tamil starts out this way:

$ unichars '\p{Block=Tamil}' | head -20 
U+00B82 ‭ ◌ஂ GC=Mn SC=Tamil  TAMIL SIGN ANUSVARA 
U+00B83 ‭ ஃ GC=Lo SC=Tamil  TAMIL SIGN VISARGA 
U+00B85 ‭ அ GC=Lo SC=Tamil  TAMIL LETTER A 
U+00B86 ‭ ஆ GC=Lo SC=Tamil  TAMIL LETTER AA 
U+00B87 ‭ இ GC=Lo SC=Tamil  TAMIL LETTER I 
U+00B88 ‭ ஈ GC=Lo SC=Tamil  TAMIL LETTER II 
U+00B89 ‭ உ GC=Lo SC=Tamil  TAMIL LETTER U 
U+00B8A ‭ ஊ GC=Lo SC=Tamil  TAMIL LETTER UU 
U+00B8E ‭ எ GC=Lo SC=Tamil  TAMIL LETTER E 
U+00B8F ‭ ஏ GC=Lo SC=Tamil  TAMIL LETTER EE 
U+00B90 ‭ ஐ GC=Lo SC=Tamil  TAMIL LETTER AI 
U+00B92 ‭ ஒ GC=Lo SC=Tamil  TAMIL LETTER O 
U+00B93 ‭ ஓ GC=Lo SC=Tamil  TAMIL LETTER OO 
U+00B94 ‭ ஔ GC=Lo SC=Tamil  TAMIL LETTER AU 
U+00B95 ‭ க GC=Lo SC=Tamil  TAMIL LETTER KA 
U+00B99 ‭ ங GC=Lo SC=Tamil  TAMIL LETTER NGA 
U+00B9A ‭ ச GC=Lo SC=Tamil  TAMIL LETTER CA 
U+00B9C ‭ ஜ GC=Lo SC=Tamil  TAMIL LETTER JA 
U+00B9E ‭ ஞ GC=Lo SC=Tamil  TAMIL LETTER NYA 
U+00B9F ‭ ட GC=Lo SC=Tamil  TAMIL LETTER TTA 

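They tend to reserve contiguous runs of 4, 8, or 16 code points for characters that are all of the same "kind". Yes, that leaves gaps, but it is just like a file system: once a sector has been allocated to a file (or a whole block, where blocks have no individual sectors), you don't hand the unused bytes to anything else, even if the file doesn't fill that last part. Things tend to get padded out to block boundaries anyway.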
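To see where the gaps fall, here is a minimal sketch in Python that walks the Tamil block and reports each contiguous run of unassigned code points. It assumes the block range U+0B80..U+0BFF from the Unicode `Blocks.txt` data file and relies on the standard `unicodedata` module, so the result reflects whichever Unicode version your Python build ships with. The function name `unassigned_runs` is made up for this illustration.

```python
import unicodedata

# Tamil block range as listed in the Unicode Blocks.txt data file.
TAMIL_START, TAMIL_END = 0x0B80, 0x0BFF


def unassigned_runs(start, end):
    """Return (first, last) pairs of contiguous unassigned code points."""
    runs = []
    run_start = None
    for cp in range(start, end + 1):
        # General category "Cn" means "unassigned" in the Unicode
        # version bundled with this Python build.
        if unicodedata.category(chr(cp)) == "Cn":
            if run_start is None:
                run_start = cp
        elif run_start is not None:
            runs.append((run_start, cp - 1))
            run_start = None
    if run_start is not None:
        # The block ends inside an unassigned run.
        runs.append((run_start, end))
    return runs


if __name__ == "__main__":
    print("Unicode data version:", unicodedata.unidata_version)
    for first, last in unassigned_runs(TAMIL_START, TAMIL_END):
        count = last - first + 1
        print(f"U+{first:04X}..U+{last:04X}  ({count} unassigned)")
```

The gaps it prints line up with the holes in the `unichars` listing above, for instance the single slot at U+0B84 between the signs and the first vowel letter.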

It's not as though we're at any risk of running out of code points.
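As a rough back-of-the-envelope check on that claim, the following sketch counts how much of the code space is still unassigned. The total of 0x110000 code points (U+0000..U+10FFFF) is fixed by the standard. The unassigned count depends on the Unicode version bundled with your Python, and general category `Cn` also includes the 66 noncharacters, which will never be assigned, so treat the figure as approximate. The helper name `count_unassigned` is invented for this example.

```python
import unicodedata

# The Unicode code space runs from U+0000 to U+10FFFF inclusive.
TOTAL_CODE_POINTS = 0x110000


def count_unassigned():
    """Count code points whose general category is Cn (unassigned)."""
    return sum(
        1
        for cp in range(TOTAL_CODE_POINTS)
        if unicodedata.category(chr(cp)) == "Cn"
    )


if __name__ == "__main__":
    free = count_unassigned()
    share = 100.0 * free / TOTAL_CODE_POINTS
    print("Unicode data version:", unicodedata.unidata_version)
    print(f"total code points: {TOTAL_CODE_POINTS}")
    print(f"unassigned (Cn):   {free} ({share:.1f}% of the code space)")
```

Against that much free space, a few dozen empty slots inside a 128-code-point block are negligible.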

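Notice that the allocated region begins with "signs", as shown by the first assigned code points in the block. The gaps probably mark the transition from one kind of character to another. If you check the properties of the first six code points in the block, you'll see that the unassigned ones still have the correct Block property: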

$ uniprops -a U+00B80 U+00B81 U+00B82 U+00B83 U+00B84 U+00B85 
U+0B80 ‹U+0B80› \N{U+0B80} 
    \pC \p{Cn} 
    All Any InTamil C Other Cn Unassigned Zzzz Unknown 
    Age=Unassigned Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered 
     CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX 
     Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group 
     JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=Unknown LB=XX Line_Break=XX Numeric_Type=None NT=None 
     Numeric_Value=NaN NV=NaN Present_In=Unassigned IN=Unassigned Script=Unknown SC=Zzzz Script=Zzzz Sentence_Break=Other SB=XX 
     Sentence_Break=XX Word_Break=Other WB=XX Word_Break=XX 
U+0B81 ‹U+0B81› \N{U+0B81} 
    \pC \p{Cn} 
    All Any InTamil C Other Cn Unassigned Zzzz Unknown 
    Age=Unassigned Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered 
     CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX 
     Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group 
     JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=Unknown LB=XX Line_Break=XX Numeric_Type=None NT=None 
     Numeric_Value=NaN NV=NaN Present_In=Unassigned IN=Unassigned Script=Unknown SC=Zzzz Script=Zzzz Sentence_Break=Other SB=XX 
     Sentence_Break=XX Word_Break=Other WB=XX Word_Break=XX 
U+0B82 ‹◌ஂ› \N{TAMIL SIGN ANUSVARA} 
    \w \pM \p{Mn} 
    All Any Alnum Alpha Alphabetic Assigned InTamil Tamil Is_Tamil Case_Ignorable CI M Mn Gr_Ext Grapheme_Extend Graph GrExt ID_Continue IDC 
     Mark Nonspacing_Mark Print Taml Word XID_Continue XIDC X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word 
    Age=1.1 Bidi_Class=Nonspacing_Mark BC=NSM Bidi_Class=NSM Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered 
     CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=EX 
     Grapheme_Cluster_Break=Extend GCB=EX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group 
     JG=NoJoiningGroup Joining_Type=T Joining_Type=Transparent JT=T Line_Break=CM Line_Break=Combining_Mark LB=CM Numeric_Type=None NT=None 
     Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1 
     Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 
     Present_In=6.0 IN=6.0 Script=Tamil SC=Taml Script=Taml Sentence_Break=EX Sentence_Break=Extend SB=EX Word_Break=Extend WB=Extend 
U+0B83 ‹ஃ› \N{TAMIL SIGN VISARGA} 
    \w \pL \p{L_} \p{Lo} 
    All Any Alnum Alpha Alphabetic Assigned InTamil Tamil Is_Tamil L Lo Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter 
     L_ Other_Letter Print Taml Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word 
    Age=1.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR 
     Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX 
     Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group 
     JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None 
     Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1 
     Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 
     Present_In=6.0 IN=6.0 Script=Tamil SC=Taml Script=Taml Sentence_Break=LE Sentence_Break=OLetter SB=LE Word_Break=ALetter WB=LE 
     Word_Break=LE 
U+0B84 ‹U+0B84› \N{U+0B84} 
    \pC \p{Cn} 
    All Any InTamil C Other Cn Unassigned Zzzz Unknown 
    Age=Unassigned Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered 
     CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX 
     Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group 
     JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=Unknown LB=XX Line_Break=XX Numeric_Type=None NT=None 
     Numeric_Value=NaN NV=NaN Present_In=Unassigned IN=Unassigned Script=Unknown SC=Zzzz Script=Zzzz Sentence_Break=Other SB=XX 
     Sentence_Break=XX Word_Break=Other WB=XX Word_Break=XX 
U+0B85 ‹அ› \N{TAMIL LETTER A} 
    \w \pL \p{L_} \p{Lo} 
    All Any Alnum Alpha Alphabetic Assigned InTamil Tamil Is_Tamil L Lo Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter 
     L_ Other_Letter Print Taml Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word 
    Age=1.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR 
     Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX 
     Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group 
     JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None 
     Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1 
     Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 
     Present_In=6.0 IN=6.0 Script=Tamil SC=Taml Script=Taml Sentence_Break=LE Sentence_Break=OLetter SB=LE Word_Break=ALetter WB=LE 
     Word_Break=LE 

If you look at other allocated blocks, you'll see the same sort of thing. It doesn't make sense to break a block up with unrelated things.
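The idea that the Block property belongs to a range, whether or not a given slot is assigned, can be sketched as follows. Python's `unicodedata` module has no block lookup, so the `BLOCKS` table here is a hypothetical two-entry stand-in, with ranges taken from `Blocks.txt`, and `block_of` and `describe` are names invented for this example.

```python
import unicodedata

# Hypothetical mini-table for illustration only; the ranges come from
# the Unicode Blocks.txt data file. A real tool would load the full file.
BLOCKS = [
    (0x0B80, 0x0BFF, "Tamil"),
    (0x0C80, 0x0CFF, "Kannada"),
]


def block_of(cp):
    """Return the block name for cp, or None if it is not in the table."""
    for first, last, name in BLOCKS:
        if first <= cp <= last:
            return name
    return None


def describe(cp):
    """Return (block name, general category) for a code point."""
    return block_of(cp), unicodedata.category(chr(cp))


if __name__ == "__main__":
    for cp in (0x0B80, 0x0B82, 0x0B84, 0x0B85, 0x0C85):
        block, category = describe(cp)
        print(f"U+{cp:04X}  Block={block}  GC={category}")
```

The block name is decided purely by range, while the general category says whether anything actually lives at that code point, which is the same split the `uniprops` output shows with `Block=Tamil` next to `Cn`.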

As I said, it's not as though they're going to run out of space, so I don't see what the concern is here.

By the way, you can get Unicode exploration and processing tools such as unichars, uniprops, and uninames from my Unicode Command-Line Toolchest, either individually from there or as a whole suite via CPAN's Unicode::Tussle.
