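Invalid Unicode characters in XML attribute names and tag names

What is the list of Unicode characters that are invalid in XML attribute names and tag names?

The following Python 3 code illustrates the problem: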

import xml.etree.ElementTree as ET 
from io import StringIO as sio 

xml_dec = '<?xml version="1.1" encoding="UTF-8"?>' 
unicode_text = '<root>textº</root>' 
valid_unicode = '<标签 属性="值">文字</标签>' 
invalid_unicode_attribute = '<tag attributeº="value">text</tag>' 
invalid_unicode_tag = '<tagº>text</tagº>' 

ET.parse(sio(xml_dec + unicode_text)) 
# works 

ET.parse(sio(xml_dec + valid_unicode)) 
# works 

ET.parse(sio(xml_dec + invalid_unicode_attribute)) 
# ParseError 

ET.parse(sio(xml_dec + invalid_unicode_tag)) 
# ParseError

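The Unicode character º (U+00BA) parses fine when it appears in an element's text, but not when it appears in an attribute name or a tag name. Other Unicode characters, such as Chinese characters, parse fine in attribute names and tag names.

I checked the XML <?xml version="1.1" encoding="UTF-8"?><tagº>text</tagº> at https://validator.w3.org/check, which reported this error: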

Line 1, Column 43: character "º" not allowed in attribute specification list

However, XML Recommendation 1.1, §2.2 Characters, says the character is allowed:

Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

My question is: where can I find the list of Unicode characters that are invalid in XML attribute names and tag names?


2017-04-06 azalea

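Is this about attribute names or tag names? The title and the last sentence talk about attributes, but the examples are only about text and tags. – lenz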

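Either way, you only need to scroll a little in the document you linked yourself. For example, [here](https://www.w3.org/TR/xml11/#NT-NameStartChar) is the definition of which characters you can use in tag names. – lenz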

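You will find it easier to get answers to questions like this once you have the terminology down. Here is an example of a tag: ''. It contains two names (an element name and an attribute name) plus various other things, including an attribute value, whitespace, an equals sign, apostrophes, and so on. I think your question is not about which characters are allowed in a tag, but about which characters are allowed in element names and attribute names. –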

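On what is allowed in tag names and attribute names, the W3C recommendation (the one you linked yourself, although you were looking at the definition of what can be used in text nodes) states: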

Almost all characters are permitted in names, except those which either are or reasonably could be used as delimiters.

and

Document authors are encouraged to use names which are meaningful words or combinations of words in natural languages, and to avoid symbolic or white space characters in names. Note that COLON, HYPHEN-MINUS, FULL STOP (period), LOW LINE (underscore), and MIDDLE DOT are explicitly permitted.

The ASCII symbols and punctuation marks, along with a fairly large group of Unicode symbol characters, are excluded from names because they are more useful as delimiters in contexts where XML names are used outside XML documents; providing this group gives those contexts hard guarantees about what cannot be part of an XML name.

This is followed by a formal definition that lists a number of Unicode ranges:

NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | 
        [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | 
        [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | 
        [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | 
        [#x10000-#xEFFFF] 
NameChar  ::= NameStartChar | "-" | "." | [0-9] | #xB7 | 
        [#x0300-#x036F] | [#x203F-#x2040] 
Name   ::= NameStartChar (NameChar)*

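The masculine ordinal indicator º (#xBA) is not among them, for whatever reason (at least some languages use it in abbreviations of common words, so it does not look much like a "delimiter").

It is also interesting that digits, hyphens, and periods are allowed in tag names, but not as the first character.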
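
To check a name against the productions quoted above without invoking a parser, the ranges can be transcribed into a small lookup. This is only a sketch of the XML 1.1 Name rule as quoted here; the helper names (is_name_start_char, is_name_char, is_xml_name) are made up for illustration and are not part of xml.etree or any other library.

```python
# Code point ranges transcribed from the NameStartChar production quoted above.
NAME_START_RANGES = [
    (0x3A, 0x3A),        # ":"
    (0x41, 0x5A),        # A-Z
    (0x5F, 0x5F),        # "_"
    (0x61, 0x7A),        # a-z
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]

# Additional ranges that NameChar allows on top of NameStartChar.
NAME_EXTRA_RANGES = [
    (0x2D, 0x2E),        # "-" and "."
    (0x30, 0x39),        # 0-9
    (0xB7, 0xB7),        # MIDDLE DOT
    (0x300, 0x36F),
    (0x203F, 0x2040),
]


def _in_ranges(ch, ranges):
    """Return True if the code point of ch falls inside any (low, high) range."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in ranges)


def is_name_start_char(ch):
    return _in_ranges(ch, NAME_START_RANGES)


def is_name_char(ch):
    return is_name_start_char(ch) or _in_ranges(ch, NAME_EXTRA_RANGES)


def is_xml_name(name):
    """Name ::= NameStartChar (NameChar)*"""
    return (
        bool(name)
        and is_name_start_char(name[0])
        and all(is_name_char(ch) for ch in name[1:])
    )


if __name__ == "__main__":
    for candidate in ["tag", "tagº", "标签", "attributeº", "1tag", "tag-1.a"]:
        print(repr(candidate), is_xml_name(candidate))
```

With these ranges, "tagº" and "attributeº" are rejected because U+00BA falls in the gap between #xB7 and #xC0, while "标签" is accepted because both characters lie in [#x3001-#xD7FF].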


2017-04-06 19:49:58 lenz
