python
  • xml
  • unicode
  • utf-8
  • 2017-04-06 57 views 1 likes 
    1

    What is the list of invalid Unicode characters in XML attribute and tag names?

    The following Python 3 code illustrates the problem:

    import xml.etree.ElementTree as ET 
    from io import StringIO as sio 
    
    xml_dec = '<?xml version="1.1" encoding="UTF-8"?>' 
    unicode_text = '<root>textº</root>' 
    valid_unicode = '<标签 属性="值">文字</标签>' 
    invalid_unicode_attribute = '<tag attributeº="value">text</tag>' 
    invalid_unicode_tag = '<tagº>text</tagº>' 
    
    ET.parse(sio(xml_dec + unicode_text)) 
    # works 
    
    ET.parse(sio(xml_dec + valid_unicode)) 
    # works 
    
    ET.parse(sio(xml_dec + invalid_unicode_attribute)) 
    # ParseError 
    
    ET.parse(sio(xml_dec + invalid_unicode_tag)) 
    # ParseError 
    

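    The Unicode character º (U+00BA) can be parsed when it appears in an element's text, but not when it appears in an attribute name or a tag name. Other Unicode characters, such as Chinese characters, can be parsed in attribute names and tag names.

    I checked the XML <?xml version="1.1" encoding="UTF-8"?><tagº>text</tagº> at https://validator.w3.org/check, and it gave this error: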

    Line 1, Column 43: character "º" not allowed in attribute specification list

    However, XML Recommendation 1.1, §2.2 Characters, says that the character is allowed:

    Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

    My question is: where can I find the list of Unicode characters that are invalid in XML attribute and tag names?

    +0

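    Is this about attributes or about tag names? The title and the last sentence talk about attributes, but the examples are only about text and tags. – lenz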

    +2

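    Anyway, you only need to scroll down a bit in the document you linked yourself. For example, [here](https://www.w3.org/TR/xml11/#NT-NameStartChar) is the definition of which characters you can use in tag names. – lenz

    Answers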

    2

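    The W3C recommendation says the following about what is allowed in tag names and attribute names (you linked to it yourself, but you were looking at the definition of what can be used in text nodes):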

    Almost all characters are permitted in names, except those which either are or reasonably could be used as delimiters.

    Document authors are encouraged to use names which are meaningful words or combinations of words in natural languages, and to avoid symbolic or white space characters in names. Note that COLON, HYPHEN-MINUS, FULL STOP (period), LOW LINE (underscore), and MIDDLE DOT are explicitly permitted.

    The ASCII symbols and punctuation marks, along with a fairly large group of Unicode symbol characters, are excluded from names because they are more useful as delimiters in contexts where XML names are used outside XML documents; providing this group gives those contexts hard guarantees about what cannot be part of an XML name.

    This is followed by a formal definition, which lists a number of Unicode ranges:

    NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | 
            [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | 
            [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | 
            [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | 
            [#x10000-#xEFFFF] 
    NameChar  ::= NameStartChar | "-" | "." | [0-9] | #xB7 | 
            [#x0300-#x036F] | [#x203F-#x2040] 
    Name   ::= NameStartChar (NameChar)* 
    

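    The masculine ordinal indicator º (#xBA) is not among them, for whatever reason (at least some languages use it in abbreviations of common words, so it does not look like a "delimiter").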
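
    The ranges above can be turned into a small checker. This is only a sketch of the XML 1.1 grammar quoted here: the names `NAME_START_RANGES`, `NAME_EXTRA_RANGES`, `is_name_start_char`, `is_name_char` and `is_xml_name` are made up for illustration and are not part of any library, and a real parser may apply different (for example XML 1.0) rules.

```python
# Code point ranges copied from the NameStartChar production quoted above.
NAME_START_RANGES = [
    (0x3A, 0x3A),      # ":"
    (0x41, 0x5A),      # A-Z
    (0x5F, 0x5F),      # "_"
    (0x61, 0x7A),      # a-z
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF),
    (0x200C, 0x200D), (0x2070, 0x218F), (0x2C00, 0x2FEF),
    (0x3001, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFFD),
    (0x10000, 0xEFFFF),
]

# Additional ranges that NameChar allows on top of NameStartChar.
NAME_EXTRA_RANGES = [
    (0x2D, 0x2E),      # "-" and "."
    (0x30, 0x39),      # 0-9
    (0xB7, 0xB7),      # MIDDLE DOT
    (0x300, 0x36F),
    (0x203F, 0x2040),
]


def _in_ranges(ch, ranges):
    """Return True if the single character ch falls in any (low, high) range."""
    code = ord(ch)
    return any(low <= code <= high for low, high in ranges)


def is_name_start_char(ch):
    return _in_ranges(ch, NAME_START_RANGES)


def is_name_char(ch):
    return is_name_start_char(ch) or _in_ranges(ch, NAME_EXTRA_RANGES)


def is_xml_name(name):
    """Check a string against Name ::= NameStartChar (NameChar)*."""
    if not name:
        return False
    return is_name_start_char(name[0]) and all(is_name_char(c) for c in name[1:])


for candidate in ["tag", "tagº", "标签", "属性", "attributeº"]:
    print(repr(candidate), is_xml_name(candidate))
```

    With this sketch, the names containing º are rejected because U+00BA falls in the gap between the ASCII ranges and #xC0, while the Chinese names fall inside [#x3001-#xD7FF].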

    It is also interesting to see that you can use digits, hyphens and periods in tag names, but not as the first character.
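
    The first-character rule can be tried out with the same `xml.etree.ElementTree` module used in the question. The helper `parses` below is a hypothetical name introduced only for this illustration; it reports whether a document parses, without saying why it failed.

```python
import xml.etree.ElementTree as ET


def parses(xml_text):
    """Return True if ElementTree can parse xml_text, False on a ParseError."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Digits, hyphens and periods after the first character versus in first position.
tag_names = ["a1", "a-b", "a.b", "1a", "-a", ".a"]
results = {name: parses(f"<{name}>text</{name}>") for name in tag_names}

for name, ok in results.items():
    print(f"<{name}>: {'parses' if ok else 'ParseError'}")
```

    The first three names should parse and the last three should not, which matches the split between NameStartChar and NameChar in the grammar.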
