XMLParser in Pharo claims U+00A0 is "Invalid UTF-8"

Given the input:

<?xml version='1.0' encoding='UTF-8' standalone='yes' ?> 
<sms body=". what" />

where the character after the period in the body attribute of the sms tag is U+00A0,

I get the error:

XMLEncodingException: Invalid UTF-8 character encoding (line 2) (column 13)

IIUC, the UTF-8 representation of that character is 0xC2 0xA0, per Wikipedia. Sure enough, bytes 72 and 73 of the input are 194 and 160 respectively.
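A quick way to double-check the byte-level claim outside the image is a small Python sketch like the one below. It is only an illustration: the helper name `nbsp_byte_offsets` and the one-line `sample` are made up for this example, and the sample omits the XML declaration, so its offset differs from the 72/73 reported above.

```python
def nbsp_byte_offsets(data: bytes) -> list:
    """Return the 1-based offsets of every UTF-8 encoded U+00A0 (bytes C2 A0)."""
    needle = "\u00a0".encode("utf-8")
    offsets = []
    start = data.find(needle)
    while start != -1:
        offsets.append(start + 1)  # 1-based, like Smalltalk indexing
        start = data.find(needle, start + len(needle))
    return offsets


# The UTF-8 encoding of U+00A0 is the two bytes 0xC2 0xA0, i.e. 194 and 160.
nbsp_bytes = list("\u00a0".encode("utf-8"))
print(nbsp_bytes)

# Hypothetical one-line sample: only the sms element, without the declaration.
sample = '<sms body=".\u00a0what" />'.encode("utf-8")
print(nbsp_byte_offsets(sample))
```

So the raw bytes in the file are valid UTF-8, which suggests the problem is in how they are read rather than in the file itself.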

This looks like a bug in XMLParser, or am I missing something?

Source

2016-07-28 Sean DeNigris

Can't reproduce: XMLDOMParser parse: '<?xml version=''1.0'' encoding=''UTF-8'' standalone=''yes'' ?> '' –

Thanks to monty for coming to the rescue on the Pharo User's list:

You're double decoding. Use onFileNamed:/parseFileNamed: instead (and the DOM printToFileNamed: family of messages when writing) and let XMLParser take care of this for you, or disable XMLParser decoding before parsing with #decodesCharacters:.

Longer explanation:

The class-side #on:/#parse: messages take either a string or a stream (read the definitions). You gave it a FileReference, but because the argument is tested with isString and sent #readStream otherwise, it didn't blow up then.

File refs sent #readStream return file streams that do automatic decoding. But XMLParser automatically attempts its own decoding too, if:

The input starts with a BOM or it can be inferred by null bytes before or after the first non-null byte.

There is an encoding declaration with a non-UTF-8 encoding.

There is a UTF-8 encoding declaration but the stream is not a normal ReadStream (your case).

So it gets decoded twice, and the decoded value of the char causes the error. I'll consider changing the heuristic to make it less eager to decode.
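To see why a second decoding pass trips over this particular character, here is a rough Python model of the effect. It is not XMLParser's actual implementation: the function name `double_decode_error` is hypothetical, and the Latin-1 round trip is just an assumed stand-in for "treating already-decoded characters as if they were raw bytes".

```python
from typing import Optional


def double_decode_error(raw: bytes) -> Optional[UnicodeDecodeError]:
    """Decode UTF-8 once, then try to decode the result again.

    Returns the error raised by the second pass, or None if it succeeds.
    """
    once = raw.decode("utf-8")  # first pass: what the file stream already did
    # Assumption for this sketch: the second decoder sees each decoded
    # character as one byte, so U+00A0 becomes the lone byte 0xA0.
    as_bytes = once.encode("latin-1")
    try:
        as_bytes.decode("utf-8")  # second pass: the parser decoding again
    except UnicodeDecodeError as error:
        return error
    return None


line = '<sms body=".\u00a0what" />'.encode("utf-8")
error = double_decode_error(line)
if error is not None:
    # error.start is 0-based, so add 1 to get a column number.
    print("second decode failed at column", error.start + 1)
```

After the first pass the two bytes 0xC2 0xA0 have collapsed into the single value 0xA0, which is a continuation byte with no lead byte in front of it, so a second UTF-8 pass rejects it. Pure ASCII input survives double decoding unchanged, which is presumably why the problem only shows up with characters like U+00A0.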

Source

2016-08-08 12:45:20
