Python: I call .decode() and get "'ascii' codec can't encode"

It seems I was using the wrong function. With .fromstring there is no error message.

xml_ = load() # here comes the unicode string with Cyrillic letters 

print xml_ # prints everything fine 

print type(xml_) # 'lxml.etree._ElementUnicodeResult' = unicode 

xml = xml_.decode('utf-8') # here is an error 

doc = lxml.etree.parse(xml) # if I do not decode it - the same error appears here 

File "testLog.py", line 48, in <module> 
    xml = xml_.decode('utf-8') 
    File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode 
    return codecs.utf_8_decode(input, errors, True) 
UnicodeEncodeError: 'ascii' codec can't encode characters in position 89-96: ordinal not in range(128)

If I use

xml = xml_.encode('utf-8') 

doc = lxml.etree.parse(xml) # here's an error

or

xml = xml_

then I get

UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 89: ordinal not in range(128)

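Do I understand correctly that I must decode a non-ASCII string into the internal representation, work with that representation, and encode it again before sending it to output? It seems that is exactly what I am doing.

Because of the 'Accept-Charset': 'utf-8' header, the input data must be in UTF-8.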


2012-07-08 Ben Usman

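Is the error on the etree.parse() call still about character encoding? What is the type of xml? etree.parse does not parse XML from str or unicode objects. Try etree.fromstring() instead. – hasanyasin 2012-07-08 18:06:18

@hasanyasin, looks like you are right. :) – 2012-07-08 18:08:24

I will write a proper answer covering both issues, which I hope you will accept as the correct one. :) – hasanyasin 2012-07-08 18:09:19

For me, using the .fromstring() method was what was needed.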


2014-03-18 20:15:01

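If your original string is unicode, it only makes sense to encode it to utf-8, not to decode it from utf-8.

I think the xml parser can only handle xml that is ascii.

So use xml = xml_.encode('ascii','xmlcharrefreplace') to convert the unicode characters that are not in ascii into xml entities.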
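As a rough illustration of that suggestion, the sketch below encodes a short sample to ASCII with the xmlcharrefreplace error handler. The sample document and the variable names are made up for the example, and the standard library's xml.etree.ElementTree is used only as a stand-in parser to show that the character reference is turned back into the original letter.

```python
import xml.etree.ElementTree as ET  # stand-in parser, assumed here for illustration

# Hypothetical sample; \u0416 is the Cyrillic capital letter Zhe.
text = u'<name>\u0416</name>'

# Characters outside ASCII become numeric character references.
ascii_xml = text.encode('ascii', 'xmlcharrefreplace')
print(ascii_xml)

# An XML parser resolves the reference back to the original character.
parsed_text = ET.fromstring(ascii_xml).text
print(parsed_text == u'\u0416')
```

The result is pure ASCII bytes, so it survives any implicit ascii conversion, at the cost of a less readable document.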


2012-07-08 17:56:53

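Then the same error appears one line lower. – 2012-07-08 17:58:23

I see now. Please look at the edited question. – 2012-07-08 18:05:30

@hasanyasin: I am encoding the unicode string to bytes in the ascii encoding. That is entirely possible. The Cyrillic characters are translated into xml entities. For example, 'Ж' becomes '&#1046;'. – 2012-07-08 18:32:43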

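The lxml library already gives you a unicode object. You are running into Python 2's automatic unicode/bytes conversion. The hint is that you asked it to decode, but you got an encode error. It tries to convert your string to bytes with the default encoding (ascii) and then decode those bytes back to unicode.

Use the .encode method on the unicode object to convert it to bytes (the str type).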
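The implicit step can be spelled out explicitly. The sketch below is only an approximation of what Python 2 does behind unicode.decode('utf-8'): it encodes with ascii first and then decodes. The sample text is hypothetical, and the code is written so that it also runs on Python 3, where the implicit conversion no longer exists.

```python
# Hypothetical Cyrillic sample text (six letters).
text = u'\u0416\u0443\u0440\u043d\u0430\u043b'

# Roughly what Python 2 does for text.decode('utf-8'):
# encode with the default codec (ascii) first, then decode.
try:
    text.encode('ascii').decode('utf-8')
    error_name = None
except UnicodeEncodeError as exc:
    error_name = type(exc).__name__

print(error_name)

# The explicit, working direction: unicode text -> UTF-8 bytes.
data = text.encode('utf-8')
print(len(text), len(data))
```

The first byte of data is 0xd0, which is the byte named in the UnicodeDecodeError from the question.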

Watching this will teach you a lot about how to deal with this kind of problem: http://nedbatchelder.com/text/unipain.html


2012-07-08 17:58:14 Daenyth

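I assume you are trying to parse some website?

Did you validate that the website is correct? Maybe its encoding is wrong?

Many websites are broken and rely on web browsers having very robust parsers. You could try a parser that is just as robust.

There is a de facto web standard that the "charset" given in the HTTP header (which may involve negotiation and relates to the Accept-Charset header you mentioned) is overruled by any <meta http-equiv=... tag in the HTML file!

So you may simply not have UTF-8 input!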


2012-07-08 18:06:54

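Strings and unicode objects have different types and different in-memory representations of their content. Unicode is the decoded form of text, while a string is the encoded form.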

# -*- coding: utf-8 -*-

# Now, my string literals in this source file will 
# be str objects encoded in utf-8. 

# In Python3, they will be unicode objects. 
# Below examples show the Python2 way. 

s = 'ş' 
print type(s) # prints <type 'str'> 

u = s.decode('utf-8') 
# Here, we create a unicode object from a string 
# which was encoded in utf-8. 

print type(u) # prints <type 'unicode'>

As you can see,

.encode() --> str 
.decode() --> unicode

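When we encode or decode a string, we need to make sure that our text is covered by the source/target encoding. A string encoded in iso-8859-1 cannot be decoded correctly with iso-8859-9.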
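A small sketch of that point, using the same letter as the example above. The variable names are made up for illustration; the code uses a unicode escape instead of a literal so it does not depend on the source file's encoding, and it runs on Python 2 and 3.

```python
text = u'\u015f'  # the letter s with cedilla

# iso-8859-9 (Turkish) covers this letter, so the round trip works.
encoded = text.encode('iso-8859-9')
round_trip = encoded.decode('iso-8859-9')

# Decoding the same byte with iso-8859-1 silently yields a different character.
wrong = encoded.decode('iso-8859-1')

# iso-8859-1 does not cover the letter at all, so encoding fails.
try:
    text.encode('iso-8859-1')
    covered = True
except UnicodeEncodeError:
    covered = False

print(round_trip == text, wrong == text, covered)
```

Decoding with the wrong codec raises no error here, which is why this kind of mismatch tends to show up as garbled text rather than as an exception.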

As for the second error reported in the question, lxml.etree.parse() works on file-like objects. To parse from a string, use lxml.etree.fromstring().
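The difference between the two calls can be sketched as follows. To keep the example runnable without lxml installed, it uses the standard library's xml.etree.ElementTree as a stand-in, which also offers parse() and fromstring(); with lxml the import would be lxml.etree instead. The sample document is hypothetical.

```python
import io
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch

# Hypothetical sample document with Cyrillic text, held as unicode.
xml_text = u'<log><entry>\u0416\u0443\u0440\u043d\u0430\u043b</entry></log>'

# Encode explicitly before handing the data to the parser.
xml_bytes = xml_text.encode('utf-8')

# fromstring() parses XML content held in memory and returns the root element.
root = ET.fromstring(xml_bytes)

# parse() expects a file name or a file-like object, so wrap the bytes.
tree = ET.parse(io.BytesIO(xml_bytes))

entry_from_string = root.find('entry').text
entry_from_file = tree.getroot().find('entry').text
print(entry_from_string == entry_from_file)
```

Passing the XML text itself to parse() makes the parser treat it as a file name, which is one way to end up with the confusing errors shown in the question.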


2012-07-08 18:21:17 hasanyasin
