UnicodeEncodeError when fetching a URL

I'm using urlfetch to fetch a URL. When I pass the result to the html2text function (which strips all the HTML tags), I get the following error:

UnicodeEncodeError: 'charmap' codec can't encode characters in position ... character maps to <undefined>

I've been trying to call encode('utf-8', 'ignore') on the string, but I keep getting this error.

Any ideas?

Thanks,

Joel

Some code:

result = urlfetch.fetch(url="http://www.google.com") 
html2text(result.content.encode('utf-8', 'ignore'))

And the error message:

File "C:\Python26\lib\encodings\cp1252.py", line 12, in encode 
return codecs.charmap_encode(input,errors,encoding_table) 
UnicodeEncodeError: 'charmap' codec can't encode characters in position 159-165: character maps to <undefined>


2010-09-12 Joel

Please add "content_type = result.headers.getheader('Content-Type'); print(content_type)" to your code (after "result = urlfetch.fetch(...)") and tell us the result. – unutbu 2010-09-12 17:01:32

The output is: "windows-1255". I tried switching to html2text(result.content.decode('windows-1255', 'ignore')), but I still get "UnicodeEncodeError: 'charmap' codec can't encode characters in position 2-8: character maps to <undefined>" – Joel 2010-09-12 17:14:34

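You need to decode the data you fetched first! With which codec? That depends on the site you are fetching.

Once you have unicode and encode it with some_unicode.encode('utf-8', 'ignore'), I can't imagine how that could throw an error.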
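One possible explanation, offered as an assumption since the thread does not confirm it: the traceback passes through cp1252.py, so the failing encode may be an implicit one to the Windows console code page (for example when the text is printed), not the explicit UTF-8 encode. The sketch below, written for Python 3, uses a made-up Hebrew sample string to show that the UTF-8 encode succeeds while a cp1252 encode raises the same 'charmap' error.

```python
# Hebrew sample text standing in for a page served as windows-1255.
# The sample string is an illustration, not data from the thread.
raw = u"\u05e9\u05dc\u05d5\u05dd".encode("windows-1255")  # bytes as fetched

text = raw.decode("windows-1255")   # bytes -> unicode text
utf8_bytes = text.encode("utf-8")   # UTF-8 can represent every character

try:
    text.encode("cp1252")           # cp1252 has no Hebrew letters
    cp1252_failed = False
    error_message = ""
except UnicodeEncodeError as exc:
    cp1252_failed = True
    error_message = str(exc)        # mentions the 'charmap' codec

# errors="replace" substitutes "?" for each unencodable character
safe = text.encode("cp1252", "replace")
```

If this is what is happening, encoding explicitly with an error handler before writing to the console avoids the exception, at the cost of losing the characters the console cannot show.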

OK, here is what you need to do:

result = urlfetch.fetch(url='http://google.com')
content_type = result.headers['Content-Type']  # figure out what you just fetched
ctype, charset = content_type.split(';')  # assumes exactly one parameter, e.g. 'text/html; charset=ISO-8859-1'
encoding = charset[len(' charset='):]  # get the encoding
print encoding  # e.g. ISO-8859-1
utext = result.content.decode(encoding)  # now you have unicode
text = utext.encode('utf8', 'ignore')  # encode to UTF-8

This is not really robust, but it should point you in the right direction.
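A slightly more defensive variant of the header parsing might look like the sketch below. It is written for Python 3, and the helper names charset_from_content_type and decode_body are hypothetical, not part of urlfetch or html2text. It tolerates a missing charset, extra parameters, quoted values, and codec names Python does not know.

```python
def charset_from_content_type(content_type, default=None):
    """Return the charset parameter of a Content-Type value, or default."""
    if not content_type:
        return default
    # Parameters follow the media type, separated by ';'
    for param in content_type.split(";")[1:]:
        name, sep, value = param.strip().partition("=")
        if sep and name.strip().lower() == "charset":
            return value.strip().strip("\"'") or default
    return default


def decode_body(body, content_type, fallback="utf-8"):
    """Decode fetched bytes using the header charset, else the fallback."""
    encoding = charset_from_content_type(content_type, fallback)
    try:
        return body.decode(encoding, "replace")
    except LookupError:  # the header named a codec Python does not know
        return body.decode(fallback, "replace")
```

With the names from the question, usage would be roughly decode_body(result.content, result.headers.get('Content-Type')), assuming the headers object supports get.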


2010-09-12 16:35:10

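Sorry, I meant decode.. my mistake! – Joel 2010-09-12 16:39:02

How do I know which codec I need to use? For google.com, say. – Joel 2010-09-12 16:43:46

@Joel: The codec you need for decoding is given in the HTTP headers or in an HTML meta tag (or it is not specified at all, in which case you have to guess). Google is a bad example, because the site you get depends on where you live :p – 2010-09-12 16:47:37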
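The lookup order described in that comment (HTTP header, then meta tag, then a guess) could be sketched as follows for Python 3. The function names sniff_meta_charset and decode_html, the 4096-byte scan limit, and the windows-1252 last resort are all assumptions made for illustration; the regular expression is a rough heuristic, not a full HTML parser.

```python
import re

# Matches both <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">
_META_CHARSET = re.compile(
    br"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)


def sniff_meta_charset(body, limit=4096):
    """Look for a charset declaration near the start of the raw HTML bytes."""
    match = _META_CHARSET.search(body[:limit])
    return match.group(1).decode("ascii") if match else None


def decode_html(body, header_charset=None, last_resort="windows-1252"):
    """Try the header charset, then the meta tag, then UTF-8, then guess."""
    for encoding in (header_charset, sniff_meta_charset(body), "utf-8"):
        if not encoding:
            continue
        try:
            return body.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec name or wrong codec, try the next one
    # Nothing worked: decode leniently so the caller still gets text
    return body.decode(last_resort, "replace")
```

The strict decode attempts come first so that a wrong declaration is noticed when possible; only the final guess uses "replace", since at that point there is no better information to go on.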
