Download HTML without Python unicode errors

I want to download page_source to a file. However, every time I get:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 (or something else) in 
position 8304: ordinal not in range(128)

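I have tried value.encode('utf-8'), but it throws the same exception every time (I have also tried manually replacing every non-ASCII character). Is there a way to "preprocess" the HTML to put it into a "writable" format?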

2012-01-09 David542

What is the actual encoding of the file? – 2012-01-09 03:11:08

Use UTF8 _instead of_ ASCII. – SLaks 2012-01-09 03:15:09

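There are third-party libraries, such as BeautifulSoup and lxml, that can handle encoding issues automatically. But here is a bare-bones example using only urllib2: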

First, download a web page that contains non-ASCII characters:

>>> import urllib2 
>>> response = urllib2.urlopen('http://www.ltg.ed.ac.uk/~richard/unicode-sample.html') 
>>> data = response.read()

Now look at the "charset" declared near the top of the page:

>>> data[:200] 
'<html>\n<head>\n<title>Unicode 2.0 test page</title>\n<meta 
content="text/html; charset=UTF-8" http-equiv="Content-type"/>\n 
</head>\n<body>\n<p>This page contains characters from each of the 
Unicode\ncharact'

If there is no obvious charset, "UTF-8" is usually a good guess anyway.
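The "look for a charset, otherwise guess UTF-8" step can be sketched as a small helper. This is only an illustration: the name sniff_charset and the 1024-byte window are assumptions, the sample bytes are made up, and a real page may also declare its encoding in the HTTP Content-Type header, which this sketch ignores.

```python
import re

def sniff_charset(data, default='utf-8'):
    """Return the charset declared near the top of an HTML page, or a default."""
    # Search the raw bytes for charset=... as it appears inside a <meta> tag.
    match = re.search(br'charset=["\']?([A-Za-z0-9_.:-]+)', data[:1024], re.IGNORECASE)
    if match:
        return match.group(1).decode('ascii').lower()
    # Nothing declared: fall back to the guess suggested above.
    return default

# Made-up sample that mimics the header of the test page.
sample = b'<html>\n<head>\n<meta content="text/html; charset=UTF-8" http-equiv="Content-type"/>\n</head>'
declared = sniff_charset(sample)
fallback = sniff_charset(b'<html><body>no declaration here</body></html>')
```

The returned name can then be passed to data.decode() in place of a hard-coded 'utf-8'.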

Finally, convert the web page to unicode text:

>>> text = data.decode('utf-8')
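To finish the original task of saving the page to a file, one option is to let the file object do the encoding. The sketch below assumes io.open (available in Python 2.6+ and Python 3) and uses a made-up byte string in place of response.read(); the output path is hypothetical.

```python
import io
import os
import tempfile

# Stand-in for the bytes returned by response.read(); the content is made up.
data = b'<html><body><p>caf\xc3\xa9 \xc2\xa9 2012</p></body></html>'

# bytes -> unicode text, as in the step above.
text = data.decode('utf-8')

# io.open encodes the text as it is written, so no implicit ASCII conversion occurs.
path = os.path.join(tempfile.mkdtemp(), 'page.html')  # hypothetical output path
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Reading the file back as raw bytes returns the original UTF-8 data.
with io.open(path, 'rb') as f:
    saved = f.read()
```

If the page only needs to be stored and never inspected as text, writing the undecoded bytes with mode 'wb' would avoid the decode and encode steps altogether.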

2012-01-09 05:24:17 ekhumoro

Thanks, this solved my problem. When downloading a page with a basic Python script, I got an HTML page full of sequences like \xce\xbf\xb9. – 2016-12-12 21:38:54

I'm not sure, but http://www.crummy.com/software/BeautifulSoup/ has a function, .prettify(), that returns well-formatted HTML. You could try using it for "preprocessing".

2012-01-09 03:11:04

The problem may be that you are trying to go str -> utf-8 when you need to go str -> unicode -> utf-8. In other words, try unicode(s, 'utf-8').encode('utf-8').

For more information, see http://farmdev.com/talks/unicode/.
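The reason the str -> utf-8 shortcut fails is that Python 2, when .encode() is called on a byte string, first decodes it implicitly with the ASCII codec. The sketch below makes that hidden step explicit so it behaves the same on Python 2 and Python 3; the sample bytes are made up, and raw.decode('utf-8') is used as the equivalent of unicode(raw, 'utf-8').

```python
# UTF-8 bytes for a copyright sign followed by a year (made-up sample).
raw = b'\xc2\xa9 2012'

# This is the step Python 2 performs implicitly before encoding a byte string.
try:
    raw.decode('ascii')
    implicit_step_failed = False
except UnicodeDecodeError:
    # Byte 0xc2 is outside range(128), matching the error in the question.
    implicit_step_failed = True

# Decoding explicitly with the right codec first makes the round trip safe.
text = raw.decode('utf-8')
round_tripped = text.encode('utf-8')
```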

2012-01-09 03:29:08
