'utf8' codec can't decode byte 0x96 in Python

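I am trying to check whether a certain word appears on the pages of many websites. The script runs fine for, say, 15 sites and then stops with this error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 15344: invalid start byte

I searched Stack Overflow and found many similar questions, but I can't work out what went wrong in my case.

I would like either to fix the error or, if a site raises it, to skip that site. Please explain how to do this: I am new to Python, and the code below took me a day to write. By the way, the site on which the script halted is http://www.homestead.com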

import re
import urllib

filetocheck = open("bloglistforcommenting", "r")
resultfile = open("finalfile", "w")

for countofsites in filetocheck.readlines():
    sitename = countofsites.strip()
    htmlfile = urllib.urlopen(sitename)
    # This decode raises UnicodeDecodeError when the page is not valid UTF-8
    page = htmlfile.read().decode('utf8')
    match = re.search("Enter your name", page)
    if match:
        print "match found : " + sitename
        resultfile.write(sitename + "\n")
    else:
        print "sorry did not find the pattern " + sitename

print "Finished Operations"

Following Mark's suggestion, I changed the code to use BeautifulSoup:

htmlfile = urllib.urlopen("http://www.homestead.com") 
page = BeautifulSoup((''.join(htmlfile))) 
print page.prettify()

Now I get this error:

page = BeautifulSoup((''.join(htmlfile))) 
TypeError: 'module' object is not callable

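I am working from the Quick Start example at http://www.crummy.com/software/BeautifulSoup/documentation.html#Quick%20Start. If I copy and paste that example, the code works fine.

I finally got it to work. Thanks, everyone, for your help. Here is the final code.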

import urllib 
import re 
from BeautifulSoup import BeautifulSoup 

filetocheck = open("listfile","r") 

resultfile = open("finalfile","w") 
error ="for errors" 

for countofsites in filetocheck.readlines():
    sitename = countofsites.strip()
    htmlfile = urllib.urlopen(sitename)
    # BeautifulSoup detects the encoding, so no explicit decode is needed
    page = BeautifulSoup(''.join(htmlfile))
    pagetwo = str(page)
    match = re.search("Enter YourName", pagetwo)
    if match:
        print "match found : " + sitename
        resultfile.write(sitename + "\n")
    else:
        print "sorry did not find the pattern " + sitename

print "Finished Operations"


2011-10-24 Vishal Khialani

Many web pages are encoded incorrectly. For parsing HTML, try BeautifulSoup, since it can handle many kinds of bad HTML found in the wild.

Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:

Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.

Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.

Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.

Emphasis mine.


2011-10-24 09:29:44

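I would rather skip the site. Can I do something like decode('utf8', somecodeforerrortoskip)?

user976847: There are many other advantages to using BeautifulSoup. I think you should give it a try.

I'll take a look at it, thanks.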

The site 'http://www.homestead.com' doesn't claim to be sending you UTF-8; the response actually claims to be ISO-8859-1:

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

You must use the correct encoding for the page you actually received, rather than guessing at random.


2011-10-24 09:35:25 Duncan

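The thing is, I have a huge list of sites, and this is just the first of many errors. If I hit a decoding error, what is the best way to skip the site?

'charset = ISO-8859-1' is the web equivalent of "the check is in the mail".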

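The byte at position 15344 is 0x96. Presumably position 15343 holds either a single-byte encoding of a character or the last byte of a multi-byte encoding, which makes 15344 the start of a character. 0x96 is 10010110 in binary, and any byte matching the pattern 10XXXXXX (0x80 to 0xBF) can only be a second or subsequent byte in a UTF-8 encoding.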
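The byte-level argument above can be checked with a minimal sketch. It is written for Python 3 (unlike the Python 2 code in the question), and the helper name `is_utf8_continuation_byte` and the sample bytes are made up for illustration:

```python
def is_utf8_continuation_byte(byte):
    """Return True if the byte has the bit pattern 10XXXXXX (0x80 to 0xBF)."""
    return (byte & 0b11000000) == 0b10000000


# Hypothetical sample: ASCII text with a stray 0x96 where a character should start.
sample = b"abc\x96def"

print(format(0x96, "08b"))              # bit pattern of the offending byte
print(is_utf8_continuation_byte(0x96))  # it can only continue a character

try:
    sample.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    # The exception records where decoding failed and why.
    decode_error = exc
    print(exc.reason, "at position", exc.start)
```

The `start` attribute of the exception plays the same role as the position 15344 reported in the question.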

Hence the stream is either not UTF-8 or is corrupted.

Examining the URI you link to, we find the header:

Content-Type: text/html

Since no encoding is declared, we should use HTTP's default, which is ISO-8859-1 (a.k.a. "Latin 1").

Examining the content, we find the line:
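As a rough sketch of that rule in Python 3, the standard library's `email.message.Message` can parse the charset parameter out of a header value. The function name `charset_from_content_type` is hypothetical, and the ISO-8859-1 default simply follows the statement above:

```python
from email.message import Message


def charset_from_content_type(header_value, default="iso-8859-1"):
    """Return the charset named in a Content-Type value, or the default."""
    message = Message()
    message["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return message.get_content_charset() or default


print(charset_from_content_type("text/html"))
print(charset_from_content_type("text/html; charset=UTF-8"))
```

With a live response the header value would come from the HTTP response object rather than from a literal string.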

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

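This is a fallback mechanism for people who are, for some reason, unable to set their HTTP headers correctly. This time we are explicitly told that the character encoding is ISO-8859-1.

Hence there is no reason to expect reading it as UTF-8 to work.

For extra fun, though: in ISO-8859-1, 0x96 encodes U+0096, the control character "START OF GUARDED AREA", so ISO-8859-1 isn't correct either. It seems the people who created the page made a similar error to yours.

From context, it would seem that they actually used Windows-1252, since in that encoding 0x96 encodes U+2013 (EN DASH, which looks like –).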
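The difference between the two encodings is easy to see by decoding the single byte both ways. This is a small Python 3 sketch; the variable names are illustrative only:

```python
import unicodedata

raw = b"\x96"

as_latin1 = raw.decode("iso-8859-1")    # every byte maps to the same code point
as_cp1252 = raw.decode("windows-1252")  # 0x80 to 0x9F mostly map to punctuation

# U+0096 is a control character (category "Cc"), so it is not printable text.
print(hex(ord(as_latin1)), unicodedata.category(as_latin1))
# U+2013 is the en dash the page author presumably intended.
print(hex(ord(as_cp1252)), unicodedata.name(as_cp1252))
```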

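So, to parse this particular page, you want to decode it as Windows-1252. More generally, you want to examine the headers when picking a character encoding. The header may be incorrect in this case (or perhaps not: more than a few "ISO-8859-1" codecs are actually Windows-1252), but you will be right more often than not. You still need something that catches failures like this one by reading with a fallback.

The decode method takes a second parameter called errors. The default is 'strict', but you can also use 'ignore', 'replace', 'xmlcharrefreplace' (not appropriate here), 'backslashreplace' (not appropriate here), and you can register your own fallback handler with codecs.register_error().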
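One way to combine these ideas is sketched below for Python 3. The function names `decode_page` and `contains_phrase`, the order of candidate encodings, and the sample bytes are all assumptions made for illustration, not something the answer prescribes:

```python
def decode_page(raw, declared_encoding=None):
    """Try candidate encodings strictly, then fall back to lossy decoding."""
    for encoding in (declared_encoding, "utf-8", "windows-1252"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown encoding name: try the next candidate.
            continue
    # Last resort: undecodable bytes become U+FFFD instead of raising.
    return raw.decode("utf-8", errors="replace")


def contains_phrase(raw, phrase, encoding="utf-8"):
    """Return True or False, or None to signal that the page should be skipped."""
    try:
        return phrase in raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# Hypothetical page body containing the Windows-1252 en dash byte.
page = b"Enter your name \x96 thanks"

print(repr(decode_page(page)))
print(repr(page.decode("utf-8", errors="ignore")))
print(repr(page.decode("utf-8", errors="replace")))
print(contains_phrase(page, "Enter your name"))
```

For a simple substring search, 'ignore' or 'replace' is usually enough, because the phrase being searched for is plain ASCII. Returning None from `contains_phrase` corresponds to skipping the site, which is what the question asked for.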


2011-10-24 09:58:35

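To fix Windows-1252 content embedded in UTF-8, you can use `bs4.UnicodeDammit.detwingle()` (http://www.crummy.com/software/BeautifulSoup/bs4/doc/#inconsistent-encodings) – jfs

An in-depth answer that explains what the error (almost certainly) is. Unfortunately, these things are impossible to understand without working at the byte level, and of course many people aren't prepared to do that. Thanks for going the extra mile :-) – Forbesmyester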
