get_text() raises UnicodeEncodeError

I have the following HTML:

<div class="dialog"> 
<div class="title title-with-sort-row"> 
    <h2>Description</h2> 
    <div class="dialog-search-sort-bar"> 
    </div> 
</div> 
<div class="content"><div style="margin-right: 20px; margin-left: 30px;"> 
    <span class="description2"> 
     With “Antonia Polygon – Standard”, you have a figure that is unique in the Poser community. 
     She is made available under a Creative Commons License that gives endless opportunities for further development. 
     This figure was developed by a group of talented members of the Poser community in a thirty-month effort. 
     The result is a figure that has very good bending and morphing behavior. 
     <br /> 
    </span> 
</div> 
</div>

I need to find this div among a number of divs with class="dialog", then pull out the text inside the span with class="description2".

When I use this code:

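# Assumes `import re` and that `soup` is a BeautifulSoup object built from the page.
description = soup.find(text=re.compile('Description'))
if description is not None:
    # Climb from the text node: <h2>, then div.title, then div.dialog.
    someEl = description.parent
    parent1 = someEl.parent
    parent2 = parent1.parent
    description = parent2.find('span', {'class': 'description2'})
    print 'Description: ' + str(description)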

I get:

<span class="description2"> 
    With Â“Antonia Polygon Â– StandardÂ”, you have a figure that is unique in the Poser community. 
    She is made available under a Creative Commons License that gives endless opportunities for further development. 
    This figure was developed by a group of talented members of the Poser community in a thirty-month effort. 
    The result is a figure that has very good bending and morphing behavior. 
    <br/> 
</span>

If I try to get just the text, without the HTML and the non-ASCII characters, using

description = description.get_text()

I get a (UnicodeEncodeError): 'ascii' codec can't encode character u'\x93'

How do I convert this block of HTML to straight ASCII?

Source

2012-04-22 Stephen

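The character '“' is not an ASCII character. Is your goal to find the most similar character that is ASCII (which is hard), or is your goal simply to remove all non-ASCII characters? Or do you actually want to output correct Unicode, e.g. UTF-8, rather than ASCII? – jogojapan 2012-04-23 02:04:45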

Just remove all non-ASCII characters – Stephen 2012-04-24 21:44:22

Obligatory: http://bit.ly/unipain – Daenyth 2012-05-07 12:55:56

#!/usr/bin/env python 
# -*- coding: utf-8 -*- 

foo = u'With Â“Antonia Polygon Â– StandardÂ”, you have a figure that is unique in the Poser community.She is made available under a Creative Commons License that gives endless opportunities for further development. This figure was developed by a group of talented members of the Poser community in a thirty-month effort. The result is a figure that has very good bending and morphing behavior.' 

print foo.encode('ascii', 'ignore')

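There are three things to note.

The first is the 'ignore' argument to the encode method. It tells the method to drop any character that falls outside the chosen encoding (in this case ascii, to be on the safe side).

The second is that we explicitly make foo a unicode string by prefixing the literal with u.

The third is the explicit file encoding declaration: # -*- coding: utf-8 -*-.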
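As a minimal sketch of the first point, the snippet below compares the default 'strict' error handler with 'ignore'. The sample string and the variable names are illustrative assumptions, not taken from the question's page. The typographic characters are written as \u escapes, so this particular snippet does not depend on a coding declaration.

```python
# Illustrative sample: curly quotes (U+201C, U+201D) and an en dash (U+2013).
text = u'With \u201cAntonia Polygon \u2013 Standard\u201d, you have a figure'

# 'strict' is the default handler: it raises on the first character
# that ASCII cannot represent.
try:
    text.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 'ignore' silently drops every character ASCII cannot represent.
ascii_text = text.encode('ascii', 'ignore').decode('ascii')

print(strict_failed)
print(ascii_text)
```

Note that dropping the en dash leaves two consecutive spaces behind, so the result may still need whitespace cleanup.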

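Also, if you read this answer without reading Daenyth's comment, you're a silly person. If the output is going to be used in HTML/XML, you can use xmlcharrefreplace in place of ignore above, for great justice.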
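A short sketch of the xmlcharrefreplace variant, assuming the text has already been decoded to the intended typographic characters (the sample string is illustrative):

```python
# Illustrative sample containing curly quotes and an en dash.
text = u'\u201cAntonia Polygon \u2013 Standard\u201d'

# 'xmlcharrefreplace' swaps each unencodable character for a numeric
# character reference instead of dropping it.
html_safe = text.encode('ascii', 'xmlcharrefreplace').decode('ascii')

print(html_safe)
```

The result is pure ASCII, yet a browser or XML parser still renders the original characters, so nothing is lost.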

Source

2012-05-07 12:31:04 JosefAssad

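Using 'xmlcharrefreplace' as the second argument would be much better in this case, since he's dealing with html. – Daenyth 2012-05-07 12:54:53

Yes, I agree. I was just being lazy, since the OP said in the comments that he just wants to remove all the misbehaving characters. :) – JosefAssad 2012-05-07 12:59:54

Still, it's worth mentioning, since other people may come across this if they have a similar problem. – Daenyth 2012-05-07 13:02:23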
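For readers who want the other option jogojapan raises, mapping each character to its nearest ASCII equivalent instead of deleting it, one possible sketch follows. The NEAREST_ASCII table and the to_ascii helper are hypothetical names introduced here for illustration. The table is hand-written and deliberately small, so extend it for whatever characters your pages contain.

```python
# Hypothetical, hand-written mapping from code points to ASCII stand-ins.
NEAREST_ASCII = {
    0x2018: u"'",  # left single quotation mark
    0x2019: u"'",  # right single quotation mark
    0x201C: u'"',  # left double quotation mark
    0x201D: u'"',  # right double quotation mark
    0x2013: u'-',  # en dash
    0x2014: u'-',  # em dash
}


def to_ascii(text):
    """Replace known typographic characters, then drop whatever is still non-ASCII."""
    replaced = text.translate(NEAREST_ASCII)
    return replaced.encode('ascii', 'ignore').decode('ascii')


sample = u'With \u201cAntonia Polygon \u2013 Standard\u201d, caf\u00e9'
print(to_ascii(sample))
```

Characters missing from the table (the accented letter in the sample) are still removed by 'ignore', so this only softens the loss for the characters you list.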
