2012-04-22

get_text() raises UnicodeEncodeError

I have the following HTML:

<div class="dialog"> 
<div class="title title-with-sort-row"> 
    <h2>Description</h2> 
    <div class="dialog-search-sort-bar"> 
    </div> 
</div> 
<div class="content"><div style="margin-right: 20px; margin-left: 30px;"> 
    <span class="description2"> 
     With “Antonia Polygon – Standard”, you have a figure that is unique in the Poser community. 
     She is made available under a Creative Commons License that gives endless opportunities for further development. 
     This figure was developed by a group of talented members of the Poser community in a thirty-month effort. 
     The result is a figure that has very good bending and morphing behavior. 
     <br /> 
    </span> 
</div> 
</div> 

I need to find this div among a number of divs with class="dialog", then pull out the text in the span with class="description2".

When I use this code:

import re

# soup is assumed to be a BeautifulSoup object built from the page
description = soup.find(text=re.compile('Description'))
if description is not None:
    # Climb from the text node: <h2> -> title div -> dialog div
    dialog = description.parent.parent.parent
    description = dialog.find('span', {'class': 'description2'})
    print 'Description: ' + str(description)

I get:

<span class="description2"> 
    With “Antonia Polygon – Standard”, you have a figure that is unique in the Poser community. 
    She is made available under a Creative Commons License that gives endless opportunities for further development. 
    This figure was developed by a group of talented members of the Poser community in a thirty-month effort. 
    The result is a figure that has very good bending and morphing behavior. 
    <br/> 
</span> 

If I try to get just the text, without the HTML and the non-ASCII characters, using

description = description.get_text() 

I get a UnicodeEncodeError: 'ascii' codec can't encode character u'\x93'

How can I convert this HTML block to straight ASCII?

+0

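The character '“' is not an ASCII character. Is your goal to find the most similar ASCII character (which is difficult), or simply to remove all non-ASCII characters? Or do you actually want to output correct Unicode, e.g. UTF-8, rather than ASCII? – jogojapan 2012-04-23 02:04:45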

+0

Just remove all non-ASCII characters – Stephen 2012-04-24 21:44:22

+0

Obligatory: http://bit.ly/unipain – Daenyth 2012-05-07 12:55:56

Answers

#!/usr/bin/env python 
# -*- coding: utf-8 -*- 

foo = u'With “Antonia Polygon – Standard”, you have a figure that is unique in the Poser community. She is made available under a Creative Commons License that gives endless opportunities for further development. This figure was developed by a group of talented members of the Poser community in a thirty-month effort. The result is a figure that has very good bending and morphing behavior.'

print foo.encode('ascii', 'ignore') 

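There are three things to note.

First is the 'ignore' argument to the encode method. It tells the method to drop any character that cannot be represented in the chosen encoding (in this case ascii, to be safe).

Second, we explicitly make foo a unicode string by prefixing the string literal with u.

Third is the explicit file encoding directive: # -*- coding: utf-8 -*-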
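To make the first point concrete, here is a minimal sketch of how the error handlers differ. It is written for Python 3 (an assumption; the answer's code is Python 2), where string literals are already Unicode, so the u prefix is not needed. The sample string is a shortened line from the question, written with escapes so the snippet does not depend on the file encoding.

```python
# Python 3 sketch; the sample text is a shortened line from the question.
text = "With \u201cAntonia Polygon \u2013 Standard\u201d, you have a figure"

# 'strict' is the default handler: it raises on the first non-ASCII character.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 'ignore' silently drops every character that ASCII cannot represent.
ignored = text.encode("ascii", "ignore").decode("ascii")

# 'replace' leaves a visible '?' in place of each unencodable character.
replaced = text.encode("ascii", "replace").decode("ascii")

print(strict_failed)
print(ignored)
print(replaced)
```

Note that 'ignore' leaves a double space where the dash used to be, so you may want to collapse whitespace afterwards.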

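Also, if you read this answer without reading Daenyth's comments, you are being silly. If the output is going to be used in HTML/XML, you can use 'xmlcharrefreplace' instead of 'ignore' above, for great justice. It replaces each unencodable character with a numeric character reference (for example &#8220;) instead of dropping it.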
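A short sketch of the 'xmlcharrefreplace' variant, again assuming Python 3 rather than the Python 2 used above. The result is pure ASCII, but a browser or HTML parser will still render the original characters.

```python
import html

# Shortened sample from the question, written with escapes.
text = "With \u201cAntonia Polygon \u2013 Standard\u201d"

# Each non-ASCII character becomes a numeric character reference.
html_safe = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(html_safe)

# Unescaping the references recovers the original text, so nothing is lost.
round_trip = html.unescape(html_safe)
print(round_trip == text)
```

The trade-off against 'ignore' is that the output is only meaningful to something that understands character references; as plain text it shows up as literal &#8220; sequences.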

+1

Using 'xmlcharrefreplace' as the second argument would be much better in this case, since he is dealing with HTML. – Daenyth 2012-05-07 12:54:53

+0

Yes, I agree. I was just being lazy, since the OP said in the comments that he only wants to remove all the misbehaving characters. :) – JosefAssad 2012-05-07 12:59:54

+1

Still, it's worth mentioning, since other people with a similar problem may come across this. – Daenyth 2012-05-07 13:02:23
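For readers who want the other option jogojapan raised under the question, picking the closest ASCII character instead of deleting, one possible sketch is a small hand-written translation table applied before encoding. This assumes Python 3; the ASCII_FALLBACKS table and the to_ascii helper are hypothetical names made up for this example, and the table covers only a few common punctuation marks.

```python
# Hypothetical fallback table: typographic punctuation -> plain ASCII.
ASCII_FALLBACKS = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
}


def to_ascii(text):
    """Map known punctuation to ASCII, then drop whatever is still non-ASCII."""
    table = str.maketrans(ASCII_FALLBACKS)
    return text.translate(table).encode("ascii", "ignore").decode("ascii")


sample = "With \u201cAntonia Polygon \u2013 Standard\u201d"
print(to_ascii(sample))
```

Anything not listed in the table is still dropped by 'ignore', so extend the table to suit your data.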