£ displaying as &pound; with urllib2 and Beautiful Soup

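I'm trying to write a small web scraper in Python, and I think I've run into an encoding issue. I want to scrape http://www.resident-music.com/tickets (specifically the table on that page). A row might look like this: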

<tr> 
     <td style="width:64.9%;height:11px;"> 
     <p><strong>the great escape 2017&nbsp; local early bird tickets, selling fast</strong></p> 
     </td> 
     <td style="width:13.1%;height:11px;"> 
     <p><strong>18<sup>th</sup>&ndash; 20<sup>th</sup> may</strong></p> 
     </td> 
     <td style="width:15.42%;height:11px;"> 
     <p><strong>various</strong></p> 
     </td> 
     <td style="width:6.58%;height:11px;"> 
     <p><strong>&pound;55.00</strong></p> 
     </td> 
     </tr>

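I'm basically trying to replace &pound;55.00 with £55.00, and likewise any other 'non-text' nasties.

I've tried a few of the different encoding approaches you can use with BeautifulSoup and urllib2, to no avail. I think I'm just doing it all wrong.

Thanks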


2016-09-30 Ollie

You want to unescape the HTML, which you can do with html.unescape in Python 3:

In [14]: from html import unescape 

In [15]: h = """<tr> 
    ....:   <td style="width:64.9%;height:11px;"> 
    ....:   <p><strong>the great escape 2017&nbsp; local early bird tickets, selling fast</strong></p> 
    ....:   </td> 
    ....:   <td style="width:13.1%;height:11px;"> 
    ....:   <p><strong>18<sup>th</sup>&ndash; 20<sup>th</sup> may</strong></p> 
    ....:   </td> 
    ....:   <td style="width:15.42%;height:11px;"> 
    ....:   <p><strong>various</strong></p> 
    ....:   </td> 
    ....:   <td style="width:6.58%;height:11px;"> 
    ....:   <p><strong>&pound;55.00</strong></p> 
    ....:   </td> 
    ....:  </tr>""" 

In [16]: print(unescape(h)) 
<tr> 
     <td style="width:64.9%;height:11px;"> 
     <p><strong>the great escape 2017  local early bird tickets, selling fast</strong></p> 
     </td> 
     <td style="width:13.1%;height:11px;"> 
     <p><strong>18<sup>th</sup>– 20<sup>th</sup> may</strong></p> 
     </td> 
     <td style="width:15.42%;height:11px;"> 
     <p><strong>various</strong></p> 
     </td> 
     <td style="width:6.58%;height:11px;"> 
     <p><strong>£55.00</strong></p> 
     </td> 
     </tr>

For Python 2, use:

In [6]: from HTMLParser import HTMLParser 

In [7]: unescape = HTMLParser().unescape 

In [8]: print(unescape(h)) 
<tr> 
     <td style="width:64.9%;height:11px;"> 
     <p><strong>the great escape 2017  local early bird tickets, selling fast</strong></p> 
     </td> 
     <td style="width:13.1%;height:11px;"> 
     <p><strong>18<sup>th</sup>– 20<sup>th</sup> may</strong></p> 
     </td> 
     <td style="width:15.42%;height:11px;"> 
     <p><strong>various</strong></p> 
     </td> 
     <td style="width:6.58%;height:11px;"> 
     <p><strong>£55.00</strong></p> 
     </td> 
     </tr>

As you can see, this correctly unescapes all of the entities, not just the pound sign.
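
If the same script has to run on both Python versions, the two imports above can be combined. The following is only a minimal sketch, not part of the original answer: the sample strings are made up for illustration, and it assumes Python 3.4+ for html.unescape (the HTMLParser().unescape method was removed in Python 3.9, so it is only used as the Python 2 fallback).

```python
# Prefer html.unescape (Python 3.4+); fall back to Python 2's HTMLParser module.
try:
    from html import unescape
except ImportError:  # Python 2
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape

# Illustrative sample cell, shortened from the row in the question.
cell = u"<p><strong>&pound;55.00</strong></p>"

# A single replace() only handles the one entity it names...
replaced = cell.replace(u"&pound;", u"\u00a3")

# ...whereas unescape() decodes every named or numeric entity in one pass.
decoded = unescape(u"18&ndash; 20 may &#163;55")

# replaced now holds u"<p><strong>\u00a355.00</strong></p>"
# decoded  now holds u"18\u2013 20 may \u00a355"
```

Note that entities and character encoding are separate concerns: unescape works on text that has already been decoded from the response bytes.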


2016-10-01 00:01:49

I used requests for this, but hopefully you can do the same with urllib2. Here is the code:

#!/usr/bin/env python 
# -*- coding: utf-8 -*- 

import requests 
from BeautifulSoup import BeautifulSoup 

soup = BeautifulSoup(requests.get('your_url').text) 
chart = soup.findAll(name='tr') 
print str(chart).replace('&pound;',unichr(163)) #replace '&pound;' with '£'

Now you should get the expected output!

Sample output:

... 
<strong>£71.50</strong></p> 
...

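There are many ways to do the parsing itself; the interesting part here is print str(chart).replace('&pound;',unichr(163)), which was quite challenging :)

Update

If you want to unescape several characters (or even just one), such as dashes, pound signs and so on, it will be easier and more efficient to use the parser from Padraic's answer. As you can also read in the comments, it sometimes takes care of other encoding issues as well.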
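
To illustrate the parser-based route, here is a rough sketch that uses the standard library's html.parser instead of Beautiful Soup (a swapped-in technique, not the code from either answer). It assumes Python 3; the CellTextParser class and the shortened row are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text of each <td> cell (hypothetical helper for illustration)."""

    def __init__(self):
        # convert_charrefs=True makes the parser pass already-unescaped text
        # to handle_data, so &pound;, &ndash; and &nbsp; arrive as characters.
        super().__init__(convert_charrefs=True)
        self.cells = []     # finished cell strings
        self._parts = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "td" and self._parts is not None:
            # Join the pieces and collapse whitespace runs (including &nbsp;).
            self.cells.append(" ".join("".join(self._parts).split()))
            self._parts = None

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)


# Shortened version of the row from the question.
row = """<tr>
  <td><p><strong>the great escape 2017&nbsp; local early bird tickets</strong></p></td>
  <td><p><strong>18<sup>th</sup>&ndash; 20<sup>th</sup> may</strong></p></td>
  <td><p><strong>various</strong></p></td>
  <td><p><strong>&pound;55.00</strong></p></td>
</tr>"""

parser = CellTextParser()
parser.feed(row)
parser.close()
cells = parser.cells  # one decoded string per <td>, e.g. the last is u"\u00a355.00"
```

Because the parser decodes every entity as it goes, no per-entity replace() call is needed, and the price cell comes out with a real pound sign.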


2016-09-30 22:23:22 coder

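This is not how you want to unescape HTML: it means calling replace for every escaped entity on the page, and the initial str call can itself cause encoding errors. I also would not encourage using BeautifulSoup3. –

I respect your comment, but I disagree. If you look at https://wiki.python.org/moin/EscapingHtml you will see that those ready-made libraries do the same thing as my line of code; the difference is that they hand me a ready-made result, which I personally am not in favour of. In some cases they get the job done, but this is a very specific and simple task. As for 'bs3' versus 'bs4', it doesn't matter for what the OP wants to do. But I respect your opinion too! – coder

*I'm basically trying to replace &pound;55.00 with £55.00, **and any other 'non-text' nasties.*** The *other 'non-text' nasties* are escaped entities, and there could be any number of them. It also matters a great deal that bs3 is broken and no longer maintained. –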
