Removing specific HTML tags with Python

I have some HTML tables inside an HTML cell, like this:

miniTable = '''<table style="width: 100%%" bgcolor="%s">
       <tr><td><font color="%s"><b>%s</b></td></tr>
      </table>''' % (bgcolor, fontColor, floatNumber)

html += '<td>' + miniTable + '</td>'

Is there a way to remove the HTML tags that belong to this miniTable, and only those tags?
I would like to somehow remove these tags:

<table style="width: 100%%" bgcolor="%s"><tr><td><font color="%s"><b> 
and 
</b></td></tr></table>

to get this:

floatNumber

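where floatNumber is the string representation of a floating-point number. I don't want any other HTML tags to be modified in any way. I was thinking of using string.replace or a regular expression, but I'm stumped.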

Source

2012-07-13 jh314

If you can't install and use Beautiful Soup (otherwise BS is preferred, as @Otto Allmendinger suggests):

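import re
# Sample input; the %s placeholders are left unformatted, which does not affect the match
s = '<table style="width: 100%%" bgcolor="%s"><tr><td><font color="%s"><b>1.23</b></td></tr></table>'
# Strip the table, tr, td, font and b tags; "/?" matches the optional slash of a closing tag
result = float(re.sub(r"</?table[^>]*>|</?t[rd]>|<font[^>]+>|</?b>", "", s))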
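The snippet above strips those tag names wherever they occur, so applied to the whole page it would also touch the outer table. As a rough sketch of one way to leave all other markup alone, the pattern below matches only the complete miniTable wrapper and keeps the text inside it. The names `MINI_TABLE` and `unwrap_mini_tables` and the sample colours are made up for illustration, and the pattern assumes the wrapper is emitted exactly as in the question's template (after `%` formatting, `100%%` becomes `100%`).

```python
import re

# Hypothetical pattern: matches one whole miniTable wrapper, as produced by
# the question's template, and captures the text between <b> and </b>.
MINI_TABLE = re.compile(
    r'<table style="width: 100%" bgcolor="[^"]*">\s*'
    r'<tr><td><font color="[^"]*"><b>(.*?)</b></td></tr>\s*'
    r'</table>',
    re.DOTALL,
)


def unwrap_mini_tables(html):
    """Replace each miniTable wrapper with the text it contains."""
    return MINI_TABLE.sub(lambda m: m.group(1).strip(), html)


# Sample page fragment: an outer table cell holding one miniTable.
html = (
    '<table><tr><td>'
    '<table style="width: 100%" bgcolor="#fff">\n'
    '       <tr><td><font color="#000"><b>1.23</b></td></tr>\n'
    '      </table>'
    '</td></tr></table>'
)

# The outer <table>, <tr> and <td> tags are left untouched.
print(unwrap_mini_tables(html))
```

Because the pattern is anchored on the wrapper's exact attributes, any change to the template (extra attributes, different quoting) would need a matching change to the pattern.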

Source

2012-07-13 14:43:20 fedosov

For my application, this works great! If I could use Beautiful Soup, Otto's solution would be great too – jh314 2012-07-13 15:22:29

Do not use str.replace or regex.

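Use an HTML parsing library such as Beautiful Soup to select the elements you want and get the text they contain.

The final code should look something like this: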

from bs4 import BeautifulSoup

# html_doc is the HTML string you want to process
soup = BeautifulSoup(html_doc, "html.parser")

for t in soup.find_all("table"):  # the actual selection depends on your specific code
    content = t.get_text(strip=True)
    # content should be the float number

Source

2012-07-13 14:40:06

Thanks for the quick reply! I'm working in a proprietary development environment, so I can't install and use Beautiful Soup – jh314 2012-07-13 14:43:30

If the HTML is well-formed, you could also try Python's built-in XML parser. – 2012-07-13 14:46:47
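A minimal sketch of that suggestion, using the standard library's `xml.etree.ElementTree`. It assumes the fragment is well-formed XML; the miniTable in the question never closes its `<font>` tag, so a `</font>` has been added here, and the colour values are placeholders.

```python
import xml.etree.ElementTree as ET

# Assumed input: a well-formed version of the miniTable (note the added </font>).
fragment = (
    '<table style="width: 100%" bgcolor="#ffffff">'
    '<tr><td><font color="#000000"><b>1.23</b></font></td></tr>'
    '</table>'
)

root = ET.fromstring(fragment)

# itertext() yields every text node below the element, in document order.
text = "".join(root.itertext()).strip()
value = float(text)
print(value)
```

If the markup is not well-formed, `ET.fromstring` raises `ParseError`, so this approach only fits HTML you generate yourself and can keep strictly balanced.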

Funny, but [BS4 uses 're' to parse XHTML](http://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/element.py#L482). Don't use regex? OK. – fedosov 2012-07-13 14:51:58
