Removing HTML tags and their contents from a string in Python

I'm very new to regular expressions. Basically, I want to use a regular expression to remove <sup> ... </sup> from a string.

Input:

<b>something here</b><sup>1</sup><sup>,3</sup>, another here<sup>1</sup>

Output:

<b>something here</b>, another here

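Is there a simple way to do this, and could you explain how it works?

Note: this question may be a duplicate. I searched but could not find a solution.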


2016-08-19 titipata

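Regular expressions are not the right tool for processing HTML; use an HTML parser. HTML is not a simple string, it is structured data. The easiest one to use is BeautifulSoup, but it is only a wrapper around more efficient libraries, which you can also use directly. –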
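
To illustrate the parser-based approach this comment recommends, here is a minimal sketch using only the standard library's html.parser module instead of BeautifulSoup (a substitution made so the example has no third-party dependency). The names SupStripper and strip_sup are hypothetical, chosen for this example. The sketch re-emits tags with lowercased closing names and drops comments and declarations, so treat it as a starting point rather than a faithful HTML round-trip.

```python
from html.parser import HTMLParser


class SupStripper(HTMLParser):
    """Re-emit HTML while skipping every <sup> element and its contents."""

    def __init__(self):
        # Leave entities such as &amp; unconverted so they can be echoed back.
        super().__init__(convert_charrefs=False)
        self.parts = []      # pieces of the output string
        self.sup_depth = 0   # greater than 0 while inside a <sup> element

    def handle_starttag(self, tag, attrs):
        if tag == "sup":
            self.sup_depth += 1
        elif self.sup_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: echo them once, unless skipped.
        if tag != "sup" and self.sup_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "sup":
            self.sup_depth = max(0, self.sup_depth - 1)
        elif self.sup_depth == 0:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self.sup_depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.sup_depth == 0:
            self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        if self.sup_depth == 0:
            self.parts.append("&#%s;" % name)


def strip_sup(html):
    """Return html with all <sup>...</sup> elements removed."""
    parser = SupStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


s = "<b>something here</b><sup>1</sup><sup>,3</sup>, another here<sup>1</sup>"
print(strip_sup(s))
```

Unlike a simple regular expression, the depth counter also copes with a <sup> nested inside another <sup>.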

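I have a list of short strings like the one above. I would like to use a regular expression, without needing an HTML parser. – titipata

The hard part is knowing how to make a minimal (non-greedy) match rather than a maximal (greedy) match between the tags. This works.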

import re
s0 = "<b>something here</b><sup>1</sup><sup>,3</sup>, another here<sup>1</sup>"
# .*? is non-greedy: it stops at the first </sup> it can reach
prog = re.compile(r'<sup>.*?</sup>')
s1 = prog.sub('', s0)
print(s1)
# <b>something here</b>, another here
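
Since the asker mentions having a list of short strings, the same idea can be applied across a list with one compiled pattern. This is a sketch, not part of the original answer: the sample list and the names SUP_RE, strings and cleaned are made up for illustration, and the slightly broader pattern (tolerating attributes, upper-case tags and line breaks) is an assumption about what such input might contain.

```python
import re

# <sup\b[^>]*>  opening tag, optionally with attributes
# .*?           shortest possible run of content (non-greedy)
# </sup\s*>     closing tag
# IGNORECASE accepts <SUP>; DOTALL lets . match newlines inside the element.
SUP_RE = re.compile(r'<sup\b[^>]*>.*?</sup\s*>', re.IGNORECASE | re.DOTALL)

# Hypothetical sample data.
strings = [
    "<b>something here</b><sup>1</sup><sup>,3</sup>, another here<sup>1</sup>",
    'plain text<SUP class="ref">2</SUP>',
    "no superscripts at all",
]

cleaned = [SUP_RE.sub('', s) for s in strings]
for line in cleaned:
    print(line)
```

Compiling once and reusing the pattern keeps the loop tidy. Note that this still does not handle a <sup> nested inside another <sup>, which is one reason a parser is the more robust choice.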


2016-08-19 19:52:47

Ryan beat me to it with the same answer. –

Thanks @Terry. This is very nice :) – titipata

You can do something like this:

import re
s = "<b>something here</b><sup>1</sup><sup>,3</sup>, another here<sup>1</sup>"

s2 = re.sub(r'<sup>(.*?)</sup>', "", s)

print(s2)
# Prints: <b>something here</b>, another here

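Remember to use (.*?), because (.*) is what is called a greedy quantifier: it matches as much as possible, from the first <sup> to the last </sup>, so you get a different result: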

s2 = re.sub(r'<sup>(.*)</sup>', "", s)

print(s2)
# Prints: <b>something here</b>
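
To see why the two quantifiers differ, it can help to look at what each pattern actually matches before anything is replaced. The following is a small sketch using re.findall on the same input; the variable names lazy and greedy are chosen for this example.

```python
import re

s = "<b>something here</b><sup>1</sup><sup>,3</sup>, another here<sup>1</sup>"

# Non-greedy: each match ends at the nearest </sup>.
lazy = re.findall(r'<sup>.*?</sup>', s)
print(lazy)
# ['<sup>1</sup>', '<sup>,3</sup>', '<sup>1</sup>']

# Greedy: a single match runs from the first <sup> to the last </sup>,
# swallowing the ", another here" text in between.
greedy = re.findall(r'<sup>.*</sup>', s)
print(greedy)
# ['<sup>1</sup><sup>,3</sup>, another here<sup>1</sup>']
```

Replacing the three short matches leaves the surrounding text intact, while replacing the single long match deletes everything after </b>.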


2016-08-19 19:48:43 Ryan

Thanks @Ryan, this is exactly what I was looking for. – titipata
