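Python: remove HTML tags, including HTML entities, but not normal text prefixed with '&'

I want to strip HTML tags, including HTML entities such as &amp;, in Python 2.7. However, my input text also contains normal text that starts with the & character, and I do not want to remove that text. I am trying the top-voted answer from this post: Strip HTML from strings in Python. The only difference is that I replace HTML tags with a space.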

from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ' '.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

print strip_tags('html tags<p>will be&amp;replaced</p>with space. NOT this &abc')
# Now the output is: "html tags will be replaced with space. NOT this "
# The wanted output is: "html tags will be replaced with space. NOT this &abc"

How can I get the correct output?

Source

2015-09-05 DehengYe

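Well, the & character is a special character in HTML, so &abc should really be &amp;abc. The parser's behavior is correct. – e4c5

& can also appear in a URL. If my input text contains URLs, the code above will make them invalid. @e4c5 – DehengYe

Your question says nothing about the input text containing URLs. As for & characters inside href attributes, they are handled correctly by the HTML parser library. – e4c5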
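
To illustrate the last comment, here is a minimal sketch of how the standard parser reports an href value that contains &. It is written for Python 3 (html.parser) rather than the Python 2.7 of the question, and the HrefCollector class, the collect_hrefs function and the example.com URL are hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: records the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with values already unescaped
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.hrefs.append(value)


def collect_hrefs(html_text):
    parser = HrefCollector()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


# The query string keeps its "&" whether or not it was written as "&amp;".
print(collect_hrefs('<a href="http://example.com/?a=1&b=2">link</a>'))
print(collect_hrefs('<a href="http://example.com/?a=1&amp;b=2">link</a>'))
```

This only covers URLs inside attributes. A URL that appears in plain text between tags goes through handle_data like any other text.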

You could try BeautifulSoup:

>>> html = '<div><p>&abc is <b>my</b> input text</p></div>' 
>>> print strip_tags(html) 
is my input text 

>>> from bs4 import BeautifulSoup 
>>> soup = BeautifulSoup(html) 
>>> print soup.text 
&abc is my input text 
>>> soup = BeautifulSoup('=&abc= is my input text') 
>>> soup.text 
u'=&abc= is my input text'

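Note that your strip_tags() does not properly strip the nested <b> tag that I added to your test string.

If you want to stick with the standard HTMLParser, there is another answer to the question you linked to that does a better job. For my test string it outputs &abc; is my input text, i.e. it escapes the standalone &. I am not sure which output you are after.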

Update

This works:

import re
from HTMLParser import HTMLParser
from htmlentitydefs import entitydefs

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
        self.entityref = re.compile('&[a-zA-Z][-.a-zA-Z0-9]*[^a-zA-Z0-9]')

    def handle_data(self, d):
        self.fed.append(d)

    def handle_starttag(self, tag, attrs):
        self.fed.append(' ')

    def handle_endtag(self, tag):
        self.fed.append(' ')

    def handle_entityref(self, name):
        if entitydefs.get(name) is None:
            m = self.entityref.match(self.rawdata.splitlines()[self.lineno-1][self.offset:])
            entity = m.group()
            # semicolon is consumed, other chars are not.
            if entity[-1] != ';':
                entity = entity[:-1]
            self.fed.append(entity)
        else:
            self.fed.append(' ')

    def get_data(self):
        self.close()    # N.B. ensure all buffered data has been processed
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

print strip_tags('html &zzz; tags<p>&zzz &zz: will be&amp;replaced</p>with space. NOT this &abc')

Output

html &zzz; tags &zzz &zz: will be replaced with space. NOT this &abc

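The code adds handlers that replace start and end tags with a single space. Entity references are handled as well: known, valid references are replaced with a space, and unknown references are left unchanged.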
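
For readers on Python 3, where the HTMLParser and htmlentitydefs modules were renamed, the same idea could be sketched roughly as follows. This is not a port of the code above but a different technique under stated assumptions: known entity references are blanked with a regular expression before parsing, and only semicolon-terminated references are considered. The names TagStripper, strip_tags_py3, _ENTITY and _blank_known_entity are invented for this sketch.

```python
import re
from html.entities import name2codepoint
from html.parser import HTMLParser

# Assumption: only semicolon-terminated references such as &amp; or &#38; count.
_ENTITY = re.compile(r'&(#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);')


def _blank_known_entity(match):
    """Replace numeric and known named references with a space; keep the rest."""
    ref = match.group(1)
    if ref.startswith('#') or ref in name2codepoint:
        return ' '
    return match.group(0)


class TagStripper(HTMLParser):
    """Hypothetical Python 3 stripper: every start or end tag becomes a space."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_starttag(self, tag, attrs):
        self.parts.append(' ')

    def handle_endtag(self, tag):
        self.parts.append(' ')


def strip_tags_py3(html_text):
    parser = TagStripper()
    parser.feed(_ENTITY.sub(_blank_known_entity, html_text))
    parser.close()  # flush any text the parser is still buffering
    return ''.join(parser.parts)


print(strip_tags_py3('html &zzz; tags<p>&zzz &zz: will be&amp;replaced</p>'
                     'with space. NOT this &abc'))
```

One caveat with this sketch: the Python 3 parser decodes character references in text by default, so a legacy name written without a semicolon, such as &copy, would still be turned into its character rather than kept or blanked.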

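Another important point is to call close() on the parser before calling get_data(). I put the call inside the get_data() method, although you could add it to the strip_tags() function instead. I don't think it matters if close() is called more than once, so you can call get_data() and then feed more data to the parser.

Source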
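
The effect of close() can be seen in a small experiment. The sketch below uses Python 3's html.parser, which is an assumption since the answer targets Python 2.7, and the Collector class is a hypothetical name. It compares what the parser has reported before and after close() when the input ends in something that looks like an unfinished entity reference.

```python
from html.parser import HTMLParser


class Collector(HTMLParser):
    """Hypothetical helper: stores every text chunk passed to handle_data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


parser = Collector()
parser.feed('NOT this &abc')

# The trailing "&abc" might be the start of a reference that continues in the
# next feed() call, so the parser holds it back for now.
before_close = ''.join(parser.chunks)

parser.close()  # tells the parser that no more input is coming
after_close = ''.join(parser.chunks)

print(repr(before_close))
print(repr(after_close))
```

This is why the question's original code lost the trailing &abc: it read the collected data without ever calling close().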

2015-09-05 01:23:47 mhawke

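That answer replaces HTML tags with nothing. I want to replace HTML tags, but not &abc, with a space, so I changed it to ' '.join(). Can you help? – DehengYe

I get &abc; is my input text when I use your code. There is an unwanted semicolon. – DehengYe

@DehengYe: If you mean that you used the BeautifulSoup example, it produces the output you showed in your question. I have updated the answer to show this. There is no semicolon. Could you show a clear example of where spaces should be added to the string in place of HTML tags? Should start tags, end tags, or both be replaced with a space? – mhawke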
