Escaping special HTML characters in Python

I have a string in which special characters such as ' or " or & (...) can appear. In the string:

string = """ Hello "XYZ" this 'is' a test & so on """

how can I automatically escape every special character, so that I get this:

string = " Hello &quot;XYZ&quot; this &#39;is&#39; a test &amp; so on "


2010-01-16 creativz

In Python 3.2 and later, you can use the html.escape function, e.g.

>>> string = """ Hello "XYZ" this 'is' a test & so on """ 
>>> import html 
>>> html.escape(string) 
' Hello &quot;XYZ&quot; this &#x27;is&#x27; a test &amp; so on '
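As a small sketch of the two knobs around this call: html.escape takes a quote argument (True by default), and html.unescape (available from Python 3.4) reverses the conversion. The variable names below are only illustrative. Note also that cgi.escape, mentioned further down, was deprecated in Python 3.2 and removed in Python 3.8, so html.escape is the one to reach for on current versions.

```python
import html

sample = """ Hello "XYZ" this 'is' a test & so on """

# Default: &, <, > and both kinds of quotes are escaped.
escaped_all = html.escape(sample)

# quote=False leaves " and ' alone, which is enough for text
# that is placed between tags rather than inside an attribute.
escaped_text_only = html.escape(sample, quote=False)

# html.unescape turns the entities back into characters.
round_trip = html.unescape(escaped_all)

print(escaped_all)
print(escaped_text_only)
print(round_trip == sample)
```

If the escaped string ends up inside an attribute value, keep the default quote=True.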

For earlier versions of Python, check http://wiki.python.org/moin/EscapingHtml:

The cgi module that comes with Python has an escape() function:
import cgi 

s = cgi.escape("""& < >""") # s = "&amp; &lt; &gt;" 
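However, this does not escape any characters beyond &, < and >. If it is used as cgi.escape(string_to_escape, quote=True), it also escapes ".

Here is a small snippet that lets you escape quotes and apostrophes as well: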
html_escape_table = { 
    "&": "&amp;", 
    '"': "&quot;", 
    "'": "&apos;", 
    ">": "&gt;", 
    "<": "&lt;", 
    } 

def html_escape(text): 
    """Produce entities within text.""" 
    return "".join(html_escape_table.get(c,c) for c in text) 
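You can also use escape() from xml.sax.saxutils to escape HTML. This function should execute faster. The unescape() function of the same module can be passed the same kind of arguments to decode a string.
from xml.sax.saxutils import escape, unescape 
# escape() and unescape() take care of &, < and >. 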
html_escape_table = { 
    '"': "&quot;", 
    "'": "&apos;" 
} 
html_unescape_table = {v:k for k, v in html_escape_table.items()} 

def html_escape(text): 
    return escape(text, html_escape_table) 

def html_unescape(text): 
    return unescape(text, html_unescape_table) 
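A quick round trip with these two helpers, as a self-contained sketch (the definitions are repeated so the snippet runs on its own; the sample string is the one from the question):

```python
from xml.sax.saxutils import escape, unescape

# Extra entities on top of the &, < and > that saxutils handles itself.
html_escape_table = {'"': "&quot;", "'": "&apos;"}
html_unescape_table = {v: k for k, v in html_escape_table.items()}

def html_escape(text):
    return escape(text, html_escape_table)

def html_unescape(text):
    return unescape(text, html_unescape_table)

sample = """ Hello "XYZ" this 'is' a test & so on """
encoded = html_escape(sample)
print(encoded)
print(html_unescape(encoded) == sample)
```

Keep in mind that this table emits &apos;, which is an XML entity; swap in "&#39;" if the output has to be valid HTML 4.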


2010-01-16 12:30:29 kennytm

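Thanks for `quote=True` in `cgi.escape` – sidx 2015-12-29 11:16:12

Note that some of your replacements are not HTML-compliant. For example, see https://www.w3.org/TR/xhtml1/#C_16: instead of &apos;, use &#39;. I guess some of the others were added to the HTML4 standard, but that one was not. – leetNightshade 2017-11-30 00:32:54

The cgi.escape method converts special characters to valid HTML entities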

import cgi 
original_string = 'Hello "XYZ" this \'is\' a test & so on ' 
escaped_string = cgi.escape(original_string, True) 
print(original_string) 
print(escaped_string)

This results in

Hello "XYZ" this 'is' a test & so on 
Hello &quot;XYZ&quot; this 'is' a test &amp; so on

The optional second parameter of cgi.escape escapes double quotes. By default, they are not escaped.


2010-01-16 12:34:34

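I don't see why cgi.escape is so squeamish about converting double quotes, and ignores single quotes completely. – 2010-01-16 13:11:24

Because quotes don't need to be escaped in PCDATA, but they *do* need to be escaped in attributes (which most often use the double-quote delimiter), and the former case is much more common than the latter. In general, this is a textbook 90% solution (more like >99%). If you have to save every last byte and want the kind of quoting to be chosen dynamically, use `xml.sax.saxutils.quoteattr()`. – 2010-01-16 13:16:29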
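To illustrate the quoteattr() suggestion, here is a minimal sketch; the sample values are made up for the demonstration. The function returns the value already wrapped in quotes, picking the delimiter that needs the least escaping.

```python
from xml.sax.saxutils import quoteattr

samples = [
    'plain',            # no quotes at all
    'say "hi"',         # only double quotes
    "it's",             # only a single quote
    '"a" & \'b\'',      # both kinds of quotes plus an ampersand
]

for value in samples:
    # quoteattr() returns the attribute value including its delimiters,
    # so it can be dropped straight after name= in a tag.
    print('<input value=%s>' % quoteattr(value))
```

Only when both kinds of quote appear in the value does it fall back to double-quote delimiters with &quot; inside.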

A simple string function will do it:

def escape(t): 
    """HTML-escape the text in `t`.""" 
    return (t 
     .replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;") 
     .replace("'", "&#39;").replace('"', "&quot;") 
     )

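The other answers in this thread have minor problems: for some reason the cgi.escape method ignores single quotes, and you have to ask it explicitly to handle double quotes. The linked wiki page handles all five characters, but uses the XML entity &apos;, which is not an HTML entity.

This function handles all five, all the time, using HTML-standard entities.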
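One detail worth spelling out about a chained replace like this is that the ampersand has to be handled first. The sketch below compares the ordering with a deliberately wrong variant; the function names escape_amp_first and escape_amp_last are made up for the illustration.

```python
def escape_amp_first(t):
    """Replace & before anything else, so later entities stay intact."""
    return (t
        .replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
        .replace("'", "&#39;").replace('"', "&quot;")
        )

def escape_amp_last(t):
    """Deliberately wrong: & is replaced after the other characters."""
    return (t
        .replace("<", "&lt;").replace(">", "&gt;")
        .replace("'", "&#39;").replace('"', "&quot;")
        .replace("&", "&amp;")
        )

sample = '<a href="x">'
print(escape_amp_first(sample))  # &lt;a href=&quot;x&quot;&gt;
print(escape_amp_last(sample))   # &amp;lt;a href=&amp;quot;x&amp;quot;&amp;gt;
```

In the second variant the ampersands introduced by the earlier replacements get escaped again, so a browser would display the entity text instead of the original characters.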


2010-01-16 13:10:04

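The other answers here will help with characters such as the ones you listed and a few others. However, if you also want to convert everything else to entity names, you will have to do something else. For instance, if á needs to be converted to &aacute;, neither cgi.escape nor html.escape will help you. You will want to do something like the following, which uses html.entities.entitydefs; that is just a dictionary. (The code below was written for Python 3.x, with a partial attempt at making it compatible with 2.x to give you an idea):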

# -*- coding: utf-8 -*- 

import sys 

if sys.version_info[0]>2: 
    from html.entities import entitydefs 
else: 
    from htmlentitydefs import entitydefs 

text=";\"áèïøæỳ" #This is your string variable containing the stuff you want to convert 
text=text.replace(";", "$ஸ$") #$ஸ$ is just something random the user isn't likely to have in the document. We're converting it so it doesn't convert the semi-colons in the entity name into entity names. 
text=text.replace("$ஸ$", "&semi;") #Converting semi-colons to entity names 

if sys.version_info[0]>2: #Using appropriate code for each Python version. 
    for k,v in entitydefs.items(): 
        if k not in {"semi", "amp"}: 
            text=text.replace(v, "&"+k+";") #You have to add the & and ; manually. 
else: 
    for k,v in entitydefs.iteritems(): 
        if k not in {"semi", "amp"}: 
            text=text.replace(v, "&"+k+";") #You have to add the & and ; manually. 

#The above code doesn't cover every single entity name, although I believe it covers everything in the Latin-1 character set. So, I'm manually doing some common ones I like hereafter: 
text=text.replace("ŷ", "&ycirc;") 
text=text.replace("Ŷ", "&Ycirc;") 
text=text.replace("ŵ", "&wcirc;") 
text=text.replace("Ŵ", "&Wcirc;") 
text=text.replace("ỳ", "&#7923;") 
text=text.replace("Ỳ", "&#7922;") 
text=text.replace("ẃ", "&wacute;") 
text=text.replace("Ẃ", "&Wacute;") 
text=text.replace("ẁ", "&#7809;") 
text=text.replace("Ẁ", "&#7808;") 

print(text) 
#Python 3.x outputs: &semi;&quot;&aacute;&egrave;&iuml;&oslash;&aelig;&#7923; 
#The Python 2.x version outputs the wrong stuff. So, clearly you'll have to adjust the code somehow for it.
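A different way to get a similar result, offered only as a sketch for Python 3: walk the string once and look each character up in html.entities.codepoint2name, falling back to a numeric reference for non-ASCII characters that have no HTML 4 name. The function name to_named_entities is made up here. Unlike the code above, this version leaves semicolons untouched and turns & into &amp;, so it needs no placeholder trick.

```python
from html.entities import codepoint2name

def to_named_entities(text):
    """Replace characters with named entities where an HTML 4 name exists."""
    pieces = []
    for ch in text:
        code = ord(ch)
        if code in codepoint2name:
            # Covers &, <, >, " and the named Latin-1 / symbol characters.
            pieces.append("&" + codepoint2name[code] + ";")
        elif code > 127:
            # No HTML 4 name known: fall back to a numeric reference.
            pieces.append("&#%d;" % code)
        else:
            pieces.append(ch)
    return "".join(pieces)

print(to_named_entities('"áèïøæỳ'))
# &quot;&aacute;&egrave;&iuml;&oslash;&aelig;&#7923;
```

Because each character is visited exactly once, the ampersands and semicolons produced for one entity can never be re-escaped by a later replacement.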


2014-06-23 19:20:41 Shule
