Regular expression: strip HTML attributes except SRC

I'm trying to write a regular expression that will strip all tag attributes except the SRC attribute. For example:

<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>

Would be returned as:

<p>This is a paragraph with an image <img src="/path/to/image.jpg" /></p>

I have a regular expression that strips all attributes, but I'm trying to tweak it to leave SRC in place. Here's what I have so far:

<?php preg_replace('/<([A-Z][A-Z0-9]*)(\b[^>]*)>/i', '<$1>', '<html><goes><here>');

I'm using PHP's preg_replace() for this.

Thanks! Ian

Source

2010-06-08 Ian McIntyre Silber

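You can parse HTML with regular expressions. Not all HTML. But if you know exactly what you are receiving, you can use a regex. This is a religious war, waged by people who assume infinite stack and memory in every situation. – 2010-06-08 08:32:32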

OK, here's what I ended up using, and it seems to work well:

<([A-Z][A-Z0-9]*)(\b[^>src]*)(src\=[\'|"|\s]?[^\'][^"][^\s]*[\'|"|\s]?)?(\b[^>]*)>

Feel free to poke any holes in it.

Source

2010-06-08 21:32:41

You usually should not parse HTML using regular expressions.

Instead, call DOMDocument::loadHTML.
You can then recurse through the elements in the document and call removeAttribute.
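
The DOM approach described above could look roughly like the following. This is a minimal sketch, not a definitive implementation: the function name stripAttributesExceptSrc is made up for illustration, the fragment is wrapped in html/body tags so the result can be pulled back out of the body, and the input is assumed to be plain ASCII (loadHTML has its own ideas about character encodings).

```php
<?php
// Hypothetical helper: remove every attribute except src, using DOM instead of a regex.
function stripAttributesExceptSrc(string $html): string
{
    $doc = new DOMDocument();

    // Real-world markup is often sloppy; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<html><body>' . $html . '</body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//*') as $element) {
        // Collect the names first: removing attributes while iterating
        // the live attribute map can skip entries.
        $names = [];
        foreach ($element->attributes as $attribute) {
            $names[] = $attribute->nodeName;
        }
        foreach ($names as $name) {
            if (strtolower($name) !== 'src') {
                $element->removeAttribute($name);
            }
        }
    }

    // Serialize only the children of <body>, i.e. the original fragment.
    $body = $doc->getElementsByTagName('body')->item(0);
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo stripAttributesExceptSrc(
    '<p id="paragraph" class="green">This is a paragraph with an image '
    . '<img src="/path/to/image.jpg" width="50" height="75"/></p>'
), "\n";
```

Because the output is re-serialized as HTML, the img tag comes back without the self-closing slash; a '>' inside an attribute value is not a problem here, since the parser handles the quoting.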

Source

2010-06-08 02:34:55 SLaks

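Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. – fmark 2010-06-08 04:25:42

You can parse HTML with regular expressions. Not all HTML. But if you know exactly what you are receiving, you can use a regex. This is a religious war, waged by people who assume infinite stack and memory in every situation. – 2010-06-08 08:32:59

Some people have a terrible habit of not answering the question and indulging in mantras instead. This should be downvoted, not upvoted by the religious right. – 2010-06-08 08:33:33

Unfortunately, I'm not sure how to answer this question for PHP. If I were using Perl I would do the following: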

use strict; 
my $data = q^<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>^; 

$data =~ s{ 
    <([^/> ]+)([^>]*)> # split into tag name and attributes; * lets attribute-less tags match whole 
}{ 
    my $attribs = $2; 
    my @parts = split(/\s+/, $attribs); # separate by whitespace (breaks values containing spaces) 
    @parts = grep { m/^src=/i } @parts; # retain just the src attributes 
    if (@parts) { 
     "<" . join(" ", $1, @parts) . ">"; 
    } else { 
     "<" . $1 . ">"; 
    } 
}xseg; 

print($data);

<p>This is a paragraph with an image <img src="/path/to/image.jpg"></p>
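
Since the question asks for PHP, the same idea can be sketched with preg_replace_callback. This is a rough port under stated assumptions, not a tested drop-in: the function name keepOnlySrc is invented for the example, the attribute group uses [^>]* so that attribute-less tags such as <br> are matched whole, and attribute values are assumed to contain no whitespace and no '>' character.

```php
<?php
// Hypothetical PHP port of the Perl substitution above.
function keepOnlySrc(string $html): string
{
    return preg_replace_callback(
        '/<([^\/> ]+)([^>]*)>/', // $m[1] = tag name, $m[2] = raw attribute text
        function (array $m): string {
            // Split the attribute text on whitespace, dropping empty pieces.
            $parts = preg_split('/\s+/', $m[2], -1, PREG_SPLIT_NO_EMPTY);
            // Keep only the tokens that start with src= (case-insensitive).
            $parts = array_filter($parts, function (string $part): bool {
                return preg_match('/^src=/i', $part) === 1;
            });
            // Rebuild the tag from its name plus whatever survived the filter.
            return '<' . implode(' ', array_merge([$m[1]], $parts)) . '>';
        },
        $html
    );
}

echo keepOnlySrc(
    '<p id="paragraph" class="green">This is a paragraph with an image '
    . '<img src="/path/to/image.jpg" width="50" height="75"/></p>'
), "\n";
// prints: <p>This is a paragraph with an image <img src="/path/to/image.jpg"></p>
```

Closing tags such as </p> are left alone because the tag-name group cannot start with a slash. As in the Perl version, the self-closing slash on the img tag is dropped.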

Source

2010-06-08 08:40:59

This might work for your needs:

$text = '<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>'; 

echo preg_replace("/<([a-z][a-z0-9]*)(?:[^>]*(\ssrc=['\"][^'\"]*['\"]))?[^>]*?(\/?)>/i",'<$1$2$3>', $text); 

// <p>This is a paragraph with an image <img src="/path/to/image.jpg"/></p>

Regular expression breakdown:

/    # Start Pattern 
<    # Match '<' at beginning of tags 
(   # Start Capture Group $1 - Tag Name 
    [a-z]   # Match 'a' through 'z' 
    [a-z0-9]*  # Match 'a' through 'z' or '0' through '9' zero or more times 
)    # End Capture Group 
(?:   # Start Non-Capture Group 
    [^>]*   # Match anything other than '>', Zero or More Times 
    (   # Start Capture Group $2 - ' src="...."' 
        \s   # Match one whitespace 
        src=   # Match 'src=' 
        ['"]   # Match ' or " 
        [^'"]*  # Match anything other than ' or " 
        ['"]   # Match ' or " 
    )    # End Capture Group 2 
)?   # End Non-Capture Group, match group zero or one time 
[^>]*?  # Match anything other than '>', Zero or More times, not-greedy (wont eat the /) 
(\/?)   # Capture Group $3 - '/' if it is there 
>    # Match '>' 
/i   # End Pattern - Case Insensitive

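Add some quoting, and with the replacement text <$1$2$3> it should strip any non-src= attributes from well-formed HTML tags.

Please note that this isn't necessarily going to work on ALL input, as the anti-HTML + RegExp people so cleverly note below.

Source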

2010-06-08 21:52:53 gnarf

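Unless '>' appears in an attribute value. Parsing evil HTML is _hard_. Also, you forgot to escape the '\'. – SLaks 2010-06-08 22:09:38

Which '\' did I forget to escape? – gnarf 2010-06-08 22:16:43

+1 for a great explanation of the expression. – Anthony 2012-07-27 15:32:23

There are a few pitfalls, most notably that <p style=">"> ends up as <p>">, plus a few other broken cases... I would recommend looking at Zend_Filter_StripTags as a more foolproof tag/attribute filter. As mentioned above, you shouldn't use regular expressions to parse HTML or XML.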
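
The <p style=">"> case can be reproduced with the pattern from the preg_replace answer. The snippet below is only a demonstration of that failure mode: the pattern is copied from that answer and rewritten as a single-quoted PHP string, and the helper name stripWithRegex is made up for the example.

```php
<?php
// Pattern from the preg_replace answer, as a single-quoted string.
$pattern = '/<([a-z][a-z0-9]*)(?:[^>]*(\ssrc=[\'"][^\'"]*[\'"]))?[^>]*?(\/?)>/i';

// Hypothetical wrapper so both inputs go through the same replacement.
function stripWithRegex(string $pattern, string $html): string
{
    return preg_replace($pattern, '<$1$2$3>', $html);
}

// Well-formed input behaves as intended.
echo stripWithRegex($pattern, '<img src="a.jpg" width="50"/>'), "\n";
// prints: <img src="a.jpg"/>

// A '>' inside a quoted attribute value ends the match too early,
// so the rest of the attribute leaks into the text.
echo stripWithRegex($pattern, '<p style=">">text</p>'), "\n";
// prints: <p>">text</p>
```

The pattern has no notion of quoting, so the first '>' it sees is treated as the end of the tag.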

I would handle your example with str_replace(), if the input is always the same.

$str = '<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>'; 

$str = str_replace('id="paragraph" class="green"', "", $str); 

$str = str_replace('width="50" height="75"',"",$str);

Source

2010-06-08 22:28:54 streetparade

Posting a solution for Oracle regular expressions:

<([^!][a-z][a-z0-9]*)([^>]*(\ssrc=[''''\"][^''''\"]*[''''\"]))?[^>]*?(\/?)>

Source

2015-06-17 04:37:09
