2017-10-18 93 views
2

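Regex pattern to remove parentheses (and any parentheses nested inside them)

The input is the first paragraph of a Wikipedia page. I want to remove every pair of parentheses together with anything between them.

However, sometimes (in fact, often) the HTML content inside the parentheses itself contains one or more parentheses, usually in a link's href="".

Take the following: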

<p> 
    The <b>Sarcopterygii</b> or <b>lobe-finned fish</b> (from Greek σαρξ <i>sarx</i>, flesh, and πτερυξ <i>pteryx</i>, fin) – sometimes considered synonymous with <b>Crossopterygii</b> ("fringe-finned fish", from Greek κροσσός <i>krossos</i>, fringe) – constitute a <a href="/wiki/Clade" title="Clade">clade</a> (traditionally a <a href="/wiki/Class_(biology)" title="Class (biology)">class</a> or subclass) of the <a href="/wiki/Osteichthyes" title="Osteichthyes">bony fish</a>, though a strict <a href="/wiki/Cladistic" class="mw-redirect" title="Cladistic">cladistic</a> view includes the terrestrial <a href="/wiki/Vertebrate" title="Vertebrate">vertebrates</a>. 
</p> 

I want the final result to be:

<p> 
    The <b>Sarcopterygii</b> or <b>lobe-finned fish</b> – sometimes considered synonymous with <b>Crossopterygii</b> – constitute a <a href="/wiki/Clade" title="Clade">clade</a> of the <a href="/wiki/Osteichthyes" title="Osteichthyes">bony fish</a>, though a strict <a href="/wiki/Cladistic" class="mw-redirect" title="Cladistic">cladistic</a> view includes the terrestrial <a href="/wiki/Vertebrate" title="Vertebrate">vertebrates</a>. 
</p> 

But when I use the following preg_replace pattern it doesn't work, because it gets confused by the parentheses inside the parentheses.

public function removeParentheses($content) { 

    $pattern = '@\(.*?\)@'; 
    $content = preg_replace($pattern, '', $content); 
    $content = str_replace(' .', '.', $content); 
    $content = str_replace('  ', ' ', $content); // collapse double spaces left behind by the removal
    return $content; 
} 

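Second, how can I leave the parentheses inside a link's href="" and title="" intact? Those matter whenever the link itself is not inside a parenthesized piece of text.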

+1

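Regular expressions can't handle recursion. If you have a recursive pattern (parentheses inside parentheses...) you need more logic, i.e. write a parser – Philipp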

+1

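Don't parse HTML with regex. As @Philipp said, it can't do this effectively (sure, you can hack together a version that works, but I guarantee you can break it with something obscure in the HTML). Use an XML parser such as [SimpleXML](http://php.net/manual/en/simplexml.examples.php) – ctwheels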

+0

If you are trying to parse HTML with PHP, you may want to refer to https://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php for a list of tools – Jeff

Answers

2

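You can replace all the links with placeholders, then remove all the parentheses, and at the end swap the placeholders back for their original values.

This is done with preg_replace_callback(), passing in an occurrence counter and a replacements array to keep track of the links, then using your own removeParentheses() to get rid of the parentheses, and finally str_replace() with array_keys() and array_values() to get your links back: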

<?php 
$string = '<p> 
The <b>Sarcopterygii</b> or <b>lobe-finned fish</b> (from Greek σαρξ <i>sarx</i>, flesh, and πτερυξ <i>pteryx</i>, fin) – sometimes considered synonymous with <b>Crossopterygii</b> ("fringe-finned fish", from Greek κροσσός <i>krossos</i>, fringe) – constitute a <a href="/wiki/Clade" title="Clade">clade</a> (traditionally a <a href="/wiki/Class_(biology)" title="Class (biology)">class</a> or subclass) of the <a href="/wiki/Osteichthyes" title="Osteichthyes">bony fish</a>, though a strict <a href="/wiki/Cladistic" class="mw-redirect" title="Cladistic">cladistic</a> view includes the terrestrial <a href="/wiki/Vertebrate" title="Vertebrate">vertebrates</a>. 
</p>'; 
$occurrences = 0; 
$replacements = []; 
$replacedString = preg_replace_callback("/<a .*?>.*?<\/a>/i", function($el) use (&$occurrences, &$replacements) { 
    // the ||| delimiters avoid unwanted matches; the trailing ||| keeps |||1 from matching inside |||10 
    $replacements["|||".$occurrences."|||"] = $el[0]; 
    return "|||".$occurrences++."|||"; 
}, $string); 
function removeParentheses($content) { 
    $pattern = '@\(.*?\)@'; 
    $content = preg_replace($pattern, '', $content); 
    $content = str_replace(' .', '.', $content); 
    $content = str_replace('  ', ' ', $content); // collapse double spaces left behind by the removal
    return $content; 
} 
$replacedString = removeParentheses($replacedString); 
$replacedString = str_replace(array_keys($replacements), array_values($replacements), $replacedString); // get your links back 
echo $replacedString; 

Demo

Result:
<p> 
The <b>Sarcopterygii</b> or <b>lobe-finned fish</b> – sometimes considered synonymous with <b>Crossopterygii</b> – constitute a <a href="/wiki/Clade" title="Clade">clade</a> of the <a href="/wiki/Osteichthyes" title="Osteichthyes">bony fish</a>, though a strict <a href="/wiki/Cladistic" class="mw-redirect" title="Cladistic">cladistic</a> view includes the terrestrial <a href="/wiki/Vertebrate" title="Vertebrate">vertebrates</a>. 
</p> 

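This looks brittle to me, though. As others have told you in the comments, you shouldn't parse HTML with regular expressions. A lot can change in the markup, and you can get unexpected results. Still, this might point you in the right direction.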
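
For comparison, here is a minimal sketch of the parser-based route the comments recommend, using PHP's DOMDocument instead of a regex. It is not part of the original answer: the names stripParenthesised() and removeParenthesesDom() are hypothetical, and it assumes the input contains a single <p> element and that parentheses are balanced inside any element that gets dropped. Because only text nodes are scanned, parentheses in href="" and title="" are never touched.

```php
<?php
// Hypothetical helper (not from the answer): walks $node's children and drops
// everything between "(" and ")" in text, carrying the depth across siblings.
function stripParenthesised(DOMNode $node, int &$depth): void
{
    // Snapshot the live child list, since nodes may be removed during the loop.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($child instanceof DOMText) {
            $kept = '';
            // preg_split with the u flag yields whole UTF-8 characters.
            foreach (preg_split('//u', $child->data, -1, PREG_SPLIT_NO_EMPTY) as $char) {
                if ($char === '(') {
                    $depth++;
                } elseif ($char === ')' && $depth > 0) {
                    $depth--;
                } elseif ($depth === 0) {
                    $kept .= $char; // outside any parenthesis: keep the character
                }
            }
            $child->data = $kept;
        } elseif ($child instanceof DOMElement) {
            if ($depth > 0) {
                // The element sits inside an open parenthesis: drop it entirely.
                $node->removeChild($child);
            } else {
                // Recurse into the element; its attributes are never inspected.
                stripParenthesised($child, $depth);
            }
        }
    }
}

// Hypothetical wrapper: processes the first <p> found in $html.
function removeParenthesesDom(string $html): string
{
    $doc = new DOMDocument();
    // Prefixing an XML encoding declaration is a common workaround so that
    // libxml reads the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    $paragraph = $doc->getElementsByTagName('p')->item(0);
    if ($paragraph === null) {
        return $html; // nothing to process
    }
    $depth = 0;
    stripParenthesised($paragraph, $depth);
    return (string) $doc->saveHTML($paragraph);
}

$sample = '<p>Fish (from <i>sarx</i>, flesh) are a '
    . '<a href="/wiki/Class_(biology)" title="Class (biology)">class</a>'
    . ' (or (sub)class) of animals.</p>';
echo removeParenthesesDom($sample), PHP_EOL;
```

The sketch leaves double spaces where a parenthesized span was removed, so the same str_replace() clean-up used in removeParentheses() would still be needed afterwards.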

Edit: regarding parentheses inside parentheses, you can use a recursive pattern. Take a look at this great answer by Bart Kiers:

function removeParentheses($content) { 
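    // "(" followed by any mix of non-parenthesis characters or a nested match of the whole pattern (?R), then ")" 
    $pattern = '@\(([^()]|(?R))*\)@'; 
    $content = preg_replace($pattern, '', $content); 
    $content = str_replace(' .', '.', $content); 
    $content = str_replace('  ', ' ', $content); // collapse double spaces left behind by the removal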
    return $content; 
} 

Demo
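
To make the difference between the two patterns concrete, the following small sketch runs both on a string with one nested pair. The sample string and the variable names are made up for illustration; only the two patterns come from the posts above.

```php
<?php
// Illustrative input with one pair of parentheses nested inside another.
$text = 'a clade (traditionally a class (biology) or subclass) of the bony fish';

// Lazy pattern from the question: the match ends at the FIRST ")" it meets,
// so the outer pair is cut short and " or subclass)" is left behind.
$lazy = preg_replace('@\(.*?\)@', '', $text);

// Recursive pattern: (?R) re-enters the whole pattern for each nested pair,
// so the match ends at the ")" that balances the opening "(".
$recursive = preg_replace('@\(([^()]|(?R))*\)@', '', $text);

echo $lazy, PHP_EOL;      // a clade  or subclass) of the bony fish
echo $recursive, PHP_EOL; // a clade  of the bony fish
```

Both results still contain a double space where the parenthesized text used to be, which is what the str_replace() calls in removeParentheses() clean up.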

+0

This does not handle the parentheses-inside-parentheses problem that the user asked about, only the problem of parentheses in links. https://3v4l.org/VDebj – Jeff

+0

@Jeff Thanks. It does now. – ishegg
