PHP: get text from HTML

In my code I have the following regular expression to get the values:

preg_match_all('/<title>([^>]*)<\/title>/si', $contents, $match);

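to retrieve the <h>..</h> tags from a web page. But sometimes the headings may contain HTML tags such as <strong>, <b> and so on, so the pattern needs some modification. I therefore tried this one: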

preg_match_all('/<h[1-6]>(.*)<\/h[1-6]>/si', $contents, $match);

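But I was wrong: it does not retrieve the content of <h> tags that contain HTML tags.

Can you help me modify the regular expression correctly?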

Asked 2016-05-13 by Dimitrios Desyllas

[Have you tried a DOM parser?](http://stackoverflow.com/a/1732454/511529) – GolezTrol

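That will fail if the 'h' tags have any attributes. '.*' is also greedy, so if you have more than one heading on the page it will eat everything in between. A parser is your best approach. Take a look at http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php – chris85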
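To make the two failure modes in the comment above concrete, here is a minimal sketch. The sample markup and the variable names `$greedy`, `$lazy` and `$texts` are assumptions for illustration, not part of the question. It contrasts the question's pattern with a variant that tolerates attributes and uses a lazy quantifier; this is still a regex workaround rather than a real parser.

```php
<?php
// Hypothetical sample markup, not taken from the question.
$contents = '<h1 class="top">First <b>bold</b></h1><p>text</p><h2>Second</h2>';

// Pattern from the question: '<h[1-6]>' cannot match a tag that has
// attributes, and the greedy '.*' runs to the last closing heading tag.
preg_match_all('/<h[1-6]>(.*)<\/h[1-6]>/si', $contents, $greedy);

// Variant: '\b[^>]*' allows attributes, '.*?' stops at the first closing
// tag, and '\1' requires the closing level to equal the opening level.
preg_match_all('/<h([1-6])\b[^>]*>(.*?)<\/h\1>/si', $contents, $lazy);

// strip_tags() drops inline markup such as <b> from each captured body.
$texts = array_map('strip_tags', $lazy[2]);
print_r($texts);
```

Headings nested inside other headings, or markup with unclosed tags, will still defeat this pattern, which is the point the commenters are making.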

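As the other posts say, don't use regular expressions to parse HTML unless your HTML is very simple and you don't need to search nested tags. Even then, it is a bad idea. There are DOM parsers ([DOMDocument](https://php.net/domdocument)) for parsing HTML, and they are easy to work with. They have several of the same methods that are available in JS, such as 'getElementsByTagName', which can be used to find every '' tag. –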
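A rough sketch of the DOMDocument approach suggested above might look like the following. The helper name `headingTexts` and the sample input are assumptions made for illustration; only `DOMDocument`, `loadHTML`, `getElementsByTagName` and the libxml error functions are standard PHP.

```php
<?php
// Hypothetical helper (not from the thread): returns the text of every
// h1..h6 element, with inline tags such as <b> or <strong> flattened away.
function headingTexts(string $html): array
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; collect parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $texts = [];
    foreach (['h1', 'h2', 'h3', 'h4', 'h5', 'h6'] as $tag) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            // textContent concatenates the text of all descendant nodes.
            $texts[] = trim($node->textContent);
        }
    }
    return $texts;
}

// Assumed sample input.
$contents = '<h2 class="x">Second <strong>level</strong></h2><h1>First</h1>';
print_r(headingTexts($contents));
```

Note that this version groups results by heading level (all h1 first, then h2, and so on) rather than by position in the page; attributes on the heading tags make no difference to it.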

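// Use explicit / delimiters: in '<h\d>' PHP treats < and > themselves as the delimiters.
preg_match_all('/<h(\d)[^>]*>/i', $contents, $matches);

// $matches[1] holds the captured level digit of each opening heading tag.
$num = array();
foreach ($matches[1] as $level) {
    $num[] = $level;
}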

Answered 2016-05-13 23:43:43 by xpeiro

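When you use (.*) you capture everything. If you only want letters, digits and whitespace, maybe you can use a character class containing them and match one or more: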

preg_match_all('/<h[1-6]>([\w\d\s]+)<\/h[1-6]>/si', $contents, $match);
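As a quick check of what this character class accepts, here is a small sketch on assumed sample markup (the `$contents` string below is hypothetical). It suggests that only headings made purely of word characters and whitespace are captured, so headings with inline tags or punctuation are skipped rather than cleaned up.

```php
<?php
// Hypothetical sample: one plain heading, one with markup, one with punctuation.
$contents = '<h1>Plain heading 1</h1><h2>With <b>markup</b></h2><h3>Done!</h3>';

// Same pattern as in the answer above.
preg_match_all('/<h[1-6]>([\w\d\s]+)<\/h[1-6]>/si', $contents, $match);

// Only the first heading consists solely of word characters and whitespace.
print_r($match[1]);
```

So this pattern is only a fit when the headings are known to be plain text; it does not address the <strong> or <b> case from the question.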

Answered 2016-05-13 21:52:55

Now, no regex expert here, but were he in your shoes, he would do it like this:

<?php 

     // SIMULATED SAMPLE HTML CONTENT - WITH ATTRIBUTES: 
     $contents = '<section id="id-1">And even when darkness covers your path and no one is there to lend a hand; 
      <h3 class="class-1">Always remember that <em>There is always light at the end of the Tunnel <span class="class-2">if you can but hang on to your Faith!</span></em></h3> 
      <div>Now; let no one deceive you: <h2 class="class-2">You will be tried in ever ways - sometimes beyond your limits...</h2></div> 
      <article>But hang on because You are the Voice... You are the Light and you shall rule your Destiny because it is all about<h6 class="class4">YOU - THE REAL YOU!!!</h6></article> 
      </section>'; 

     // SPLIT THE CONTENT AT THE END OF EACH <h[1-6]> TAGS 
     $parts  = preg_split("%<\/h[1-6]>%si", $contents); 
     $matches = array(); 

     // LOOP THROUGH $parts AND BUNDLE APPROPRIATE ELEMENTS TO THE $matches ARRAY.  
     foreach($parts as $part){ 
      if(preg_match("%(.*|.?)(<h)([1-6])%si", $part)){ 
       $matches[] = preg_replace("%(.*|.?)(<)(h[1-6])(.*)%si", "$2$3$4$2/$3>", $part); 
      } 
     } 
     var_dump($matches); 


     //DUMPS:::: 
     array (size=3) 
      0 => string '<h3 class="class-1">Always remember that <em>There is always light at the end of the Tunnel <span class="class-2">if you can but hang on to your Faith!</span></em></h3>' (length=168) 
      1 => string '<h2 class="class-2">You will be tried in ever ways - sometimes beyond your limits...</h2>' (length=89) 
      2 => string '<h6 class="class4">YOU - THE REAL YOU!!!</h6>' (length=45)

As a function, this is what it boils down to:

<?php 

     function pseudoMatchHTags($htmlContentWithHTags){ 
      $parts  = preg_split("%<\/h[1-6]>%si", $htmlContentWithHTags); 
      $matches = array(); 
      foreach($parts as $part){ 
       if(preg_match("%(.*|.?)(<h)([1-6])%si", $part)){ 
        $matches[] = preg_replace("%(.*|.?)(<)(h[1-6])(.*)%si", "$2$3$4$2/$3>", $part); 
       } 
      } 
      return $matches; 
     } 

     var_dump(pseudoMatchHTags($contents));

You can test it here: https://eval.in/571312 ... maybe it helps a bit ... I hope ... ;-)

Answered 2016-05-13 22:45:47 by Poiz
