2016-05-13
0

In my PHP code I have the following regular expression to get values from HTML text:

preg_match_all('/<title>([^>]*)<\/title>/si', $contents, $match); 

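which retrieves <h>..</h> tags from a web page. But sometimes the content may also contain HTML tags such as <strong>, <b> and so on, so the expression needs some changes. I therefore tried this one: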

preg_match_all('/<h[1-6]>(.*)<\/h[1-6]>/si', $contents, $match); 

However, I got it wrong: it does not retrieve <h> content that contains HTML tags.

Can you help me correct the regular expression?

+7

[Have you tried using a DOM parser?](http://stackoverflow.com/a/1732454/511529) – GolezTrol

+4

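This will fail if the 'h' tags have any attributes. '.*' is also greedy, so if there is more than one heading on the page it will eat everything between the first opening tag and the last closing tag. A parser is your best approach. Take a look at http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php – chris85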
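Both failure modes from this comment can be reproduced with a short sketch. The sample string and the variable names `$greedy` and `$lazy` are made up for illustration. The adjusted pattern (an opening tag that tolerates attributes, a lazy `.*?`, and a `\1` backreference so the closing level matches the opening one) is only one possible tweak, and it is still not a robust way to parse HTML.

```php
<?php
// Hypothetical sample: three headings, the second with an attribute and nested markup.
$contents = '<h1>First</h1><p>text</p><h2 class="sub">Second <b>bold</b></h2><h3>Third</h3>';

// Pattern from the question: no attributes allowed, greedy (.*).
preg_match_all('/<h[1-6]>(.*)<\/h[1-6]>/si', $contents, $greedy);

// Adjusted pattern (an assumption, not the asker's code):
//   \b[^>]*  tolerates attributes on the opening tag
//   (.*?)    lazy, stops at the first matching closing tag
//   \1       closing level must equal the captured opening level
preg_match_all('/<h([1-6])\b[^>]*>(.*?)<\/h\1>/si', $contents, $lazy);

// One oversized match running from the first <h1> to the last </h3>.
print_r($greedy[1]);

// One entry per heading, with the inner HTML kept.
print_r($lazy[2]);
```

Here the greedy pattern yields a single match spanning the whole string, while the adjusted one yields `First`, `Second <b>bold</b>` and `Third`.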

+1

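As the other posts say, don't use regular expressions to parse HTML unless your HTML is very simple and you don't need to search nested tags. Even then, it's a bad idea. There are DOM parsers ([DOMDocument](https://php.net/domdocument)) for parsing HTML, and they are easy to work with. They offer several of the same methods that are available in JS, such as 'getElementsByTagName', which can be used to find every heading tag. –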
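A minimal sketch of the DOMDocument route, assuming the `dom` extension is available. The sample markup and the `$headings` array layout are made up for illustration. Note that looping over tag names groups the results by heading level rather than by document order.

```php
<?php
// Hypothetical sample; in the question this would be the fetched page in $contents.
$contents = '<div><h1 class="a">Title <strong>one</strong></h1><p>text</p><h3>Sub</h3></div>';

// Collect libxml warnings internally instead of printing them for imperfect HTML.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($contents);
libxml_clear_errors();

$headings = array();
foreach (array('h1', 'h2', 'h3', 'h4', 'h5', 'h6') as $tag) {
    foreach ($doc->getElementsByTagName($tag) as $node) {
        $headings[] = array(
            'tag'  => $node->nodeName,       // e.g. "h1"
            'text' => $node->textContent,    // text with nested tags stripped
            'html' => $doc->saveHTML($node), // the element including nested tags
        );
    }
}

print_r($headings);
```

Attributes and nested tags such as `<strong>` need no special handling here, because the parser deals with them before the headings are looked up.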

Answers

1
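preg_match_all('/<h\d>/', $contents, $matches); 

// $matches[0] holds the full matches, e.g. "<h3>"; the level digit sits at offset 2. 
$num = array(); 
foreach($matches[0] as $match){ 
$num[] = substr($match, 2, 1); 
} 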
0

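When you use (.*) you match everything. If the content is only letters, digits and whitespace, perhaps you can use a character class and match one or more of those characters: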

preg_match_all('/<h[1-6]>([\w\d\s]+)<\/h[1-6]>/si', $contents, $match); 
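As a quick check of what this character class accepts, the sketch below runs the pattern against three made-up headings (the sample strings and variable names are assumptions for illustration). It suggests the pattern only helps when the heading holds plain words, digits and whitespace. Nested tags or punctuation make the match fail.

```php
<?php
$pattern = '/<h[1-6]>([\w\d\s]+)<\/h[1-6]>/si';

// Hypothetical samples.
$plain  = '<h2>Plain text 123</h2>';
$nested = '<h2>Some <b>bold</b> text</h2>';
$punct  = '<h2>Hello, world!</h2>';

$plainResult  = preg_match($pattern, $plain, $m); // matches; $m[1] holds the heading text
$nestedResult = preg_match($pattern, $nested);    // no match: "<" is not in the class
$punctResult  = preg_match($pattern, $punct);     // no match: "," is not in the class

var_dump($plainResult, $m[1], $nestedResult, $punctResult);
```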
0

Now, no regex expert here, but if I were in your shoes, I would do it like this:

<?php 

     // SIMULATED SAMPLE HTML CONTENT - WITH ATTRIBUTES: 
     $contents = '<section id="id-1">And even when darkness covers your path and no one is there to lend a hand; 
      <h3 class="class-1">Always remember that <em>There is always light at the end of the Tunnel <span class="class-2">if you can but hang on to your Faith!</span></em></h3> 
      <div>Now; let no one deceive you: <h2 class="class-2">You will be tried in ever ways - sometimes beyond your limits...</h2></div> 
      <article>But hang on because You are the Voice... You are the Light and you shall rule your Destiny because it is all about<h6 class="class4">YOU - THE REAL YOU!!!</h6></article> 
      </section>'; 

     // SPLIT THE CONTENT AT EACH CLOSING </h[1-6]> TAG 
     $parts  = preg_split("%<\/h[1-6]>%si", $contents); 
     $matches = array(); 

     // LOOP THROUGH $parts AND BUNDLE APPROPRIATE ELEMENTS TO THE $matches ARRAY.  
     foreach($parts as $part){ 
      if(preg_match("%(.*|.?)(<h)([1-6])%si", $part)){ 
       $matches[] = preg_replace("%(.*|.?)(<)(h[1-6])(.*)%si", "$2$3$4$2/$3>", $part); 
      } 
     } 
     var_dump($matches); 


     // DUMPS: 
     // array (size=3) 
     //  0 => string '<h3 class="class-1">Always remember that <em>There is always light at the end of the Tunnel <span class="class-2">if you can but hang on to your Faith!</span></em></h3>' (length=168) 
     //  1 => string '<h2 class="class-2">You will be tried in ever ways - sometimes beyond your limits...</h2>' (length=89) 
     //  2 => string '<h6 class="class4">YOU - THE REAL YOU!!!</h6>' (length=45) 

As a function, this is what it boils down to:

<?php 

     function pseudoMatchHTags($htmlContentWithHTags){ 
      $parts  = preg_split("%<\/h[1-6]>%si", $htmlContentWithHTags); 
      $matches = array(); 
      foreach($parts as $part){ 
       if(preg_match("%(.*|.?)(<h)([1-6])%si", $part)){ 
        $matches[] = preg_replace("%(.*|.?)(<)(h[1-6])(.*)%si", "$2$3$4$2/$3>", $part); 
       } 
      } 
      return $matches; 
     } 

     var_dump(pseudoMatchHTags($contents)); 

You can test it here: https://eval.in/571312 ... maybe it helps a bit ... I hope ... ;-)