2009-10-05 141 views
2

Preg_match_all <a href

Hello, I want to extract links such as <a href="/portal/clients/show/entityId/2121" >, and I need a regex that gives me /portal/clients/show/entityId/2121. The trailing number (2121 here) is different in other links. Any ideas?

+0

Do you want to use a regex to extract '2121' from '/portal/clients/show/entityId/2121'? – halocursed 2009-10-05 12:11:00

+0

No, I want to extract '/portal/clients/show/entityId/2121'. Other links can have a different number instead of 2121. Any ideas? – streetparade 2009-10-05 12:13:19

Answers

0

A regex for parsing links looks something like this:

'/<a\s+(?:[^"\'>]+|"[^"]*"|\'[^\']*\')*href=("[^"]+"|\'[^\']+\'|[^<>\s]+)/i'

Given how horrible that is, I would recommend using Simple HTML Dom to at least get the links. You can then check each link's href with a very basic regex.
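The second step, checking each href with a basic regex, could look roughly like the sketch below. The `$hrefs` array is hypothetical sample data standing in for whatever the DOM parser returns, and the pattern assumes the wanted links always have the form /portal/clients/show/entityId/ followed by digits.

```php
<?php
// Hypothetical sample data: href values as a DOM parser might return them.
$hrefs = [
    '/portal/clients/show/entityId/2121',
    '/portal/clients/list',
    '/portal/clients/show/entityId/4636',
    'http://www.example.com/',
];

// Assumed link shape: the fixed path followed by one or more digits.
$pattern = '#^/portal/clients/show/entityId/\d+$#';

// preg_grep() keeps the entries that match; array_values() reindexes from 0.
$entityLinks = array_values(preg_grep($pattern, $hrefs));

print_r($entityLinks);
```

Using `#` as the delimiter avoids having to escape every `/` in the path.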

+0

@streetparade You may want to avoid including the quotes around the attribute value in the captured value, so adjust the capturing part of the regex accordingly: '/<a\s+(?:[^"\'>]+|"[^"]*"|\'[^\']*\')*href="([^"]+)"|\'[^\']+\'|[^<>\s]+/i' – 2014-08-28 16:56:32
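One possible way to capture the value without its quotes for all three quoting styles is a PCRE branch-reset group, `(?|...)`, which numbers the capture in every alternative as group 1. This is only a sketch: it is a variation on the pattern above rather than the commenter's exact expression, and the sample HTML is made up.

```php
<?php
// Made-up sample markup covering double-quoted, single-quoted and unquoted hrefs.
$html = '<a href="/portal/clients/show/entityId/2121" >A</a>'
      . "<a class='x' href='/portal/clients/show/entityId/4636'>B</a>"
      . '<a href=/portal/clients/show/entityId/7>C</a>';

// (?|...) is a branch-reset group: whichever alternative matches,
// the value without its quotes ends up in capture group 1.
$pattern = '~<a\s+(?:[^"\'>]+|"[^"]*"|\'[^\']*\')*href=(?|"([^"]+)"|\'([^\']+)\'|([^<>\s]+))~i';

preg_match_all($pattern, $html, $matches);

// $matches[1] holds the bare href values.
print_r($matches[1]);
```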

9

Simple PHP HTML DOM Parser example:

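// Requires the Simple HTML DOM library; adjust the path to where it is installed
include 'simple_html_dom.php';

// Create DOM from string 
$html = str_get_html($links); 

// or from a URL (the scheme is required)
$html = file_get_html('http://www.example.com/'); 

// Print the href of every <a> element
foreach($html->find('a') as $link) { 
    echo $link->href . '<br />'; 
} 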
+0

This gives the result " – streetparade 2009-10-05 12:26:21

+0

But I just want to extract /portal/clients/show/entityId/4636, so this works: '/<a\s+(?:[^"\'>]+|"[^"]*"|\'[^\']*\')*href=("[^"]+"|\'[^\']+\'|[^<>\s]+)/i' – streetparade 2009-10-05 12:26:57

+0

@streetparade My bad, I forgot to write $link->href. Edited. – karim79 2009-10-05 12:30:13

4

Don't use regular expressions for processing XML/HTML. This can be done very easily with the built-in DOM parser:

$doc = new DOMDocument(); 
$doc->loadHTML($htmlAsString); 
$xpath = new DOMXPath($doc); 
$nodeList = $xpath->query('//a/@href'); 
for ($i = 0; $i < $nodeList->length; $i++) { 
    # Xpath query for attributes gives a NodeList containing DOMAttr objects. 
    # http://php.net/manual/en/class.domattr.php 
    echo $nodeList->item($i)->value . "<br/>\n"; 
} 
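To return only the links the question asks about, the XPath query can be narrowed with `starts-with()`. The following is a rough sketch, assuming the DOM extension is available; `extractLinksByPrefix()` and its parameters are hypothetical names, and the sample HTML is made up.

```php
<?php
/**
 * Return the href of every <a> element whose href starts with $prefix.
 * The function name and parameters are hypothetical, chosen for this sketch.
 */
function extractLinksByPrefix(string $html, string $prefix): array
{
    // Collect libxml warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // starts-with() is an XPath 1.0 function.
    // Assumption: $prefix contains no double quote, since it is embedded as-is.
    $nodes = $xpath->query('//a[starts-with(@href, "' . $prefix . '")]/@href');

    $links = [];
    foreach ($nodes as $attr) {
        $links[] = $attr->value; // DOMAttr::$value is the attribute text
    }
    return $links;
}

$html = '<p><a href="/portal/clients/show/entityId/2121" >A</a>'
      . '<a href="/other/page">B</a>'
      . '<a href="/portal/clients/show/entityId/4636">C</a></p>';

print_r(extractLinksByPrefix($html, '/portal/clients/show/entityId/'));
```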
0

Here is my solution:

<?php 
// get links 
$website = file_get_contents("http://www.example.com"); // download contents of www.example.com 
// capture the value of every double-quoted href; \x22 = " 
// explicit / delimiters are used so that < and > are not taken as the pattern delimiters 
preg_match_all('/<a\s+href=\x22(.+?)\x22/i', $website, $matches); 

// $matches[1] already contains only the captured href values, 
// so no further cleanup with str_replace is needed 

// output all matches 
print_r($matches[1]); 
?> 

I would suggest avoiding XML-based parsers, because you won't always know whether the document/website is well formed.

Good luck