2012-03-03 52 views

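Preg-replace - Replace all URLs except a domain and its subdomains

I have a Glype proxy, and I don't want it to parse (rewrite) external URLs. All URLs on a page are automatically converted to: http://proxy.com/browse.php?u=[URL HERE]. For example, if I visit The Pirate Bay through my proxy, I don't want the following URLs to be rewritten: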

ByteLove.com (Not to: http://proxy.com/browse.php?u=http://bytelove.com&b=0) 
BayFiles.com (Not to: http://proxy.com/browse.php?u=http://bayfiles.com&b=0) 
BayIMG.com (Not to: http://proxy.com/browse.php?u=http://bayimg.com&b=0) 
PasteBay.com (Not to: http://proxy.com/browse.php?u=http://pastebay.com&b=0) 
Ipredator.se (Not to: http://proxy.com/browse.php?u=https://ipredator.se&b=0) 
etc. 

Of course, I want internal URLs to stay proxied, so:

thepiratebay.se/browse (To: http://proxy.com/browse.php?u=http://thepiratebay.se/browse&b=0) 
thepiratebay.se/top (To: http://proxy.com/browse.php?u=http://thepiratebay.se/top&b=0) 
thepiratebay.se/recent (To: http://proxy.com/browse.php?u=http://thepiratebay.se/recent&b=0) 
etc. 

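Is there a preg_replace call that replaces all URLs except thepiratebay.se and its subdomains (as in the example)? Another function is also welcome, such as DOMDocument, QueryPath, substr, or strpos. Not str_replace, though, because then I would have to define every URL myself.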

I found something, but I'm not familiar with preg_replace:

$exclude = '.thepiratebay.se'; 
$pattern = '(https?\:\/\/.*?\..*?)(?=\s|$)'; 
$message= preg_replace("~(($exclude)?($pattern))~i", '$2<a href="$4" target="_blank">$5</a>$6', $message); 

Answers

1

I guess you will need to provide a whitelist that says which domains should be proxied:

$whitelist = array(); 
$whitelist[] = "internal1.se"; 
$whitelist[] = "internal2.no"; 
$whitelist[] = "internal3.com"; 
// and so on... 

$string = '<a href="http://proxy.org/browse.php?u=http%3A%2F%2Fexternal1.com&b=0">External link 1</a><br>'; 
$string .= '<a href="http://proxy.org/browse.php?u=http%3A%2F%2Finternal1.se&b=0">Internal link 1</a><br>'; 
$string .= '<a href="http://proxy.org/browse.php?u=http%3A%2F%2Finternal3.com&b=0">Internal link 2</a><br>'; 
$string .= '<a href="http://proxy.org/browse.php?u=http%3A%2F%2Fexternal2.no&b=0">External link 2</a><br>'; 

// Assuming the URL is always inside '' or "", you can use this pattern.
// Group 2 is the URL-encoded target; group 3 is the trailing parameter (e.g. &b=0).
$pattern = '#(https?://proxy\.org/browse\.php\?u=(https?[^&"\']*)(&?[^&"\']*))#i'; 

$string = preg_replace_callback($pattern, "my_callback", $string); 

// I had only PHP 5.2 on my server, so I decided to use a named callback function. 
function my_callback($match) { 
    global $whitelist; 
    // Default: bypass the proxy by returning the decoded target URL. 
    $returnstring = urldecode($match[2]); 

    foreach ($whitelist as $white) { 
        // If the target contains a whitelisted domain, keep the proxied URL. 
        // Compare against false: stripos() returns 0 for a match at the start. 
        if (stripos($match[2], $white) !== false) { 
            $returnstring = $match[0]; 
            break; 
        } 
    } 
    return $returnstring; 
} 

echo "NEW STRING[:\n" . $string . "\n]\n"; 
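
The substring test in the callback would also treat a host such as notinternal1.se as internal. A minimal sketch of a stricter check, assuming the proxied target always sits in the u parameter as in the sample string: compare the host returned by parse_url() with each whitelisted domain and accept the domain itself or any subdomain. The names is_whitelisted_host() and unproxy_external() are hypothetical and not part of Glype, and the closure needs PHP 5.3 or later.

```php
<?php
// Hypothetical helper: true if $host equals a whitelisted domain or is a subdomain of one.
function is_whitelisted_host($host, array $whitelist) {
    $host = strtolower($host);
    foreach ($whitelist as $domain) {
        $domain = strtolower($domain);
        $suffix = '.' . $domain;
        if ($host === $domain || substr($host, -strlen($suffix)) === $suffix) {
            return true;
        }
    }
    return false;
}

// Hypothetical wrapper: un-proxy every link whose target host is not whitelisted.
function unproxy_external($html, array $whitelist) {
    // Assumed URL shape, taken from the sample string: browse.php?u=<encoded URL>&b=0
    $pattern = '#https?://proxy\.org/browse\.php\?u=(https?[^&"\']*)(?:&[^&"\']*)?#i';
    return preg_replace_callback($pattern, function ($m) use ($whitelist) {
        $target = urldecode($m[1]);
        $host = parse_url($target, PHP_URL_HOST);
        if (is_string($host) && is_whitelisted_host($host, $whitelist)) {
            return $m[0]; // internal: keep the proxied URL
        }
        return $target;   // external: link straight to the target
    }, $html);
}

$html = '<a href="http://proxy.org/browse.php?u=http%3A%2F%2Fstatic.internal1.se%2Fx&b=0">in</a>'
      . '<a href="http://proxy.org/browse.php?u=http%3A%2F%2Fnotinternal1.se&b=0">out</a>';
$result = unproxy_external($html, array('internal1.se'));
echo $result, "\n";
```

Here the first link stays proxied because static.internal1.se is a subdomain of internal1.se, while the second is rewritten to the bare target because notinternal1.se only shares a suffix of characters, not a domain boundary.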

It doesn't work. This is my code: http://pastebin.com/6ML8q7JN. The URLs are in $document. – 2012-03-03 18:03:09


I need to see the contents of the $document variable to evaluate whether the code can work. – 2012-03-03 18:11:42


It works now, but _&b=0_ is left behind after the URL. How can I fix this? – 2012-03-04 15:55:41
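
The thread does not show why &b=0 survived. One way to make the match more robust, assuming the links in $document may be HTML-escaped (&amp;b=0) or carry more than one trailing parameter, is to consume every parameter up to the closing quote. This is only a sketch: strip_proxy_prefix() and the sample markup are hypothetical, and for brevity it un-proxies every link without consulting a whitelist.

```php
<?php
// Hypothetical helper: replace each proxied URL with its decoded target,
// dropping every trailing parameter (&b=0, &amp;b=0, ...) up to the closing quote.
function strip_proxy_prefix($html) {
    $pattern = '#https?://proxy\.org/browse\.php\?u=(https?[^&"\']*)(?:&(?:amp;)?[^&"\']*)*#i';
    return preg_replace_callback($pattern, function ($m) {
        return urldecode($m[1]);
    }, $html);
}

$html = '<a href="http://proxy.org/browse.php?u=http%3A%2F%2Fexternal1.com&amp;b=0&amp;f=1">x</a>';
$clean = strip_proxy_prefix($html);
echo $clean, "\n";
```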

1

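You can use preg_replace_callback() to run a callback function for each match. Inside that function, you can decide whether the matched string should be converted.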

<?php 
$string = 'http://foobar.com/baz and http://example.org/bumm'; 
$pattern = '#(https?\:\/\/.*?\..*?)(?=\s|$)#i'; 
$string = preg_replace_callback($pattern, function($match) { 
    if (stripos($match[0], 'example.org/') !== false) { 
        // exclude all URLs containing example.org 
        return $match[0]; 
    } else { 
        return 'http://proxy.com/?u=' . urlencode($match[0]); 
    } 
}, $string); 

echo $string, "\n"; 

(The example uses PHP 5.3 closure syntax.)
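
The question also welcomes DOMDocument. A minimal sketch of that route, assuming the page is available as an HTML string before Glype rewrites it and that only `<a href>` attributes matter. The function proxify_internal_links() and the sample markup are hypothetical, and the proxy URL shape is copied from the question. Relative links have no host, so this sketch leaves them unchanged.

```php
<?php
// Hypothetical helper: proxy links to $domain and its subdomains, leave all others untouched.
function proxify_internal_links($html, $domain) {
    $doc = new DOMDocument();
    // Suppress warnings from imperfect real-world markup.
    @$doc->loadHTML($html);
    $suffix = '.' . $domain;
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        $host = strtolower((string) parse_url($href, PHP_URL_HOST));
        // Internal means the domain itself or any subdomain of it.
        $internal = $host === $domain || substr($host, -strlen($suffix)) === $suffix;
        if ($internal) {
            $a->setAttribute('href', 'http://proxy.com/browse.php?u=' . urlencode($href) . '&b=0');
        }
    }
    return $doc;
}

$doc = proxify_internal_links(
    '<a href="http://thepiratebay.se/top">Top</a><a href="http://bayimg.com">BayIMG</a>',
    'thepiratebay.se'
);
$links = $doc->getElementsByTagName('a');
echo $links->item(0)->getAttribute('href'), "\n";
echo $links->item(1)->getAttribute('href'), "\n";
```

Calling `$doc->saveHTML()` afterwards would give the rewritten markup. Working on parsed attributes avoids the quoting assumptions that the regular-expression patterns have to make.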