2011-01-19 76 views
5

NSRegularExpression to validate URL

I found this regular expression on a website. It is said to be the best URL validation expression, and I agree. Diego Perini created it.

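The problem I'm facing comes up when I try to use it with Objective-C to detect URLs in a string. I have tried options such as NSRegularExpressionAnchorsMatchLines and NSRegularExpressionIgnoreMetacharacters, among others, but still no luck.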

Is the expression not formatted correctly for Objective-C? Am I missing something? Any ideas?

I have also tried John Gruber's regular expression, but it fails on some invalid URLs.

 Regular Expression         Explanation of expression      

^             match at the beginning 
//Protocol identifier 
(?: 
    (?:https?|ftp         http, https or ftp 
    ):\\/\\/          :// 
)?             optional 
// User:Pass authentication 
(?: 
    ^\\s+           non white spaces, 1 or more times 
    (?: 
     :^\\s*          : non white spaces, 0 or more times, optionally 
    )?@            @ 
)?             optional 
//Private IP Addresses        ?! Means DO NOT MATCH ahead. So do not match any of the following 
(?: 
    (?!10           10               10.0.0.0 - 10.999.999.999 
     (?: 
      \\.\\d{1,3}        . 1 to 3 digits, three times 
     ){3} 
    ) 
    (?!127           127               127.0.0.0 - 127.999.999.999 
     (?: 
      \\.\\d{1,3}        . 1 to 3 digits, three times 
     ){3} 
    ) 
    (?!169\\.254         169.254              169.254.0.0 - 169.254.999.999 
     (?: 
      \\.\\d{1,3}        . 1 to 3 digits, two times 
     ){2} 
    ) 
    (?!192\\.168         192.168              192.168.0.0 - 192.168.999.999 
     (?: 
      \\.\\d{1,3}        . 1 to 3 digits, two times 
     ){2} 
    ) 
    (?!172\\.          172.              172.16.0.0 - 172.31.999.999 
     (?:                            
      1[6-9]         1 followed by any number between 6 and 9 
      |          or 
      2\\d         2 and any digit 
      |          or 
      3[0-1]         3 followed by a 0 or 1 
     ) 
     (?: 
      \\.\\d{1,3}        . 1 to 3 digits, two times 
     ){2} 
    ) 
    //First Octet IPv4        // match these. Any non network or broadcast IPv4 address 
    (?: 
     [1-9]\\d?         any number from 1 to 9 followed by an optional digit  1 - 99 
     |           or 
     1\\d\\d          1 followed by any two digits        100 - 199 
     |           or 
     2[01]\\d         2 followed by any 0 or 1, followed by a digit    200 - 219 
     |           or 
     22[0-3]          22 followed by any number between 0 and 3     220 - 223 
    ) 
    //Second and Third Octet IPv4 
    (?: 
     \\.           . 
     (?: 
      1?\\d{1,2}        optional 1 followed by any 1 or two digits     0 - 199 
      |          or 
      2[0-4]\\d        2 followed by any number between 0 and 4, and any digit  200 - 249 
      |          or 
      25[0-5]         25 followed by any numbers between 0 and 5     250 - 255 
     ) 
    ){2}           two times 
    //Fourth Octet IPv4 
    (?: 
     \\.           . 
     (?: 
      [1-9]\\d?        any number between 1 and 9 followed by an optional digit 1 - 99 
      |          or 
      1\\d\\d         1 followed by any two digits        100 - 199 
      |          or 
      2[0-4]\\d        2 followed by any number between 0 and 4, and any digit  200 - 249 
      |          or 
      25[0-4]         25 followed by any number between 0 and 4     250 - 254 
     ) 
    ) 
    //Host name 
    |            or     
    (?: 
     (?: 
      [a-z\u00a1-\uffff0-9]+-?    any letter, digit or character one or more times with optional - 
     )*           zero or more times 
     [a-z\u00a1-\uffff0-9]+      any letter, digit or character one or more times 
    ) 
    //Domain name 
    (?: 
     \\.           . 
     (?: 
      [a-z\u00a1-\uffff0-9]+-?    any letter, digit or character one or more times with optional - 
     )*           zero or more times 
     [a-z\u00a1-\uffff0-9]+      any letter, digit or character one or more times 
    )*            zero or more times 
    //TLD identifier 
    (?: 
     \\.           . 
     (?: 
      [a-z\u00a1-\uffff]{2,}     any letter or character, two or more times 
     ) 
    ) 
) 
//Port number 
(?: 
    :\\d{2,5}          : followed by any digit, two to five times, optionally 
)?    
//Resource path 
(?: 
    \\/[^\\s]*         /followed by an optional non space character, zero or more times 
)?             optional 
$             match at the end 

EDIT: I think I forgot to say that I am currently using the expression with the following code (partial code):

NSError *error = NULL; 
NSRegularExpression *detector = [NSRegularExpression regularExpressionWithPattern:[self theRegularExpression] options:0 error:&error]; 
NSArray *links = [detector matchesInString:theText options:0 range:NSMakeRange(0, theText.length)]; 

Answers

9
^(?i)(?:(?:https?|ftp):\\/\\/)?(?:\\S+(?::\\S*)?@)?(?:(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[1-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|(?:(?:[a-z\\u00a1-\\uffff0-9]+-?)*[a-z\\u00a1-\\uffff0-9]+)(?:\\.(?:[a-z\\u00a1-\\uffff0-9]+-?)*[a-z\\u00a1-\\uffff0-9]+)*(?:\\.(?:[a-z\\u00a1-\\uffff]{2,})))(?::\\d{2,5})?(?:\\/[^\\s]*)?$ 

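is the best URL validation regular expression I have found, and it is the one explained in my question. It is already formatted to work in Objective-C. However, using it with NSRegularExpression gave me all sorts of problems, including crashing my app. RegexKitLite has no trouble handling it. I don't know whether it is a size limitation or some flag that isn't being set. My final code looks like this: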

//First I take the string and put every word in an array, then I match every word with the regular expression 
NSArray *splitIntoWordsArray = [textToMatch componentsSeparatedByCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]]; 
NSMutableString *htmlString = [NSMutableString stringWithString:textToMatch]; 
for (NSString *theText in splitIntoWordsArray){ 
    NSEnumerator *matchEnumerator = [theText matchEnumeratorWithRegex:theRegularExpressionString]; 
    for (NSString *temp in matchEnumerator){ 
     [htmlString replaceOccurrencesOfString:temp withString:[NSString stringWithFormat:@"<a href=\"%@\">%@</a>", temp, temp] options:NSLiteralSearch range:NSMakeRange(0, [htmlString length])]; 
    } 
} 
[htmlString replaceOccurrencesOfString:@"\n" withString:@"<br />" options:NSLiteralSearch range:NSMakeRange(0, htmlString.length)]; 
//embed the text on a webView as HTML 
[webView loadHTMLString:[NSString stringWithFormat:embedHTML, [mainFont fontName], [mainFont pointSize], htmlString] baseURL:nil]; 

The result is a UIWebView with some embedded HTML in which the URLs and e-mail addresses are clickable. Don't forget to set dataDetectorTypes = UIDataDetectorTypeNone.

You can also try:

NSError *error = nil; 
NSRegularExpression *expression = [NSRegularExpression regularExpressionWithPattern:@"(?i)(?:(?:https?):\\/\\/)?(?:\\S+(?::\\S*)?@)?(?:(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[1-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|(?:(?:[a-z\\u00a1-\\uffff0-9]+-?)*[a-z\\u00a1-\\uffff0-9]+)(?:\\.(?:[a-z\\u00a1-\\uffff0-9]+-?)*[a-z\\u00a1-\\uffff0-9]+)*(?:\\.(?:[a-z\\u00a1-\\uffff]{2,})))(?::\\d{2,5})?(?:\\/[^\\s]*)?" options:NSRegularExpressionCaseInsensitive error:&error]; 
// A nil result (not a non-nil error) is what signals that the pattern failed to compile 
if (expression == nil) 
    NSLog(@"error: %@", error); 
NSString *someString = @"This is a sample of a sentence with a URL http://. http://.. http://../ http://? http://?? http://??/ http://# http://-error-.invalid/ http://-.~_!$&'()*+,;=:%40:80%2f::::::@example.com within it."; 
// options takes NSMatchingOptions; NSMatchingCompleted is an NSMatchingFlags value, so pass 0 here 
NSRange range = [expression rangeOfFirstMatchInString:someString options:0 range:NSMakeRange(0, [someString length])]; 
if (!NSEqualRanges(range, NSMakeRange(NSNotFound, 0))){ 
    NSString *match = [someString substringWithRange:range]; 
    NSLog(@"%@", match); 
} 
else { 
    NSLog(@"no match"); 
} 
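
Note that this second pattern drops the leading ^ and trailing $ of the first one. With the default options those anchors only match at the very start and end of the string, so an anchored pattern cannot find a URL in the middle of a sentence. Here is a minimal sketch of the difference. It uses a deliberately simplified stand-in pattern (https?://\S+), not the full expression, and FirstMatch is a hypothetical helper name:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the first substring of `text` matched by
// `pattern`, or nil when the pattern is invalid or nothing matches.
static NSString *FirstMatch(NSString *pattern, NSString *text) {
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return nil;
    }
    NSRange range = [regex rangeOfFirstMatchInString:text
                                             options:0
                                               range:NSMakeRange(0, text.length)];
    // rangeOfFirstMatchInString: returns {NSNotFound, 0} when there is no match
    return (range.location == NSNotFound) ? nil : [text substringWithRange:range];
}

// Usage (simplified stand-in pattern, for illustration only):
//   FirstMatch(@"^https?://\\S+$", @"see http://example.com now")  -> nil
//   FirstMatch(@"https?://\\S+",   @"see http://example.com now")  -> @"http://example.com"
```

This is also why the first approach above splits the text into words before matching: each word is then tested against the anchored pattern on its own.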

Hope it helps someone in the future.

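That regular expression sometimes caused the app to hang, so I decided to use Gruber's regular expression instead, modified to recognize URLs without the protocol or www part: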

(?i)\\b((?:[a-z][\\w-]+:(?:/{1,3}|[a-z0-9%])|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/?)(?:[^\\s()<>]+|\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\))*(?:\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\)|[^\\s`!()\\[\\]{};:'\".,<>?«»“”‘’])*) 
7

Am I missing something?

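You're missing the built-in thing that does this for you. There's a handy object called NSDataDetector. You create it to look for certain "types" of data (for example, NSTextCheckingTypeLink), and then ask it for its -matchesInString:options:range:.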

Here's an earlier answer of mine showing how to use it
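
For reference, a minimal sketch of that approach might look like the following. The helper name LinksInString is hypothetical, and this is not necessarily the code from the linked answer:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: collects the URLs that NSDataDetector finds in `text`.
static NSArray<NSURL *> *LinksInString(NSString *text) {
    NSError *error = nil;
    NSDataDetector *detector =
        [NSDataDetector dataDetectorWithTypes:NSTextCheckingTypeLink error:&error];
    if (detector == nil) {
        NSLog(@"Could not create detector: %@", error);
        return @[];
    }

    NSMutableArray<NSURL *> *urls = [NSMutableArray array];
    NSArray<NSTextCheckingResult *> *matches =
        [detector matchesInString:text options:0 range:NSMakeRange(0, text.length)];
    for (NSTextCheckingResult *match in matches) {
        // Link results carry the detected URL in the URL property
        if (match.resultType == NSTextCheckingTypeLink && match.URL != nil) {
            [urls addObject:match.URL];
        }
    }
    return urls;
}

// Usage:
//   NSArray<NSURL *> *links = LinksInString(@"Visit http://example.com today");
```

NSDataDetector is a subclass of NSRegularExpression, so the matching methods are the same ones used with a hand-written pattern. What it recognizes as a link is decided by the system rather than by you.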

+0

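Thanks, Dave, for your quick answer. I had tried that, but it doesn't recognize some URLs, for example .asia, .info, etc. That happens when the URL is not well formed like http://healthyhomes.asia, which is why I'm using a regular expression. In an online tester, the expression detects healthhomes.asia or info.info without the protocol part. – GianPac 2011-01-19 18:43:02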

+0

@Dave DeLong www.google.c – JAHelia 2016-07-27 11:04:50