2017-05-05 91 views
-4

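Regular expression to match an anchor tag and its href

I want to run a regular expression over an HTML string that contains multiple anchor tags, and build a dictionary that maps each link's text to its href URL.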

<p>This is a simple text with some embedded <a href="http://example.com/link/to/some/page?param1=77&param2=22">links</a>. This is a <a href="https://exmp.le/sample-page/?uu=1">different link</a>.

How can I extract the text and the href of each <a> tag in one go?

Edit:

func extractLinks(html: String) -> Dictionary<String, String>? { 

    do { 
     let regex = try NSRegularExpression(pattern: "/<([a-z]*)\b[^>]*>(.*?)</\1>/i", options: []) 
     let nsString = html as NSString 
     let results = regex.matchesInString(html, options: [], range: NSMakeRange(0, nsString.length)) 
     return results.map { nsString.substringWithRange($0.range)} 
    } catch let error as NSError { 
     print("invalid regex: \(error.localizedDescription)") 
     return nil 
    } 
} 
+1

Where is your regular expression code? – matt

+0

@matt: They are waiting for you to write it. –

+0

It is very bad. – Rao

Answers

1

First, you need to learn the basic syntax of an NSRegularExpression pattern:

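  • The pattern does not include delimiters
  • The pattern does not include modifiers; you need to pass that information through options
  • When you want to use the metacharacter \, you need to escape it as \\ in a Swift string literal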

So the line that creates the NSRegularExpression instance should look like this:

let regex = try NSRegularExpression(pattern: "<([a-z]*)\\b[^>]*>(.*?)</\\1>", options: .caseInsensitive) 
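To illustrate the first two points, here is a minimal sketch that compares a PHP/JavaScript-style delimited pattern with the same pattern written the NSRegularExpression way. The sample string and the variable names are assumptions made up for this illustration; the slashes and the trailing i in the first pattern are treated as literal characters, so it finds nothing.

```swift
import Foundation

// Delimited form: "/" and the trailing "i" are matched literally, not treated as syntax.
let delimited = try! NSRegularExpression(pattern: "/<a\\b[^>]*>/i", options: [])
// Same pattern without delimiters; case-insensitivity is passed through `options`.
let plain = try! NSRegularExpression(pattern: "<a\\b[^>]*>", options: .caseInsensitive)

// Hypothetical sample input with an upper-case tag.
let sample = "<A HREF=\"https://exmp.le/\">x</A>"
let fullRange = NSRange(location: 0, length: sample.utf16.count)

let delimitedCount = delimited.numberOfMatches(in: sample, options: [], range: fullRange)
let plainCount = plain.numberOfMatches(in: sample, options: [], range: fullRange)
print(delimitedCount, plainCount)
// prints: 0 1
```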

But, as you may already know, your pattern does not contain anything that matches href or captures its value.

Something like this works for your example html:

import Foundation

// Group 1 captures the quoted href value; group 2 captures the text between <a ...> and </a>.
let pattern = "<a\\b[^>]*\\bhref\\s*=\\s*(\"[^\"]*\"|'[^']*')[^>]*>((?:(?!</a).)*)</a\\s*>"
let regex = try! NSRegularExpression(pattern: pattern, options: .caseInsensitive)
let html = "<p>This is a simple text with some embedded <a\n" +
    "href=\"http://example.com/link/to/some/page?param1=77&param2=22\">links</a>.\n" +
    "This is a <a href=\"https://exmp.le/sample-page/?uu=1\">different link</a>."
let matches = regex.matches(in: html, options: [], range: NSRange(0..<html.utf16.count))
var resultDict: [String: String] = [:]
for match in matches {
    // Skip the opening and closing quote characters around the href value.
    // (`range(at:)` is the Swift 4+ name of Swift 3's `rangeAt(_:)`.)
    let hrefRange = NSRange(location: match.range(at: 1).location+1, length: match.range(at: 1).length-2)
    let innerTextRange = match.range(at: 2)
    let href = (html as NSString).substring(with: hrefRange)
    let innerText = (html as NSString).substring(with: innerTextRange)
    resultDict[innerText] = href
}
print(resultDict)
//->["different link": "https://exmp.le/sample-page/?uu=1", "links": "http://example.com/link/to/some/page?param1=77&param2=22"] (key order may vary)
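For reference, the same pattern could be wrapped in a function shaped like the extractLinks(html:) from the question. This is only a sketch under assumptions: it targets Swift 4 or later (it uses range(at:) and the Range(_:in:) conversion), and the sample string at the bottom is made up for the demonstration.

```swift
import Foundation

// Sketch: returns [link text: href], or nil if the pattern fails to compile.
func extractLinks(html: String) -> [String: String]? {
    let pattern = "<a\\b[^>]*\\bhref\\s*=\\s*(\"[^\"]*\"|'[^']*')[^>]*>((?:(?!</a).)*)</a\\s*>"
    do {
        let regex = try NSRegularExpression(pattern: pattern, options: .caseInsensitive)
        let wholeRange = NSRange(html.startIndex..., in: html)
        var links: [String: String] = [:]
        for match in regex.matches(in: html, options: [], range: wholeRange) {
            // Convert the NSRange of each capture group into a String range.
            guard let quotedRange = Range(match.range(at: 1), in: html),
                  let textRange = Range(match.range(at: 2), in: html) else { continue }
            // Group 1 still includes the surrounding quotes, so drop them.
            let href = String(html[quotedRange].dropFirst().dropLast())
            links[String(html[textRange])] = href
        }
        return links
    } catch {
        print("invalid regex: \(error.localizedDescription)")
        return nil
    }
}

// Hypothetical usage with a single link.
let sample = "This is a <a href=\"https://exmp.le/sample-page/?uu=1\">different link</a>."
print(extractLinks(html: sample) ?? [:])
// prints: ["different link": "https://exmp.le/sample-page/?uu=1"]
```

Note that two links with identical text overwrite each other in the dictionary; an array of (text, href) pairs would avoid that if it matters.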

Remember, my pattern above may mistakenly match ill-formed a-tags or miss some nested structures, and it has no support for HTML character entities...
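As a rough illustration of the entity limitation: an href written in the HTML source as ...?param1=77&amp;param2=22 would be captured with the &amp; intact. The sketch below shows one simple way to post-process a captured value. The helper name decodeBasicEntities is hypothetical, and it handles only five common entities, which is far from a complete decoder.

```swift
import Foundation

// Hypothetical helper: decodes five common entities and leaves everything else untouched.
func decodeBasicEntities(_ text: String) -> String {
    // "&amp;" is handled last so that "&amp;lt;" becomes "&lt;" rather than "<".
    let replacements = [("&lt;", "<"), ("&gt;", ">"), ("&quot;", "\""), ("&#39;", "'"), ("&amp;", "&")]
    var result = text
    for (entity, character) in replacements {
        result = result.replacingOccurrences(of: entity, with: character)
    }
    return result
}

print(decodeBasicEntities("page?param1=77&amp;param2=22"))
// prints: page?param1=77&param2=22
```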

If you want to make your code more robust and general, you should consider using an HTML parser, as ColGraff and Rob suggested.