Finding anchor text when there are tags there

Here is the re string I'm using to find the content:

r'''(<a([^<>]*)href=("|')(http://)?(www\.)?%s([^'"]*)("|')([^<>]*)>([^<]*))</a>''' % our_url

The result will be something like this:

r'''(<a([^<>]*)href=("|')(http://)?(www\.)?stackoverflow.com([^'"]*)("|')([^<>]*)>([^<]*))</a>'''

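This works well for most links, but it fails on links that contain tags inside the anchor text. I tried changing the final part of the regex from:

([^<]*))</a>'''

to: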

(.*))</a>'''

But that matched the link plus everything after it on the page, which is not what I want. Any suggestions on how I can solve this?

2009-03-02 Teifion

Instead of:

[^<>]*

Try:

((?!</a).)*

In other words, match any character that is not the start of a </a sequence.

2009-03-02 17:37:13 MarkusQ

Thanks very much for the help :) – Teifion 2009-03-02 17:45:18
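The negative-lookahead suggestion can be tried out with a short sketch. It assumes the substitution is made in the final group of the question's pattern (the one that captures the anchor text); the sample HTML, the named group `text`, and the use of `re.escape` and `re.DOTALL` are illustrative additions, not part of the original posts.

```python
import re

our_url = "stackoverflow.com"  # example value taken from the question

# Same pattern as in the question, except that the last group is
# (?P<text>(?:(?!</a).)*) instead of ([^<]*), so the anchor text may
# contain nested tags. re.escape keeps the dot in the URL literal.
pattern = re.compile(
    r'''(<a([^<>]*)href=("|')(http://)?(www\.)?%s([^'"]*)("|')([^<>]*)>(?P<text>(?:(?!</a).)*))</a>'''
    % re.escape(our_url),
    re.DOTALL,  # let "." cross line breaks inside the anchor text
)

# Hypothetical sample page fragment
html = (
    '<p><a href="http://stackoverflow.com/questions">Plain text</a> and '
    '<a href="http://www.stackoverflow.com/tags">Some <b>bold</b> text</a> '
    'followed by more content.</p>'
)

anchor_texts = [m.group("text") for m in pattern.finditer(html)]
print(anchor_texts)  # ['Plain text', 'Some <b>bold</b> text']
```

One caveat: (?!</a) also stops at closing tags such as </abbr>, so (?!</a>) may be the safer lookahead if the anchor text can contain those.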

I wouldn't use a regex - use an HTML parser like Beautiful Soup.

2009-03-02 17:32:17

Seems a bit heavyweight for such a simple problem – Teifion 2009-03-02 17:37:09

Never. HTML is highly irregular - browsers have to tolerate a large number of errors. Beautiful Soup can handle irregular HTML better than a regex can. – 2009-03-02 18:04:05
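To illustrate the parser-based approach without a third-party dependency, here is a minimal sketch that uses Python's built-in html.parser instead of Beautiful Soup (a stand-in, not the library recommended above). The class name AnchorTextCollector and the sample HTML are hypothetical, and the sketch assumes every href is an absolute URL with a scheme.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class AnchorTextCollector(HTMLParser):
    """Collect the visible text of <a> elements that link to a given host."""

    def __init__(self, host):
        super().__init__()
        self.host = host
        self.inside = False   # True while inside a matching <a> element
        self.parts = []       # text fragments of the current anchor
        self.results = []     # finished anchor texts

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Assumption: href is absolute, so urlparse can extract the host.
        netloc = urlparse(href).netloc.lower()
        if netloc in (self.host, "www." + self.host):
            self.inside = True
            self.parts = []

    def handle_endtag(self, tag):
        if tag == "a" and self.inside:
            self.results.append("".join(self.parts))
            self.inside = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> arrives here as plain data.
        if self.inside:
            self.parts.append(data)


html = (
    '<a href="http://stackoverflow.com/questions">Plain text</a>'
    '<a href="http://example.com/">Other site</a>'
    '<a href="http://www.stackoverflow.com/tags">Some <b>bold</b> text</a>'
)

collector = AnchorTextCollector("stackoverflow.com")
collector.feed(html)
collector.close()
print(collector.results)  # ['Plain text', 'Some bold text']
```

Unlike the regex versions, this returns the text with nested tags stripped, which is usually what "anchor text" means.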

Do a non-greedy search, i.e.

(.*?)

2009-03-02 17:32:35

It only matches up to the tag inside the anchor text – Teifion 2009-03-02 17:35:56
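The difference between the greedy and non-greedy forms can be seen in a small comparison. The simplified pattern and the sample HTML below are assumptions made for illustration; they are not the asker's full expression, so this may not reproduce the behaviour described in the comment above.

```python
import re

# Hypothetical fragment with two links; the first has a nested tag.
html = (
    '<a href="http://stackoverflow.com/a">First <b>link</b></a> middle '
    '<a href="http://stackoverflow.com/b">Second</a> tail'
)

# Greedy: (.*) runs to the last </a> on the line, swallowing both links.
greedy = re.search(r'<a[^<>]*>(.*)</a>', html).group(1)

# Non-greedy: (.*?) stops at the first </a> after each opening tag.
lazy = re.findall(r'<a[^<>]*>(.*?)</a>', html)

print(greedy)
print(lazy)  # ['First <b>link</b>', 'Second']
```

Note that "." does not match newlines unless re.DOTALL is passed, so anchor text that spans several lines needs that flag in either form.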

>>> import re
>>> # Quotes go in a character class; lazy quantifiers stop at the first closing quote and the first </a>
>>> pattern = re.compile(r'<a.+?href=[\'"](.+?)[\'"].*?>(.+?)</a>', re.IGNORECASE)
>>> link = '<a href="http://stackoverflow.com/questions/603199/finding-anchor-text-when-there-are-tags-there">Finding anchor text when there are tags there</a>'
>>> match = pattern.match(link)
>>> match.group(1)
'http://stackoverflow.com/questions/603199/finding-anchor-text-when-there-are-tags-there'
>>> match.group(2)
'Finding anchor text when there are tags there'

2009-03-03 00:13:46 riza
