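How can I use a regular expression to capture the correct repeating groups?

I have two pieces of HTML that sit far apart from each other in the same document, as shown below. Note that both lines begin with the same string. How can I use a regular expression to capture only the correct repeating groups?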

<a href="http://example.com">file 1</a><a href="http://example.com/right2">file 2</a><a href="http://example.com/right3">file 3</a> 

<a href="http://example.com">file 1</a><a href="http://example.com/left2">file 2</a><a href="http://example.com/left3">file 3</a>

I want the regex to return only the matches from the first line above, namely:

http://example.com 
http://example.com/right2 
http://example.com/right3 

file 1 
file 2 
file 3

If I use this regex

re.compile('<a href="(.+?)">(.+?)</a>').findall(html_string)

then I get this result:

http://example.com 
http://example.com/right2 
http://example.com/right3 
http://example.com 
http://example.com/left2 
http://example.com/left3 

file 1 
file 2 
file 3 
file 1 
file 2 
file 3

Please help. Thanks.

Source

2014-10-05 H123

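I'm a beginner and still have a lot to learn about regular expressions. Beautiful Soup is unfamiliar territory for me, but I'll look into it later. In the meantime, is there a quick fix to the regex above that gives me the result I want? – H123 2014-10-05 17:06:57

Why don't you try this regex: http://regex101.com/r/kE0wF3/2 ? – 2014-10-05 17:09:35

So far you haven't shown which parts of the regex are variable and which are fixed. – sln 2014-10-05 22:22:37

Save each href value, and stop as soon as a duplicate attribute value is found: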

>>> import re
>>> matches = re.findall(r'<a href="(.+?)">(.+?)</a>', html_string)
>>> seen = set()
>>> for href, text in matches:
...     if href in seen:
...         break
...     seen.add(href)
...     print('{} {}'.format(href, text))
...
http://example.com file 1 
http://example.com/right2 file 2 
http://example.com/right3 file 3
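
The loop above can be wrapped in a small function so it runs end to end. This is only a minimal sketch, not part of the original answer: the names `LINK_RE` and `links_before_repeat`, and the filler markup placed between the two groups, are assumptions made for illustration. Like the answer, it relies on the second group starting with an href that already appeared in the first group.

```python
import re

# Same pattern as in the answer: group 1 is the href, group 2 is the link text.
LINK_RE = re.compile(r'<a href="(.+?)">(.+?)</a>')


def links_before_repeat(html_string):
    """Return (href, text) pairs up to, but not including, the first repeated href."""
    seen = set()
    result = []
    for href, text in LINK_RE.findall(html_string):
        if href in seen:
            break  # a repeated href marks the start of the second group
        seen.add(href)
        result.append((href, text))
    return result


# Sample input modelled on the question; the <p> filler is made up.
html_string = (
    '<a href="http://example.com">file 1</a>'
    '<a href="http://example.com/right2">file 2</a>'
    '<a href="http://example.com/right3">file 3</a>\n'
    '<p>unrelated markup in between</p>\n'
    '<a href="http://example.com">file 1</a>'
    '<a href="http://example.com/left2">file 2</a>'
    '<a href="http://example.com/left3">file 3</a>'
)

for href, text in links_before_repeat(html_string):
    print(href, text)
```

If the first group could itself contain the same href twice, this approach would stop too early, so it only fits input shaped like the example in the question.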

Using Beautiful Soup:

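from bs4 import BeautifulSoup

# Name the parser explicitly; omitting it makes bs4 guess and emit a warning.
soup = BeautifulSoup(html_string, 'html.parser')
seen = set()
for tag in soup.select('a[href]'):  # every <a> element that has an href attribute
    if tag['href'] in seen:
        break  # first repeated href: the second group starts here
    seen.add(tag['href'])
    print('{} {}'.format(tag['href'], tag.text))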
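
If installing Beautiful Soup is not an option, the same stop-at-the-first-repeat idea can probably be expressed with the standard library's `html.parser` module. Note that this swaps in a different parser from the one the answer uses; the class name `FirstRunLinkParser` and its attributes are hypothetical, and the sample input is modelled on the question. Treat it as a rough sketch rather than a drop-in replacement.

```python
from html.parser import HTMLParser


class FirstRunLinkParser(HTMLParser):
    """Collect (href, text) pairs until an href repeats (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []      # collected (href, text) pairs
        self._seen = set()   # hrefs encountered so far
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments inside the current <a>
        self._done = False   # set once a repeated href is found

    def handle_starttag(self, tag, attrs):
        if self._done or tag != "a":
            return
        href = dict(attrs).get("href")
        if href is None:
            return
        if href in self._seen:
            self._done = True  # repeated href: ignore everything after it
            return
        self._seen.add(href)
        self._href = href
        self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link we are collecting.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


# Sample input modelled on the question.
html_string = (
    '<a href="http://example.com">file 1</a>'
    '<a href="http://example.com/right2">file 2</a>'
    '<a href="http://example.com/right3">file 3</a>\n'
    '<a href="http://example.com">file 1</a>'
    '<a href="http://example.com/left2">file 2</a>'
    '<a href="http://example.com/left3">file 3</a>'
)

parser = FirstRunLinkParser()
parser.feed(html_string)
for href, text in parser.links:
    print(href, text)
```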

Source

2014-10-05 17:02:34 falsetru
