Find and replace URLs in a block of text, return text + list of URLs

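I'm trying to find a way to take a block of text, replace all the URLs in that text with some other text, and then return the new block of text along with a list of the URLs it found. Something like: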

text = """This is some text www.google.com blah blah http://www.imgur.com/12345.jpg lol""" 
text, urls = FindURLs(text, "{{URL}}")

Should give:

text = "This is some text {{URL}} blah blah {{URL}} lol" 
urls = ["www.google.com", "http://www.imgur.com/12345.jpg"]

I know this will involve some regex - I found some seemingly good URL-detection regexes here: http://www.regexguru.com/2008/11/detecting-urls-in-a-block-of-text/

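I'm rubbish with regex, though, and I'm finding it quite tricky to get it to do what I want in Python. The order in which the URLs are returned doesn't matter.

Thanks :)

Source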

2011-10-06 combatdave

Can you try the updated regex I provided? – obsoleter

Downvoted because this question has been abandoned – obsoleter

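If for some reason you need the URLs to be in a valid format, use one of the regex recipes. Otherwise, just split() your text, loop over the list, and if a word starts with "www" or "http", handle it accordingly. Then join() the list back together.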

text = """This is some text www.google.com blah blah http://www.imgur.com/12345.jpg lol"""
s = text.split()
urls = []
# Rotate through the list once: pop each word off the front and
# append either the word itself or the placeholder to the back.
for i in range(len(s)):
    item = s.pop(0)
    if item.startswith("www") or item.startswith("http"):
        s.append("{{URL}}")
        urls.append(item)
    else:
        s.append(item)

print(" ".join(s))
print(urls)
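The same split-and-join idea can be wrapped in a function that returns both values, which is closer to the call shape in the question. This is only a sketch: the name find_urls_simple and its placeholder parameter are made up for illustration and are not part of the answer above.

```python
def find_urls_simple(text, placeholder="{{URL}}"):
    """Replace whitespace-delimited words that look like URLs.

    Returns a (new_text, urls) tuple. A word counts as a URL only if it
    starts with "www" or "http", exactly as in the loop above.
    """
    words = []
    urls = []
    for word in text.split():
        if word.startswith(("www", "http")):
            urls.append(word)
            words.append(placeholder)
        else:
            words.append(word)
    return " ".join(words), urls


text = "This is some text www.google.com blah blah http://www.imgur.com/12345.jpg lol"
new_text, urls = find_urls_simple(text)
print(new_text)
print(urls)
```

Note that split() followed by join() collapses any run of whitespace into a single space, and punctuation attached to a URL (a trailing comma, say) is kept as part of it.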

Source

2011-10-06 15:13:49 hymloth

You'll have a hard time finding a pattern that also matches the Google URL (it has no scheme), but the following works for real URLs:

>>> re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)
['http://www.imgur.com/12345.jpg']
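To get both the replaced text and the list of URLs out of this strict pattern, findall can be paired with sub on the same compiled regex. A minimal sketch follows; the name STRICT_URL is an assumption made for this example.

```python
import re

# Strict pattern: only matches URLs with an explicit http:// or https:// scheme.
STRICT_URL = re.compile(
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
)

text = "This is some text www.google.com blah blah http://www.imgur.com/12345.jpg lol"

urls = STRICT_URL.findall(text)              # collect every match
new_text = STRICT_URL.sub("{{URL}}", text)   # replace every match with the placeholder

print(new_text)
print(urls)
```

With this pattern www.google.com is left untouched in the text, because it carries no scheme.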

Source

2011-10-06 15:18:51

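The regex here should be liberal enough to catch URLs even without the http or www.

Here is some simple Python code that performs the text replacement and gives you a list of the results: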

import re 

url_regex = re.compile(r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>\[\]]+|\(([^\s()<>\[\]]+|(\([^\s()<>\[\]]+\)))*\))+(?:\(([^\s()<>\[\]]+|(\([^\s()<>\[\]]+\)))*\)|[^\s`!(){};:'".,<>?\[\]]))""") 

text = "This is some text www.google.com blah blah http://www.imgur.com/12345.jpg lol" 
matches = [] 

def process_match(m): 
    matches.append(m.group(0)) 
    return '{{URL}}' 

new_text = url_regex.sub(process_match, text) 

print(new_text)
print(matches)
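If a module-level matches list is undesirable, the callback can be a closure inside a function that returns both values, as the question asks. The sketch below is one possible arrangement. The names find_urls and demo_pattern are invented for this example, and demo_pattern is a deliberately crude stand-in to keep the example short; any compiled pattern, including the url_regex above, can be passed instead.

```python
import re


def find_urls(text, placeholder, pattern):
    """Return (new_text, urls): text with each match replaced, plus the matches."""
    urls = []

    def _collect(match):
        # Called once per match by pattern.sub(); record it, then substitute.
        urls.append(match.group(0))
        return placeholder

    new_text = pattern.sub(_collect, text)
    return new_text, urls


# Crude stand-in pattern, an assumption for this demo only.
demo_pattern = re.compile(r"(?:https?://|www\.)\S+")

text = "This is some text www.google.com blah blah http://www.imgur.com/12345.jpg lol"
new_text, urls = find_urls(text, "{{URL}}", demo_pattern)
print(new_text)
print(urls)
```

Because the placeholder is returned from a function rather than passed to sub() as a template string, backslashes in it are not interpreted as group references.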

Source

2011-10-06 15:25:22 obsoleter

I changed the regex to the following so that it doesn't accept phrases such as "Edit: hello" as URLs: """(?i)\b((?:(ftp|https?):(/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?]))""" – combatdave

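Un-accepted this answer - it doesn't work for the following text: '[http://www.google.com](http://www.google.com)' gives: '[{{URL}}', ['http://www.google.com](http://www.google.com)'] I'm too rubbish at regex to figure out the problem :/ – combatdave

So, I assume you're trying to parse some Markdown text? – obsoleter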
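The Markdown-style input from the comment above makes a useful test case. Here is a rough sketch that reuses the regex from this answer as posted, which keeps square brackets out of the URL body. The helper name replace_urls is hypothetical and not part of the original answer.

```python
import re

# Regex as given in the answer: [ and ] are excluded from the URL body,
# so a URL stops at the closing bracket of a Markdown link label.
url_regex = re.compile(r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>\[\]]+|\(([^\s()<>\[\]]+|(\([^\s()<>\[\]]+\)))*\))+(?:\(([^\s()<>\[\]]+|(\([^\s()<>\[\]]+\)))*\)|[^\s`!(){};:'".,<>?\[\]]))""")


def replace_urls(text, placeholder="{{URL}}"):
    """Hypothetical helper: return (new_text, urls) using url_regex."""
    urls = []

    def _collect(match):
        urls.append(match.group(0))
        return placeholder

    return url_regex.sub(_collect, text), urls


markdown = "[http://www.google.com](http://www.google.com)"
new_text, urls = replace_urls(markdown)
print(new_text)
print(urls)
```

With the brackets excluded, the link label and the link target are picked up as two separate URLs instead of one run that spans the "](" in the middle.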

This is how I'm doing it:

urlpattern = re.compile(r"""(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))""")  

def urlify(value): 
    return urlpattern.sub(r'<a href="\1">\1</a>', value)

Usage:

>>> urlify('DuckDuckGo https://duckduckgo.com, the search engine that doesn\'t track you') 
'DuckDuckGo <a href="https://duckduckgo.com">https://duckduckgo.com</a>, the search engine that doesn\'t track you'

Regex copied from https://daringfireball.net/2010/07/improved_regex_for_matching_urls.

Source

2017-11-10 10:59:27 semente
