Regular expression to remove external links without removing their text


This is a <a href="https://www.test.com">test1</a>. This is <a href="/node/1">test2</a>. This is <a href="https://nct.com">test3</a>. This is a <a href="www.test.com">test4</a>. This is a <a href="http://test.com">test5</a>.

nct.com is my website. I do not want to remove links to it, or the text wrapped inside those tags. The same goes for /node/1.

The output I expect is:

This is a test1. This is <a href="/node/1">test2</a>. This is <a href="https://nct.com">test3</a>. This is a test4. This is a test5.

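For any external website such as test.com, I want to remove the a tag without removing the text wrapped inside the tag.

The regular expression I am using is: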

#<a [^>]*\bhref=(['"])http.?://((?<!mywebsite)[^'"])+\1 *.*?</a>#i

This removes the tag together with the text inside the tag.

Source

2017-10-11 Fazeela Abu Zohra

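Do nct.com and /node/1 need to be hard-coded in the regular expression, or should only URLs without http(s) be kept? – Wouter

I created a regular expression that does what I think you need: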

/<a [^>]*\bhref=(['"])((https?:\/\/|www.)((?!nct\.com).)(.*?))['"]*\b<\/a>/

test
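To keep the link text, the replacement has to put the captured inner text back instead of replacing the whole match with an empty string. Below is a minimal sketch in Python of how a pattern along these lines might be applied. The pattern is a simplified variant rather than the exact expression above, and the names `EXTERNAL_LINK` and `strip_external_links` are made up for illustration. It assumes nct.com is the only internal domain and that every external href starts with http://, https://, or www.

```python
import re

# Assumption: nct.com (with or without "www.") is the only internal host.
# Relative links such as /node/1 never match, because the href must start
# with http://, https:// or www.
EXTERNAL_LINK = re.compile(
    r'''<a\s[^>]*\bhref=(['"])(?:https?://|www\.)(?!(?:www\.)?nct\.com['"/:?#])[^'"]*\1[^>]*>(.*?)</a>''',
    re.IGNORECASE | re.DOTALL,
)

def strip_external_links(html):
    # Replace the whole <a>...</a> element with its inner text (group 2).
    return EXTERNAL_LINK.sub(r"\2", html)

s = ('This is a <a href="https://www.test.com">test1</a>. '
     'This is <a href="/node/1">test2</a>. '
     'This is <a href="https://nct.com">test3</a>. '
     'This is a <a href="www.test.com">test4</a>. '
     'This is a <a href="http://test.com">test5</a>.')
print(strip_external_links(s))
```

Group 1 holds the opening quote so that `\1` can require the same closing quote, and the negative lookahead sits directly after the scheme so the internal host is rejected before any of the URL is consumed.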

Source

2017-10-11 13:33:37 Wouter

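The regular expression does not work for me. I have updated the question; could you please help me solve it? –

@FazeelaAbuZohra I updated the regular expression (and the test URL). It is not the cleanest one, but it matches all the invalid URLs in the updated question. – Wouter

You can try this: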

import re

s = 'This is a <a href="https://www.test.com">test1</a>. This is <a href="/node/1">test2</a>. This is <a href="https://nct.com">test3</a>. This is a <a href="www.test.com">test4</a>. This is a <a href="http://test.com">test5</a>.'
# Split into sentences, then unwrap the link unless the sentence mentions nct.com or node.
final_list = [re.findall(r"^[a-zA-Z\s]+", i)[0] + re.findall(r'com">(.*?)</a>', i)[0] if "nct.com" not in i and "node" not in i else i for i in re.split(r"\.\s(?=This)", s)]

Output:

['This is a test1', 'This is <a href="/node/1">test2</a>', 'This is <a href="https://nct.com">test3</a>', 'This is a test4', 'This is a test5']
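The list comprehension above depends on every sentence starting with "This" and every external URL ending in `com"`. A less input-specific variant, offered only as a sketch, is to match each anchor with `re.sub` and let a callback decide whether to unwrap it. The names `INTERNAL_HOSTS`, `ANCHOR`, `is_external`, `unwrap_external`, and `strip_external_links` are hypothetical, and the set of internal hostnames is an assumption based on the question.

```python
import re
from urllib.parse import urlparse

# Assumption: these are the site's own hostnames.
INTERNAL_HOSTS = {"nct.com", "www.nct.com"}

# Group 2 is the href value, group 3 is the link text.
ANCHOR = re.compile(
    r'''<a\s[^>]*?\bhref=(['"])([^'"]*)\1[^>]*>(.*?)</a>''',
    re.IGNORECASE | re.DOTALL,
)

def is_external(href):
    # Scheme-less "www.example.com" hrefs are treated as hosts, as in the question.
    if href.lower().startswith("www."):
        href = "http://" + href
    host = urlparse(href).hostname  # None for relative links such as /node/1
    return host is not None and host not in INTERNAL_HOSTS

def unwrap_external(match):
    href, text = match.group(2), match.group(3)
    # Keep the whole element for internal links, keep only the text otherwise.
    return text if is_external(href) else match.group(0)

def strip_external_links(html):
    return ANCHOR.sub(unwrap_external, html)

s = ('This is a <a href="https://www.test.com">test1</a>. '
     'This is <a href="/node/1">test2</a>. '
     'This is <a href="https://nct.com">test3</a>. '
     'This is a <a href="www.test.com">test4</a>. '
     'This is a <a href="http://test.com">test5</a>.')
print(strip_external_links(s))
```

Moving the internal/external decision into `is_external` keeps the regular expression short and makes it easy to add further internal hosts. For arbitrary real-world HTML, an HTML parser would be more robust than a regular expression.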

Source

2017-10-21 21:41:29 Ajax1234
