Checking a URL (string)


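I'm having some trouble handling URLs as strings. I need to check whether a URL belongs to one of the domains in a whitelist, but the check fails. I'd like to know why, and whether my code is missing something.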

whitelist = [] 
whitelist_file = open(whitelist_file, 'r') 
url = whitelist_file.readline() 
for url in whitelist_file: 
    whitelist = whitelist + [str(url)] 
whitelist_file.close() 

test_file = open(test_file, 'r') 
url_to_check = test_file.readlines() 

for url in url_to_check: 
    for word in whitelist: 
     print(str(word), str(url), word in url) 
     print("-----")

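Here is the output printed by the code above (so you have a sample of the strings being checked). You can see that the check fails for a2a.eu: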

a2a.eu 
https://www.medgadget.com/2017/10/adenosine-a2a-receptor-antagonist-pipeline-insights-2017.html 
False 
----- 
ansa.it 
https://www.medgadget.com/2017/10/adenosine-a2a-receptor-antagonist-pipeline-insights-2017.html 
False 
----- 
atlantia.it 
https://www.medgadget.com/2017/10/adenosine-a2a-receptor-antagonist-pipeline-insights-2017.html 
False 
----- 
azimut-group.com 
https://www.medgadget.com/2017/10/adenosine-a2a-receptor-antagonist-pipeline-insights-2017.html 
False 
----- 
a2a.eu 
https://www.a2a.eu/en/2017-financial-calendar-a2a-spa 
False 
----- 
ansa.it 
https://www.a2a.eu/en/2017-financial-calendar-a2a-spa 
False 
----- 
atlantia.it 
https://www.a2a.eu/en/2017-financial-calendar-a2a-spa 
False 
----- 
azimut-group.com 
https://www.a2a.eu/en/2017-financial-calendar-a2a-spa 
False 
----- 
a2a.eu 
http://www.a2a.eu/en 
False 
----- 
ansa.it 
http://www.a2a.eu/en 
False 
----- 
atlantia.it 
http://www.a2a.eu/en 
False 
----- 
azimut-group.com 
http://www.a2a.eu/en 
False

Thanks

Source

2017-10-20 Fulviooo

The code you show does not seem to produce the output in your question.

You should use the urllib.parse module to extract the domain name from the URL. Then you can check each domain against your whitelist.
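As a rough sketch of that suggestion, the snippet below pulls the hostname out of each URL with urllib.parse.urlparse and compares it against the whitelist. The helper name is_whitelisted and the rule that subdomains such as www.a2a.eu count as a match are assumptions made for illustration; neither is stated in the question.

```python
from urllib.parse import urlparse


def is_whitelisted(url, whitelist):
    """Return True if the URL's host is a whitelisted domain or a subdomain of one.

    Hypothetical helper: the name and the matching rule are illustrative assumptions.
    """
    # strip() guards against the trailing newline left by reading lines from a file
    host = urlparse(url.strip()).hostname
    if host is None:  # e.g. the URL has no scheme, so no network location is found
        return False
    for domain in whitelist:
        domain = domain.strip().lower()
        if not domain:
            continue  # ignore blank whitelist lines
        # exact match, or a subdomain such as "www." + domain
        if host == domain or host.endswith("." + domain):
            return True
    return False


whitelist = ["a2a.eu", "ansa.it", "atlantia.it", "azimut-group.com"]
print(is_whitelisted("https://www.a2a.eu/en/2017-financial-calendar-a2a-spa\n", whitelist))
print(is_whitelisted("https://www.medgadget.com/2017/10/adenosine-a2a-receptor-antagonist-pipeline-insights-2017.html\n", whitelist))
```

Comparing hostnames rather than searching the whole URL also avoids accidental matches when a whitelisted name merely appears somewhere in the path.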

The check is always this one: print(..., word in url) – Fulviooo

The url on line 5 of your code still contains the trailing newline character. Call strip() on it and that should fix it:

whitelist = []
whitelist_file = open(whitelist_file, 'r')
url = whitelist_file.readline()  # reads and discards the first line of the file
for url in whitelist_file:
    # strip() removes the trailing newline from each whitelist entry
    whitelist = whitelist + [str(url.strip())]
whitelist_file.close()

test_file = open(test_file, 'r')
url_to_check = test_file.readlines()
test_file.close()

for url in url_to_check:
    for word in whitelist:
        print(str(word), str(url), word in url)
        print("-----")
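To see why the unstripped lines never match, it can help to print them with repr(), which makes the trailing newline visible. The two strings below are hard-coded stand-ins for lines read from the files (an assumption, since the actual files are not shown):

```python
# Stand-ins for lines exactly as they come out of the files, newline included
raw_word = "a2a.eu\n"
raw_url = "https://www.a2a.eu/en/2017-financial-calendar-a2a-spa\n"

# "a2a.eu\n" is not a substring of the URL: in the URL, "a2a.eu" is followed by "/"
unstripped_match = raw_word in raw_url
# once the newline is removed, the plain domain is found
stripped_match = raw_word.strip() in raw_url

print(repr(raw_word))      # repr() shows the hidden '\n'
print(unstripped_match)
print(stripped_match)
```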

Source

2017-10-20 15:31:52 pokiman

Very good. This is the solution. Thanks a lot! – Fulviooo

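First off, judging by your output, there are some cases where this check should produce True. That is going only by what was printed, though. I suspect that your url or word (in the whitelist) are not the string objects you think they are; try casting them to str in your print statement, like this: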

print(str(word), str(url), str(word) in str(url))

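Also, you seem to want to check only the domain. Have a look at urllib, https://docs.python.org/3/library/urllib.html, which lets you pull out just the domain part of the URL and check against that: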

from urllib.parse import urlparse
print(str(word), str(url), str(word) in urlparse(str(url)).hostname)

Source

2017-10-20 15:22:39

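@Marcel Zoll - Thanks. I have tried both suggestions, but still without success. It looks like the cause is something different. Could it be an encoding issue? I mean UTF-8, ANSI, ...? – Fulviooo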
