2017-10-20 117 views
-1

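I'm having some trouble handling URLs as strings. I need to check whether a URL belongs to one of the domains in a whitelist, but the check fails. I'd like to know why, and whether something is missing from my code.

Check URL (string)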

whitelist = [] 
whitelist_file = open(whitelist_file, 'r') 
url = whitelist_file.readline() 
for url in whitelist_file: 
    whitelist = whitelist + [str(url)] 
whitelist_file.close() 

test_file = open(test_file, 'r') 
url_to_check = test_file.readlines() 

for url in url_to_check: 
    for word in whitelist: 
        print(str(word), str(url), word in url)
        print("-----")

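This is the output printed by the code above (so you have samples of the strings being checked). You can see that the check fails for a2a.eu: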

a2a.eu 
https://www.medgadget.com/2017/10/adenosine-a2a-receptor-antagonist-pipeline-insights-2017.html 
False 
----- 
ansa.it 
https://www.medgadget.com/2017/10/adenosine-a2a-receptor-antagonist-pipeline-insights-2017.html 
False 
----- 
atlantia.it 
https://www.medgadget.com/2017/10/adenosine-a2a-receptor-antagonist-pipeline-insights-2017.html 
False 
----- 
azimut-group.com 
https://www.medgadget.com/2017/10/adenosine-a2a-receptor-antagonist-pipeline-insights-2017.html 
False 
----- 
a2a.eu 
https://www.a2a.eu/en/2017-financial-calendar-a2a-spa 
False 
----- 
ansa.it 
https://www.a2a.eu/en/2017-financial-calendar-a2a-spa 
False 
----- 
atlantia.it 
https://www.a2a.eu/en/2017-financial-calendar-a2a-spa 
False 
----- 
azimut-group.com 
https://www.a2a.eu/en/2017-financial-calendar-a2a-spa 
False 
----- 
a2a.eu 
http://www.a2a.eu/en 
False 
----- 
ansa.it 
http://www.a2a.eu/en 
False 
----- 
atlantia.it 
http://www.a2a.eu/en 
False 
----- 
azimut-group.com 
http://www.a2a.eu/en 
False 

Thanks

+0

The code you show does not seem to produce the output in your question. –

+0

You should use the urllib.parse module to extract the domain name from each URL. You can then check each domain against your whitelist. –

+0

The check is the last argument of the print call: print(..., word in url) – Fulviooo

Answers

0

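The URL read on line 5 of your code still ends with a newline character, so it can never be a substring of the URL being checked. Call strip() on it and that should fix it: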

whitelist = []
whitelist_file = open(whitelist_file, 'r')
url = whitelist_file.readline()  # as in the question: the first line is read and discarded
for url in whitelist_file:
    whitelist = whitelist + [str(url.strip())]  # strip() drops the trailing newline
whitelist_file.close()  # close only after the loop has finished

test_file = open(test_file, 'r')
url_to_check = test_file.readlines()

for url in url_to_check:
    for word in whitelist:
        print(str(word), str(url), word in url)
        print("-----")
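To see why the newline matters, here is a minimal sketch that reproduces the failure without any files. The helper name `load_whitelist` and the in-memory stand-in for the whitelist file are assumptions made for illustration; they are not part of the original code.

```python
import io

def load_whitelist(lines):
    """Return the non-empty lines with surrounding whitespace removed."""
    return [line.strip() for line in lines if line.strip()]

# Stand-in for the whitelist file; every line read from a file keeps its "\n".
fake_file = io.StringIO("a2a.eu\nansa.it\natlantia.it\n")
raw_lines = fake_file.readlines()

url = "http://www.a2a.eu/en"

# "a2a.eu\n" is not a substring of the URL, so the raw check fails.
raw_match = raw_lines[0] in url
# After stripping, "a2a.eu" is found inside the URL.
clean_match = load_whitelist(raw_lines)[0] in url

# repr() makes the otherwise invisible newline visible.
print(repr(raw_lines[0]), raw_match)
print(repr(load_whitelist(raw_lines)[0]), clean_match)
```

Printing with repr() instead of str() is a quick way to spot trailing whitespace of this kind.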
+0

Very good. That is the solution. Thanks a lot! – Fulviooo

0

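First of all, judging by your output, there are cases where this check should evaluate to True. That is only judging from the printed output, though. I suspect that your url or word (in the whitelist) are not the string objects you think they are; try casting both to str inside the check in your print statement: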

print(str(word), str(url), str(word) in str(url)) 

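Also, it seems you only want to check the domain. Have a look at urllib (https://docs.python.org/3/library/urllib.html), which lets you extract just the domain part of the URL and check against that: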

from urllib.parse import urlparse
print(str(word), str(url), str(word) in urlparse(str(url)).hostname)
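A substring test on the hostname would still accept a host such as nota2a.eu. As a rough sketch, assuming the whitelist holds bare domains such as a2a.eu and that subdomains such as www.a2a.eu should also pass, the comparison could be made on whole domain labels. The function name `is_whitelisted` is hypothetical.

```python
from urllib.parse import urlparse

def is_whitelisted(url, whitelist):
    """Return True if the URL's host equals, or is a subdomain of, a whitelisted domain."""
    # strip() removes the newline left over from reading the URL out of a file.
    host = urlparse(url.strip()).hostname  # None when the URL has no network location
    if host is None:
        return False
    host = host.lower()
    return any(host == domain or host.endswith("." + domain) for domain in whitelist)

whitelist = ["a2a.eu", "ansa.it"]
print(is_whitelisted("https://www.a2a.eu/en/2017-financial-calendar-a2a-spa\n", whitelist))
print(is_whitelisted("https://www.medgadget.com/2017/10/adenosine-a2a-receptor-antagonist-pipeline-insights-2017.html", whitelist))
```

Note that urlparse only fills in hostname when the URL carries a scheme (or starts with //), so a bare string like a2a.eu is rejected by this sketch.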
+0

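@Marcel Zoll - Thanks. I have tried both suggestions, but still without success. It looks like the cause is something different. Could it be an encoding issue? I mean UTF-8, ANSI, ...? – Fulviooo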
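On the encoding question, one possibility (not confirmed anywhere in this thread) is a byte order mark at the start of the whitelist file, which stays attached to the first entry as an invisible character and is not removed by strip(). The sketch below uses made-up file contents to show the effect and how the "utf-8-sig" codec avoids it.

```python
# Hypothetical file contents: a UTF-8 byte order mark followed by two domains.
raw_bytes = b"\xef\xbb\xbfa2a.eu\r\nansa.it\r\n"

# Decoding as plain UTF-8 keeps the BOM as the character "\ufeff".
plain = raw_bytes.decode("utf-8").splitlines()
# The "utf-8-sig" codec drops a leading BOM while decoding.
cleaned = raw_bytes.decode("utf-8-sig").splitlines()

url = "http://www.a2a.eu/en"
print([repr(entry) for entry in plain], plain[0] in url)
print([repr(entry) for entry in cleaned], cleaned[0] in url)
```

When reading from disk, the same codec can be passed as open(path, encoding="utf-8-sig"); it also reads files without a BOM.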