Python: Can urlparse be used to parse the domain name from a cgi-bin URL?

I have the following input string:

/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278

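I want to use urlparse() to get the domain, but the netloc attribute it returns is an empty string in this case.

How can I extract the domain (best case: without www)?

Desired output: some-super-domain.de

Please note: sometimes the input string above has no www!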

2014-09-03 nottinhill

What is your expected output? – 2014-09-03 09:07:34

some-super-domain.de – nottinhill 2014-09-03 09:22:22

www\.(.*?)\/

This works. See the demo:

http://regex101.com/r/pP3pN1/18

2014-09-03 09:08:47 vks

It works on your demo page, but not in my script, e.g.: domain = re.match('^www\.(.*?)\/$', line) – nottinhill 2014-09-03 09:23:05

@SirBenBenji You need to make the regex a raw string: 'domain = re.match(r'^www\.(.*?)\/', line)', and remove the '$' at the end. – Jerry 2014-09-03 09:26:03

This works: re.findall(r"www\.(.*?)\/", x) – vks 2014-09-03 09:27:31
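The comments above come down to the difference between re.match, which only matches at the very start of the string, and re.search or re.findall, which scan the whole string. A minimal Python 3 sketch, assuming the input string from the question (the variable name line is taken from the comment; anchored and scanned are names chosen for illustration):

```python
import re

line = ('/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de'
        '/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278')

# re.match anchors at position 0, and the string starts with "/cgi-bin", not "www."
anchored = re.match(r'www\.(.*?)/', line)

# re.search scans the whole string for the first occurrence
scanned = re.search(r'www\.(.*?)/', line)

print(anchored)          # None
print(scanned.group(1))  # some-super-domain.de
```

Because the input starts with the CGI path, any pattern that begins with ^www cannot match, whether or not it is written as a raw string.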

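I don't think urlparse gives you what you want. You can use this instead:

import re
# s is the input string from the question
m = re.search(r'(?<=www\.)[a-zA-Z\-]+\.[a-zA-Z]+', s)
print(m.group(0))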

Result:

some-super-domain.de

If you use urlparse, the result looks like this:

s = '/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278'

# Python 2; in Python 3 use: from urllib.parse import urlparse
from urlparse import urlparse
o = urlparse(s)
print(o)

Result:

ParseResult(scheme='', netloc='', path='/cgi-bin/ivw/CP/dbb_ug_sp', params='', query='r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278', fragment='')

So from this result you can reach the domain through o.query, but that is not what you want, because it contains extra characters:

>>> print(o.query)
r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278
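Since the target URL is carried in the r query parameter, one stdlib-only option is to let parse_qs decode the query string and then run urlparse a second time on the embedded URL. The sketch below is Python 3 (urllib.parse rather than the Python 2 urlparse module); the helper name embedded_domain and its param argument are assumptions made for illustration, not part of the original answer.

```python
from urllib.parse import urlparse, parse_qs

def embedded_domain(cgi_url, param='r'):
    """Return the host of the URL carried in query parameter `param`,
    without a leading 'www.', or None if it cannot be found."""
    query = urlparse(cgi_url).query
    values = parse_qs(query).get(param)   # parse_qs also percent-decodes the values
    if not values:
        return None
    host = urlparse(values[0]).hostname or ''
    if host.startswith('www.'):           # strip the prefix only when it is present
        host = host[4:]
    return host or None

s = ('/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de'
     '/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278')
print(embedded_domain(s))  # some-super-domain.de
```

This avoids searching for the literal text www, so it should also cover inputs where the prefix is missing.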

2014-09-03 09:25:44 Kasramvd

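Try this code; it works fine:

# Python 2 (in Python 3, urlparse and unquote live in urllib.parse)
from urlparse import urlparse
import urllib

url = '/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278'
url = url[url.find('http'):]                # keep everything from the embedded URL onward
url = urllib.unquote(url).decode('utf8')   # percent-decode, then decode the bytes as UTF-8
result = urlparse(url)
domain = result.netloc
if domain.startswith('www.'):               # str.find never returns None, so test the prefix directly
    domain = domain[4:]
print(domain)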

2014-09-03 09:26:40

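Awesome! Just what I was looking for. Any idea how to mitigate the case where this error occurs: return codecs.utf_8_decode(input, errors, True) UnicodeDecodeError: 'utf8' codec can't decode byte 0xe4 in position 50: invalid continuation byte – nottinhill 2014-09-03 10:01:12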
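One way to soften this error is to decode leniently. Byte 0xe4 is "ä" in Latin-1, which suggests (the thread does not confirm it) that some of the URLs were percent-encoded from Latin-1 rather than UTF-8. A minimal Python 3 sketch that tries UTF-8 first and falls back to Latin-1; the helper name unquote_lenient and the sample inputs are hypothetical:

```python
from urllib.parse import unquote_to_bytes

def unquote_lenient(quoted):
    """Percent-decode `quoted`, trying UTF-8 first and falling back to Latin-1."""
    raw = unquote_to_bytes(quoted)
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this decode cannot fail
        return raw.decode('latin-1')

print(unquote_lenient('http%3A//example.de/k%E4se'))     # "ä" encoded as Latin-1
print(unquote_lenient('http%3A//example.de/k%C3%A4se'))  # "ä" encoded as UTF-8
```

The fallback never raises, but it may produce the wrong characters if the data was in some other encoding, so treat it as a best-effort repair.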

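You can try the code below, which uses a variable-length lookbehind:

>>> import regex  # third-party module; the standard re module rejects variable-length lookbehind
>>> s = "/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278"
>>> m = regex.search(r'(?<=https?[^/]*//www\.)[^/]*', s).group()
>>> m
'some-super-domain.de'

With the standard re module, a fixed-length lookbehind is enough:

>>> import re
>>> m = re.search(r'(?<=www\.)[^/]*', s).group()
>>> m
'some-super-domain.de'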
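Both patterns above rely on www. being present, while the question notes that it is sometimes missing. A hedged variant using the standard re module makes the prefix optional by anchoring on the percent-encoded scheme separator instead. The pattern name HOST_RE and the shortened sample strings are written for illustration and are not part of the original answer:

```python
import re

# "http" or "https", the encoded colon (%3A) and "//", then an optional "www."
HOST_RE = re.compile(r'https?%3A//(?:www\.)?([^/%&]+)', re.IGNORECASE)

with_www = '/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2&d=1'
without_www = '/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//some-super-domain.de/forum/viewtopic.php%3Ff%3D2&d=1'

for s in (with_www, without_www):
    print(HOST_RE.search(s).group(1))   # some-super-domain.de in both cases
```

This assumes the embedded URL always starts with an encoded http or https scheme, as it does in the question's input.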

2014-09-03 09:40:12

import urlparse 
import urllib 

HTTP_PREFIX = 'http://' 
URI = '/cgi-bin/ivw/CP/dbb_ug_sp;?r=http%3A//www.some-super-domain.de/forum/viewtopic.php%3Ff%3D2%26t%3D18564%26start%3D75&d=76756.76050130278' 

# Unquote the HTTP quoted URI 
unquoted_uri = urllib.unquote(URI) 

# Parse the URI to get just the URL in the query 
queryurl = HTTP_PREFIX + unquoted_uri.split(HTTP_PREFIX)[-1] 

# Now you get the hostname you were looking for
# (netloc keeps the prefix here: 'www.some-super-domain.de')
parsed_hostname = urlparse.urlparse(queryurl).netloc

2014-09-03 23:16:47 Saish
